
Squeezing a Go function


Formal Metadata

Title: Squeezing a Go function
Number of Parts: 542
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract: Go is a very performant language, so normally you don't need to optimize it at all. But what happens when you do? This talk is a walk through some techniques and tools that you can use to squeeze those last drops from your functions. We are going to see tools like pprof or Go benchmarks and explore some of their results, and we are going to explore other topics like escape analysis, function inlining, and the garbage collector, all of which can affect the performance of your application.
Transcript: English (auto-generated)
Okay, thank you. So our next speaker is Jesus, who has spoken a few times in the Go Dev Room about everything that goes on deep within the language, and today he's going to talk to us about what's going on in functions. Round of applause.

Hello everybody, my name is Jesus, I'm a software engineer, and I'm going to talk about squeezing a Go function. So, what is optimization? I think it's important to understand that optimization is not just being faster or consuming less memory; it depends on your needs. Is fresh-squeezed juice better? Probably everybody will say yes, but it depends on whether you are looking for convenience or for something that lasts forever; in that case it may not be the best option. Optimizing is about knowing what you need and trying to address that.
It's also important to optimize at the right level. You can buy the best car, you can get an F1 car, and it's not going to be fast if this is the road. So always try to optimize at the upper level first, because the kinds of optimizations we are going to see in this talk are micro-optimizations, and they are probably not the first place you should be looking. Optimize what you need, when you need it. It's not about taking a Go function and optimizing it forever, trying to make it run super efficiently and scratching out every single nanosecond, because at some point the bottleneck is no longer there. You have to find the bottleneck, optimize where the bottleneck is, and then look again to check whether it is still there, because if it is not, you are over-optimizing that function without much gain. So just take that into consideration.
Optimizing is an iterative cycle, and you need to keep moving and keep searching for the bottleneck. And do not guess, please. I know everybody has instincts and all that, but guessing about performance is an awful idea, because so many things come into play that it's just impossible: the operating system, the compiler and its optimizations, maybe a noisy neighbor if you are in the cloud. All of that affects performance, so you are almost certainly not good at guessing about performance. Measure everything, and work with that data. That is actually what the talk after the next one is about, so I suggest going to that one too, because it will probably be very interesting.

So let's talk about benchmarks. The way you measure performance in micro-optimizations, or micro-benchmarks, is through Go benchmarks. Go benchmarks are part of the tooling that comes with Go, similar to the testing framework, but focused on benchmarking.
In this case we can see an example with two benchmarks, one for an MD5 sum and one for a SHA-256 sum. Each is just a function whose name starts with Benchmark and that receives a *testing.B argument, with a for loop inside; that is all the framework needs to give you the numbers. If I run this with go test -bench=., where the dot is a regular expression that matches everything (you can use that regular expression to execute only certain benchmarks), you can see that the MD5 sum takes roughly half the time per operation of the SHA-256 sum.
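A minimal sketch of what those two benchmarks could look like; the input buffer is an assumption, since the talk does not show the data being hashed:

    package hash_test

    import (
        "crypto/md5"
        "crypto/sha256"
        "testing"
    )

    // input is an assumption; any fixed buffer works for the comparison.
    var input = make([]byte, 1024)

    func BenchmarkMD5Sum(b *testing.B) {
        for i := 0; i < b.N; i++ {
            md5.Sum(input) // hash the same buffer b.N times
        }
    }

    func BenchmarkSHA256Sum(b *testing.B) {
        for i := 0; i < b.N; i++ {
            sha256.Sum256(input)
        }
    }

Running go test -bench=. then prints one ns/op figure per benchmark, which is where the "roughly half the time" comparison comes from.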
So, is that number all that matters? It depends: if you need more security, MD5 is probably not the best option, so again it comes back to your needs. Another interesting thing is allocations. One thing you may have heard about is counting allocations. Why is that important? When we talk about allocations here, we mean allocations in the heap, and every time we allocate something in the heap, that allocation introduces an overhead, and it also adds more pressure on the garbage collector. That's why counting allocations matters when you care about performance; if you are not worried about performance at that point, don't bother counting them, because you are not going to gain a massive amount from it. In the MD5 and SHA sums we have zero allocations, so that data is not very useful here. Let's try something else: let's open a file, thousands of times, and see how it goes. Now I can see that every single open of the file generates three allocations and consumes 120 bytes per operation. Interesting. So now you are measuring things.
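A sketch of that experiment; the file name is an assumption, and calling b.ReportAllocs (or running with -benchmem) is what makes the allocation and bytes-per-operation columns appear:

    package file_test

    import (
        "os"
        "testing"
    )

    func BenchmarkOpenFile(b *testing.B) {
        b.ReportAllocs() // same effect as running with -benchmem
        for i := 0; i < b.N; i++ {
            f, err := os.Open("testdata.txt") // assumed file name
            if err != nil {
                b.Fatal(err)
            }
            f.Close()
        }
    }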
You are measuring how much time an operation takes, how much of it goes into processing, how much goes into allocating, and how much memory is consumed. So let's talk about profiling. Normally you do the profiling first, to find your bottleneck, and then you write a benchmark to tune that bottleneck; here I'm playing with the fact that I already have the benchmark, and I'm going to do the profiling on top of it. So I'm going to execute the benchmark passing the -memprofile flag to generate the memory profile, and then I'm going to use the pprof tool.
The pprof tool lets me analyze that profile. In this case I'm just asking for text output, which shows me the top consumers, of memory in this case, and I can see that 84% of the memory goes to os.newFile. Okay, it's that function, but I need more information. I actually quite like this output, but if you don't, you can for example ask for SVG and get something much more visual, where it's kind of obvious where the bottleneck is; in this case, again, it's os.newFile. If I go back to the pprof tool and instead use list with a function name, I see where the memory is going line by line, and here I can see that line 127 of the file file_unix.go is consuming the memory. You see 74 megabytes there because it's aggregating all the allocations; each individual operation is still consuming only 120 bytes. The same works with a CPU profile. In this case most of the CPU consumption is in Syscall6. In SVG it's more scattered, the CPU time is spent in many more places, but Syscall6 is still the biggest one. If I list that, I see assembly code. You are probably not going to optimize that function any further, so this is probably not the place to look for optimizations. Anyway, this is an example of getting to the root cause during profiling.
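A plausible command sequence for the workflow just described; the profile file names are assumptions:

    go test -bench=OpenFile -memprofile=mem.out -cpuprofile=cpu.out
    go tool pprof -text mem.out            # top memory consumers as text
    go tool pprof -svg mem.out > mem.svg   # the same profile as a visual graph
    go tool pprof mem.out                  # interactive; then: list newFile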
Okay, this talk is going to be driven by examples. I'm going to show you some example optimizations, more to show you the process than the specific optimization. I hope you learn something along the way, but it's really about the process.
One of the things you can do is reduce CPU usage. This is a kind of silly example: you have a find function that takes a needle and a haystack, goes through the haystack searching for that needle, and gives you the result. This version loops over the whole slice no matter what. For the benchmark I generate a lot of strings and look for one that sits somewhere around the middle (not exactly the middle, but around there), and the benchmark says it takes nearly 300 nanoseconds. If I just return early, which is not a super smart optimization or anything like that, I save almost half of that time. The benchmark is doing something really silly and the numbers will vary depending on the input data, but it's an optimization that comes from simply doing less, and that is one of the best ways of optimizing things.
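A minimal sketch of the two versions; the names are assumptions:

    // find scans the whole haystack even after a match is found.
    func find(needle string, haystack []string) bool {
        found := false
        for _, s := range haystack {
            if s == needle {
                found = true
            }
        }
        return found
    }

    // findEarlyReturn stops as soon as the needle appears, doing roughly
    // half the work on average when matches are uniformly distributed.
    func findEarlyReturn(needle string, haystack []string) bool {
        for _, s := range haystack {
            if s == needle {
                return true
            }
        }
        return false
    }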
Next, reducing allocations. One of the classic examples of reducing allocations is when you are dealing with slices. This is a common way of constructing a slice: I create a slice, loop, and start appending things to it. I write a benchmark to check that, and it takes 39 allocations and around 41 megabytes per operation, which sounds like a lot. Now let's build the slice, but give it an initial size of a million up front, and in the loop just set each element. The final result is exactly the same, but now we have one allocation and far less memory consumed, and if you compare the times, it's around 800 microseconds instead of around 10 milliseconds, so a lot of CPU time saved too. But you can squeeze it even more: if you know the exact size at compile time, you can build an array, which is faster than any slice. With an array I'm doing zero heap allocations; it goes on the stack or into the binary somehow, but it's not consuming heap, and this time it's approximately 300 microseconds. So that's an interesting option when you know the size at compile time.
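A sketch of the three variants, assuming a million int elements; names and sizes are assumptions:

    const n = 1_000_000

    // buildAppend grows the slice as it goes: repeated reallocations
    // and copies as the backing array doubles.
    func buildAppend() []int {
        var s []int
        for i := 0; i < n; i++ {
            s = append(s, i)
        }
        return s
    }

    // buildPrealloc allocates the final size once up front.
    func buildPrealloc() []int {
        s := make([]int, n)
        for i := 0; i < n; i++ {
            s[i] = i
        }
        return s
    }

    // buildArray uses a size known at compile time: no heap allocation,
    // the array lives wherever the compiler decides (typically the stack).
    func buildArray() [n]int {
        var a [n]int
        for i := 0; i < n; i++ {
            a[i] = i
        }
        return a
    }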
Another thing is struct packing. If you are concerned about memory, say you build a struct with a boolean, a float, and an int32: the Go compiler is going to align the fields of the struct so it works efficiently with the CPU and all that, and in this case it does so by adding seven bytes of padding between the bool and the float and four bytes after the integer. If I build and initialize a slice of these, I'm allocating once, because that's what the slice does, and consuming around 24 megabytes per operation. If I just reorganize the struct, putting the float at the beginning, then the int32, and then the boolean, the compiler only needs to add three bytes of padding, so the whole struct is smaller in memory, and now it's 16 megabytes per operation. This kind of optimization is not going to save your day if you are just creating a few structs, but if you are creating millions of instances of a struct, it can be a significant amount of memory.
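A sketch of the two layouts, assuming 64-bit alignment; the type and field names are assumptions:

    // Padded matches the first ordering from the talk.
    type Padded struct {
        Flag  bool    // 1 byte + 7 padding bytes before the float
        Value float64 // 8 bytes
        Count int32   // 4 bytes + 4 trailing padding bytes
    } // unsafe.Sizeof(Padded{}) == 24

    // Packed reorders the fields from largest to smallest.
    type Packed struct {
        Value float64 // 8 bytes
        Count int32   // 4 bytes
        Flag  bool    // 1 byte + 3 trailing padding bytes
    } // unsafe.Sizeof(Packed{}) == 16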
Function inlining. Function inlining is something the Go compiler does for us: it takes a function and replaces any call to that function with the code the function generates. I'm going to show you a very dumb example: one version where I explicitly prevent inlining, and one version that gets inlined by default by the compiler because it's simple enough. When I benchmark them, I'm saving a whole nanosecond. So yeah, not a great optimization, to be honest; you probably don't care about that nanosecond. But we are going to see later why inlining is important, and it's not because of the nanosecond.
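A sketch of that comparison, assuming the non-inlined version is forced with the //go:noinline compiler directive:

    // add is trivial, so the compiler inlines it at every call site.
    func add(a, b int) int {
        return a + b
    }

    // addNoInline is the same function with inlining disabled, so each
    // call pays the function-call overhead (roughly the nanosecond
    // mentioned in the talk).
    //
    //go:noinline
    func addNoInline(a, b int) int {
        return a + b
    }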
Now let's talk about escape analysis. Escape analysis is another thing the compiler does for us: it analyzes our variables and decides when a variable escapes the context of the stack. When a value can no longer live in the stack frame and still be accessible where it needs to be accessible, it has to escape to the heap, and that is what generates allocations, and we have seen that allocations have implications. So let's see an example. Here is a non-inlined function that returns a pointer, which is going to generate an allocation, and another one that returns by value, which just copies the value to the stack of the caller and generates no allocations. The benchmark confirms it: the first version has one allocation of eight bytes, the second one has zero allocations, and the version with the allocation takes about 10 times longer. Ten times longer here is around 12 nanoseconds, which is not a lot, but everything adds up in the end, especially when you are calling things millions of times.
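A sketch of the two versions, with assumed names; you can inspect the compiler's escape decisions with go build -gcflags="-m":

    //go:noinline
    func newIntPointer() *int {
        v := 42
        return &v // v escapes to the heap: one 8-byte allocation
    }

    //go:noinline
    func newIntValue() int {
        v := 42
        return v // copied to the caller's stack: zero allocations
    }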
And here is an interesting thing: escape analysis plus inlining. Why? Well, imagine this situation: you have a struct, and a constructor function that instantiates that struct, returns a pointer, and does all the setup it needs. It generates three allocations and consumes 56 bytes per operation. What happens if I just move the logic of that initialization into a different function? If we do that, suddenly the constructor is simple enough to be inlined, and because it's inlined, the value no longer escapes, so that allocation is no longer needed. Something that simple lets you reduce the number of allocations for a type that has a constructor. So what I suggest is to keep your constructors as simple as possible, and if you have to do complex logic, do it in an initialization function, as long as that doesn't hurt readability. And look at the result: we now have two allocations and 32 bytes per operation, and you are saving 50 nanoseconds every time you instantiate it. That's a good chunk.
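A sketch of that refactor, with an assumed type and assumed field names; the point is that the slimmed-down constructor becomes simple enough to inline, so the struct itself can stay on the caller's stack while the inner allocations remain:

    type document struct {
        tags []string
        meta map[string]string
    }

    // newDocumentHeavy does all the setup inline: too complex to inline,
    // so the returned pointer always escapes (struct + slice + map = 3 allocations).
    func newDocumentHeavy() *document {
        d := &document{}
        d.tags = make([]string, 0, 8)
        d.meta = make(map[string]string)
        return d
    }

    // newDocument stays trivial and gets inlined; the slice and map
    // allocations remain, but the document itself may not escape.
    func newDocument() *document {
        d := &document{}
        d.init()
        return d
    }

    func (d *document) init() {
        d.tags = make([]string, 0, 8)
        d.meta = make(map[string]string)
    }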
Okay. Optimization is sometimes a matter of trade-offs. Sometimes you can just do less: fewer allocations, less CPU work, less garbage-collector pressure. But sometimes it's not about doing less; it's about consuming a different kind of resource. Maybe I care less about memory and more about CPU, or the other way around. Concurrency is one of the cases where you need to decide what you want to consume, because goroutines are really cheap, but they are not free at all.
So let's see an example with IO. These are two functions that I created. One is a fake IO function that simulates some IO with a time.Sleep. The other is a parallel fake IO that receives a number of goroutines and does basically the same, but distributes those 100 waits between the goroutines. And I built a benchmark using three different approaches: a serial one with no concurrency, one using as many goroutines as CPUs on my machine, and one using as many goroutines as tasks. Because this is IO, the result is that if I create one goroutine per job, the bytes per operation and the number of allocations spike, but the time consumed is way lower: the benchmark manages to execute this function 100 times with the one-goroutine-per-job approach, versus only 12 times with one goroutine per CPU. That's because this is IO, so let's see what happens with a CPU-bound job.
To simulate some CPU load I'm using MD5 sums, with more or less the same approach as the fake IO, and the benchmark takes exactly the same three approaches: the number of jobs, the number of CPUs, and no goroutines at all. And here it gets interesting, because with a CPU-bound workload, using the number of CPUs is what gives you the best efficiency. You can see that executing one goroutine per job is even slower than executing everything serially, and you get the worst of both worlds: plenty of allocations, plenty of memory consumption, plenty of time consumed, and you are not gaining anything. With one goroutine per CPU you consume more memory than the serial version, but you get better CPU performance, because you are spreading the job across all your physical CPUs, while the serial version does everything on a single core. So whenever you want to optimize using concurrency, you have to take into consideration what kind of workload you have. Is it a CPU workload? Is it an IO workload? Do you care about memory? Do you care about CPU? What do you care about?
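A sketch of the fake-IO pair; the names, the sleep duration, and the channel-based work distribution are assumptions:

    import (
        "sync"
        "time"
    )

    // fakeIO simulates 100 IO waits, one after another.
    func fakeIO() {
        for i := 0; i < 100; i++ {
            time.Sleep(100 * time.Microsecond)
        }
    }

    // fakeIOParallel distributes the same 100 waits across n goroutines.
    func fakeIOParallel(n int) {
        jobs := make(chan struct{}, 100)
        for i := 0; i < 100; i++ {
            jobs <- struct{}{}
        }
        close(jobs)

        var wg sync.WaitGroup
        for g := 0; g < n; g++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for range jobs {
                    time.Sleep(100 * time.Microsecond)
                }
            }()
        }
        wg.Wait()
    }

The three benchmark variants from the talk would then call fakeIO(), fakeIOParallel(runtime.NumCPU()), and fakeIOParallel(100) respectively.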
So that's the whole idea. What I want to get across is that all of this is about measuring everything: doing these benchmarks, doing these kinds of experiments, checking whether you are actually improving performance, and iterating on that. That's the main idea. I showed some examples of how you can improve things, and some of them can be applied as general practice, like keeping constructors small, or presizing slices when you know the size, and things like that.

Some references. Efficient Go is an O'Reilly book that is really, really interesting; if you are really interested in efficiency, Bartłomiej Płotka wrote that book, and he is actually giving a talk after the next one, so I'm sure it's going to be super interesting. The High Performance Go Workshop from Dave Cheney has a lot of documentation available and is really interesting too. The go-perfbook is also a good read. And the Ultimate Go course from Ardan Labs is an interesting course, because it gives you a lot of foundations and cares a lot about hardware sympathy and all that.

All the images are Creative Commons, so I put the references here. And thank you. That's it. Thank you.