
Profile, Optimize, Repeat: One Core Is All You Need™


Formal Metadata

Title
Profile, Optimize, Repeat: One Core Is All You Need™
Number of Parts
131
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Your data analysis pipeline works. Nice! Could it be faster? Probably. Do you need to parallelize? Not yet. Discover optimization steps that boost the performance of your data analysis pipeline on a single core, reducing time & costs. This walkthrough shows tools to identify bottlenecks via profiling, and strategies to mitigate those, demonstrating them in an example. To improve our memory and runtime performance we will use numpy, numba jit-ing and pybind11 extensions.
Transcript: English (auto-generated)
Welcome to our talk, Profile, Optimize, Repeat: One Core Is All You Need. My name is Valentin. I'm a software engineer and machine learning engineer at Scalable Minds. We are based in Potsdam, Germany. Our customers are research facilities that are interested in understanding the brain,
and we help them by reconstructing neurons from electron microscopy data. Hi, everyone. I'm Jonathan. I'm also a software and machine learning engineer, working at Aignostics. At Aignostics, we are transforming biomedical data into insights, for example, microscopy
images of cancer tissue as insights for pathologists and cancer researchers. You can find me on Twitter or have a look at my still slightly empty blog. So we both have a pattern in our jobs that we've seen from time to time, and we're
analyzing biomedical data, microscopy images. This analysis often happens in the cloud or on high performance clusters, and then we generate insights. So what happens is that the microscopes become faster, they become bigger, and we have
more microscopes, more clients, and at some point you will see, okay, my analysis pipeline that used to work really nicely is now too slow, or it runs out of memory. So we are out of time or out of memory, one of those two options, and it happened to us multiple times in our jobs.
So one first reaction that you might have is, okay, what do we do? We scale our cluster. We scale our cloud machines, more machines, beefier machines. Well, maybe that's not the first thing to do. Maybe we don't scale up or out yet, but instead, we just use one core.
One core is all you need to optimize your workloads. So once you look at optimizing your workload on a single core, you can get improvements, and those improvements will also pay off once you scale later. This is work that you can use anyways.
Parallelization needs resources: those beefier machines and additional machines cost money, and it costs time to set them up and maintain them. Also, some code might be hard or even impossible to parallelize and scale out. For example, in the classical MapReduce example, the reduce step might be a bit harder to parallelize.
So let's have a look at a practical example here. This example is based on a scikit-image tutorial. So we see on the left-hand side a skin tissue sample, and they train a random forest classifier to segment this skin into different regions.
And let's assume that at our hypothetical company we have a code base that visualizes these predictions so that they look like the right-hand side: we have a color for each class, and then we modulate the intensity based on the probability of the class. In our case, the predictions are already there. They are stored on disk, and this is what they look like: a 3D array, so a 2D image plus a channel dimension with four classes, one channel per class.
And then we want to map that to an image that has the same 2D shape but three channels, RGB, to get our colors. The code that we already have, which is working but not good enough, has two functions. The first one loads the data that is stored on disk.
We have a CSV file for every class, so we read one class after the other and then stick them together in the channel dimension to have this prepared for our next function that then maps these probabilities to our output visualization.
So first, we set up our class colors, one color for each class, and initialize our output list. We then go through every row in our input image with three nested for loops: the first one goes through every row, the second through every item in that row, and the third checks all the probabilities for that pixel. We need the class index, that is, which class has the highest probability, and the probability itself, so that we can select the right color and scale it based on the probability. We then append this to our row, append the rows back to our image, and this is our example. Throughout this talk we will start from this example, which takes 15 seconds for the input we showed before and about 800 megabytes of peak memory, and we will profile and optimize and iterate on that. We will show measures to take, and they will get more advanced throughout the talk, and eventually we end up with an algorithm that is 100 times faster and needs just a tenth of the peak memory. But first we need to profile and see where we should optimize.
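For reference, here is a minimal sketch of a baseline along the lines just described. It is a hedged reconstruction, not the speakers' actual code; file and function names are illustrative.

```python
import csv

CLASS_COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]  # one RGB color per class


def load_probabilities(csv_paths):
    # one CSV file per class, each holding a 2D grid of floats
    per_class = []
    for path in csv_paths:
        with open(path, newline="") as f:
            per_class.append([[float(value) for value in row] for row in csv.reader(f)])
    height, width = len(per_class[0]), len(per_class[0][0])
    # stack the per-class grids into a (height, width, classes) nested list
    return [
        [[grid[y][x] for grid in per_class] for x in range(width)]
        for y in range(height)
    ]


def color_code_probabilities(probabilities):
    image = []
    for row in probabilities:
        out_row = []
        for pixel in row:
            # find the most probable class and its probability for this pixel
            best_class, best_prob = 0, pixel[0]
            for class_index, prob in enumerate(pixel):
                if prob > best_prob:
                    best_class, best_prob = class_index, prob
            color = CLASS_COLORS[best_class]
            # modulate the class color by its probability
            out_row.append(tuple(int(channel * best_prob) for channel in color))
        image.append(out_row)
    return image
```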
So before we start our profiler, let's look at what profiling actually is. In profiling, we measure a piece of code to identify and analyze bottlenecks; here we want to look at speed and memory bottlenecks, and we analyze them so we can mitigate them later in our optimization steps. As a heads-up, this is different from benchmarking, for example, where you are interested in differences across time or machines.
The first thing we look at is time or CPU profiling. Here we are interested in where my code is slow, how slow it is, and ideally why it is slow, because that helps inform what we want to optimize later. For memory profiling, we look at high memory usage: where does it happen, how much memory do we use, and maybe also why. There's one important difference, in my opinion, between profilers. One type are instrumenting profilers, which are deterministic, because each function or code block that I want to profile is instrumented by the profiler. It adds hooks into my code that give the information I want, how long something takes and how much memory it uses, but those hooks also come with a potential overhead. Once you have those hooks and you also want to profile larger chunks around them, you start to get inaccuracies, because this overhead is then part of your program. You're mutating your program with the profiler while profiling it, which might give you false results or false impressions. On the other hand, you have sampling-based profilers, which are statistical: they peek into your program from time to time, look at its state, and extract the information they need. This doesn't come with the overhead of the instrumenting profilers, but they might miss very brief invocations in your code, or if the samples happen to land just before and just after a large, short-lived memory allocation, you might miss it entirely. But as long as the sampling rate is fine-grained enough or your code runs long enough, you get a good impression of your code. For timing, this should be fine because we're interested in timescales anyways.
For memory, you might miss some brief things here. In this talk, we look at Scalene. Scalene supports both CPU timing and memory profiling, as well as GPU profiling. It's sampling-based, which is nice especially for the timing profiles, and it has a visualization that is very useful for this talk; that's why we chose it. There are many more profilers you can use; here's a list I compiled last year on my blog.
Those four here are my personal favorites: py-spy, Memray, Scalene, and Austin. They all have different pros and cons, just some tools that I personally prefer. So let's see how we use Scalene in our example. First we have to import it into the program that we want to profile, and then we place
a start and a stop right before and after the code we are interested in. Then, to actually run the profile, we call Scalene with a few parameters: we want it switched off in the beginning so that profiling starts exactly where we want to profile the code, we want the CLI output, and for now we are interested in the CPU profile only. We then pass the name of our Python program, and if our program takes additional parameters, we can add them after the three dashes. We have multiple versions of our program and different data scales, which is why we pass these as arguments; the larger scales also make sure we have a decent sampling time.
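A rough sketch of what this setup could look like. Module and function names are illustrative, and the exact Scalene flag names may differ between Scalene versions.

```python
# profile_pipeline.py -- illustrative wrapper around the pipeline we want to profile
from scalene import scalene_profiler

from pipeline import load_probabilities, color_code_probabilities  # hypothetical module


def main():
    scalene_profiler.start()  # begin profiling right before the code of interest
    probabilities = load_probabilities("probabilities")
    image = color_code_probabilities(probabilities)
    scalene_profiler.stop()   # stop right after it
    return image


if __name__ == "__main__":
    main()

# Invoked roughly like this (profiling off at startup, CLI output, CPU only;
# arguments after "---" go to our own program):
#   scalene --off --cli --cpu profile_pipeline.py --- <our-program-args>
```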
So let's look at the profile of our first program. This is what the profile looks like; we already know that it takes about 15 seconds.
On the left you see four columns, and on the right you see our program. The four columns are the line numbers and then three time columns: Python, the time spent in Python; native, meaning we are in some precompiled library; and system, which often hints at IO, for
example. So we now see that we have basically two main parts where we spend time. The first part is at the top where we spend almost a quarter of the time loading the data, and then we have the second part where we index into our class colors and
then scale the RGB values. So we will look at the first, the loading first, because it's a low-hanging fruit. The CSV file we load is text-based, and our data is floats that can be nicely represented
as binary and can be easily parsed in a binary format. So there are good two arguments, so it's faster to parse, and it takes less storage on disk. There are many different binary formats we can choose from. We use HDF5 here in our example, and we have a link down below to a website that lists
or benchmarks different binary formats, if you're interested in checking that out later. So we now have to convince our colleagues to store the probabilities in an HDF5 file, and if we succeed, we can read them with HDF5 and profile again. We see that it's much faster; it doesn't even appear in our profile anymore, because Scalene drops percentages that are below one percent.
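A minimal sketch of what writing and reading such an HDF5 file could look like with h5py; the dataset name is an assumption for illustration.

```python
import h5py


def save_probabilities(path, probabilities):
    # write the whole (height, width, classes) float64 array as one dataset
    with h5py.File(path, "w") as f:
        f.create_dataset("probabilities", data=probabilities)


def load_probabilities(path):
    # read the dataset back into a NumPy array in one go
    with h5py.File(path, "r") as f:
        return f["probabilities"][:]
```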
So we still see that the majority of the time is spent in these nested loops, and the question now is what we can do about it. One answer is vectorization. Currently we are using plain Python objects to do our calculations. They are very powerful, but they come with an overhead, and if we can make the assumption that all our elements are float64, for example, and that they lie one after the other in memory, then we can write much more efficient code, and this is what NumPy does. NumPy has NumPy arrays, and they implement a variety of operations on them: you can add them, you can multiply them. They have more sophisticated functions like argmax and unique, and they support fancy indexing, which we'll see in a minute. NumPy has its core written in C, which is where it gets its speed from. So let's have a look at our code rewritten in NumPy.
So we still pass our probabilities to our function, and first we need the predicted classes, so we use argmax. You see that we do not have any for loops anymore: we're working on whole arrays at a time, and each function is applied to all elements of an array. So we first get the class indices, and then we still need the maximum probabilities, both times along the channel dimension. We have our class colors, and then we do our fancy indexing: the predicted classes are a 2D array with values between zero and three, which then index into our class colors array. This gives us what we need, a 3D array with our image and the colors in the channel dimension. We still need to scale that, so we use the probabilities we have, but they are in a different shape, they do not have the channel dimension, so we add that channel dimension with np.newaxis and then multiply the two. This is done for every element in RGB space, and then we still need to convert the result to uint8, because the probabilities are float64 and the resulting array will be float64, so we convert it to comply with our interface.
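A hedged sketch of such a vectorized version, again illustrative rather than the exact code from the slides:

```python
import numpy as np

CLASS_COLORS = np.array(
    [[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0]], dtype=np.uint8
)


def color_code_probabilities(probabilities):
    # probabilities: (height, width, classes) float64 array
    predicted_classes = np.argmax(probabilities, axis=-1)   # (H, W) class index per pixel
    max_probabilities = np.max(probabilities, axis=-1)       # (H, W) winning probability
    colors = CLASS_COLORS[predicted_classes]                 # fancy indexing -> (H, W, 3) uint8
    # add a channel axis so the (H, W) probabilities broadcast against (H, W, 3)
    scaled = colors * max_probabilities[..., np.newaxis]
    return scaled.astype(np.uint8)                           # back to uint8 for our interface
```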
I really like NumPy because it's much more concise, so let's see how it performs compared to what we had before. It's much faster, more than 10 times as fast, and when we look at the profile, we see that we spend time getting the class indices, then again getting the probabilities for each class, and then converting the result to uint8. So let's have a look at what we actually do memory-wise, because we are now operating
on arrays, and the intermediate steps are materialized, so we store them in memory. Our input is almost 160 megabytes, and our output should be about 15 megabytes, but when we look at the memory profile of our program, we see that it peaks at 350 megabytes,
and we will focus on the peak column on the left. We see that we load the data, 150 megabytes, and then we indeed materialize the different arrays: almost 40 megabytes for the predicted classes, but the peak comes when we convert our RGB image to float64 while scaling it with the probabilities, and this is then eight times the size we actually need. So let's try to optimize this further. We can look for exactly the data types we actually need.
So here's a list of different dtypes we might use. They use different amounts of memory and can store different ranges of values. Besides that, we can do manual garbage collection, so we can get rid of arrays that are still in scope but that we don't need anymore, and we can do something like manually unrolling our loops to avoid broadcasting. This is a bit unfortunate, because NumPy is very cool and broadcasting is nice, but it sometimes comes with an overhead, and I find this surprising.
So this is something you should also look out for. With this in mind, let's have another look at our program and rewrite it. In the first line, we still do our argmax, but we convert the result to uint8, because we only have four different classes and do not need int64, which is the default for argmax. Then we save ourselves some work and do not do the max calculation again, but just index with the predicted classes; this happens in the second line. After that we do not need the probabilities anymore, so we can get rid of them and mark them to be deleted for the garbage collector. We can save some more memory by converting the maximum probabilities to float32, because that's enough precision for what we need here. At the bottom, we see that we can unroll the loop and not do any broadcasting with the max probabilities anymore. If you remember, a few slides ago I needed to add np.newaxis to make this work, but now I can skip the broadcasting and multiply each RGB channel separately.
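A sketch of what this memory-leaner version could look like, under the same assumptions as before; np.take_along_axis stands in here for "indexing with the predicted classes".

```python
import numpy as np

CLASS_COLORS = np.array(
    [[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0]], dtype=np.uint8
)


def color_code_probabilities(probabilities):
    predicted = np.argmax(probabilities, axis=-1)              # int64 by default
    # reuse the argmax result instead of running max() over the array again
    max_probabilities = np.take_along_axis(
        probabilities, predicted[..., np.newaxis], axis=-1
    )[..., 0].astype(np.float32)                               # float32 precision is enough here
    predicted_classes = predicted.astype(np.uint8)             # four classes fit into uint8
    del probabilities, predicted                               # mark the big arrays for the GC
    colors = CLASS_COLORS[predicted_classes]                   # (H, W, 3) uint8
    out = np.empty(colors.shape, dtype=np.uint8)
    # "unrolled" per-channel multiply avoids one big broadcast temporary
    for channel in range(3):
        out[..., channel] = colors[..., channel] * max_probabilities
    return out
```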
So how does this compare? When we look at the peak column again, we see that the peak that added 100 megabytes of RAM is gone now, which is cool. And when we look at the runtime, we are now three times as fast. We still have bottlenecks, we always have bottlenecks, but they are now mostly in converting between the different types. So we traded off some runtime against RAM. And what else can we do about RAM? We started out with a simple for loop over our arrays and have now moved to NumPy, which
just uses the whole array at once and does all the computation steps array-wise, so you also hold the whole array for all your intermediate steps. Can we do something in between, avoiding the Python loops, because they are slower, but also not loading everything into memory at once? Why do we want to avoid that? Because memory access generally is slow. So even if you're optimizing not just for memory but also for performance, still look at your memory access patterns, because those are typically costly. In our case, we can map over each single element, and this means we can also just map over subparts of our array, so we do chunking. And actually, we can also load only a single subchunk of our data directly from the start, as in the bottom left of the slide, then transform it array-wise with NumPy as before, and then store this into our
large output array, just storing the different chunks there. So what does that look like in code? First, we need to change our loading routine. For that, we have a small helper, get_chunk_slices, which gives us slices along a single dimension for a specific chunk size; in our case we chose 1024. get_chunk_slices is a generator that yields the different slices. A slice object is just the object form of the start:stop notation you know from normal indexing, and it's an object you can pass around. Then we adapt the load_probabilities function we had before. We still open our file, but we don't load it all at once. Instead we have two nested for loops which give us the slices for the x and the y dimension,
so we have a 2D grid over our array giving us 2D chunks into that. Then we read from our HDF5 file in the bottom line just this chunk which works really nicely with HDF5 because you can directly index into the whole file and it's very efficient.
And then, also here, we have a generator that yields back the slice, because later when we have our result we still want to know where it came from to write it into the correct location, together with the array itself. Our color coding function we actually don't need to change at all; we are still operating on an array, just a much smaller one than before. But the function that combines all of this changes. Before, it was just load_probabilities followed by color_code_probabilities, that was it. Now we have a little bit more logic. In the third line, we initialize our output array with np.zeros, because we need a place to store our output data, and then iterate over the different chunks that we read from load_probabilities, so we get the slice and the actual array. Then we apply color_code_probabilities to this array, write the result back at the position of the chunk slice into our output array, and return that. That's it. It does the same thing as before, but now in a tiled fashion, without iterating over each element with Python for loops, still using NumPy, but not on the whole array at once.
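A sketch of the chunked version described here, reusing the color_code_probabilities function sketched above; the HDF5 dataset name and helper names are illustrative.

```python
import h5py
import numpy as np

CHUNK_SIZE = 1024


def get_chunk_slices(dim_size, chunk_size=CHUNK_SIZE):
    # yield slice objects covering [0, dim_size) in steps of chunk_size
    for start in range(0, dim_size, chunk_size):
        yield slice(start, min(start + chunk_size, dim_size))


def load_probability_chunks(path):
    with h5py.File(path, "r") as f:
        dataset = f["probabilities"]
        height, width, _ = dataset.shape
        for y_slice in get_chunk_slices(height):
            for x_slice in get_chunk_slices(width):
                # h5py only reads this 2D chunk from disk
                yield (y_slice, x_slice), dataset[y_slice, x_slice, :]


def color_code_file(path):
    with h5py.File(path, "r") as f:
        height, width, _ = f["probabilities"].shape
    colored = np.zeros((height, width, 3), dtype=np.uint8)  # full output, allocated once
    for (y_slice, x_slice), chunk in load_probability_chunks(path):
        # the unchanged NumPy routine works on each small chunk
        colored[y_slice, x_slice] = color_code_probabilities(chunk)
    return colored
```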
So how does this compare? Before in our example we had almost 2,000 by 2,500 pixels and now we have this chunk size of 1,000 squared. And we have 10% more run time. It's a bit slower because of the tiling.
But we save memory. And most importantly, our memory now doesn't grow as much with the input size. You can now have larger inputs; at the moment, only the output array in this example still scales linearly, and even that you might optimize away if that's fine for your use case. So here we can really save memory. And if you want to save more, you can use smaller chunk sizes; you might get some other trade-offs against the runtime, but now you can make that choice. So again, why couldn't we use the element-wise iteration? Mainly because this for loop in Python is slow in our case, because we
iterate over each element in Python land, which comes with an overhead, as I said before. Can we get rid of that overhead? The answer is yes, but not with plain Python; we can use Numba. So in this bottom part, what you see again is that we have our large input array. We don't do tiling for now, we skip that again. And we again transform each element, but not with the Python loop; we also try to do this transformation as efficiently as possible without using Python. We can do that with Numba jitting. What Numba does is take your Python code and compile it to machine code, like C or C++, with an efficient loop over your data. There are also alternatives: you could use Cython or mypyc, for example. There are different options; what we choose here is Numba.
Let's have a look at what our code looks like now. The first thing you see is the numba.njit decorator, which is a shorthand for jit with nopython mode; it enforces that we cannot fall back into Python object mode and really forces Numba to compile the whole function. What you see afterwards is the signature of our function: the output is a three-dimensional uint8 array, and the input is a three-dimensional float64 array. This is so that we do the jitting at import time. Jitting means we compile at runtime; you can also make Numba compile ahead of time, that's also possible. In this specific case, we compile at import time with this decorator, where we specify everything for the function. If you don't add the signature at the top, it would compile at first invocation and look at the actual arrays you pass in to compile your code. If you then look roughly through the code, we again have three for loops. This looks very much like our Python code from before, basically the same version again. You can also use NumPy primitives with Numba, but only a subset of them; in this case, this was much easier to write and it's still as efficient.
One trick we added here is, again, the loop unrolling, because in this case we also got a speedup from that. Apart from that, it's very similar to the Python code, except that instead of iterating over lists, we use x and y indices into our arrays rather than appending individual elements. When you first write this code, it might not look exactly like this; when you run it, Numba will probably throw an error. That happens to me all the time when I try things with Numba, because sometimes you forget, okay, this needs to be a uint8, and Numba cannot infer as many things as NumPy, for example, could. So this cast to np.uint8 you need to do manually, and there are similar things, but if you just read the Numba errors carefully enough, you figure out what to do. It's quite doable; the errors have improved massively over the last years, and I'm very happy with that.
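A hedged sketch of such a Numba-jitted version, with an explicit signature so compilation happens at import time; the exact code on the slides may differ.

```python
import numba
import numpy as np

CLASS_COLORS = np.array(
    [[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0]], dtype=np.uint8
)


@numba.njit(numba.uint8[:, :, :](numba.float64[:, :, :], numba.uint8[:, :]))
def color_code_probabilities(probabilities, class_colors):
    height, width, num_classes = probabilities.shape
    out = np.empty((height, width, 3), dtype=np.uint8)
    for y in range(height):
        for x in range(width):
            # find the most probable class for this pixel
            best_class = 0
            best_prob = probabilities[y, x, 0]
            for c in range(1, num_classes):
                if probabilities[y, x, c] > best_prob:
                    best_class = c
                    best_prob = probabilities[y, x, c]
            # unrolled per-channel write, scaled by the winning probability
            out[y, x, 0] = np.uint8(class_colors[best_class, 0] * best_prob)
            out[y, x, 1] = np.uint8(class_colors[best_class, 1] * best_prob)
            out[y, x, 2] = np.uint8(class_colors[best_class, 2] * best_prob)
    return out


# colored = color_code_probabilities(probabilities, CLASS_COLORS)
```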
Yep. So how does this look performance-wise? We compare now against our earlier NumPy version, the non-tiled one, because tiling, or chunking as we call it, is an orthogonal concept to what we do here; you can still do tiling on top of this, but we leave that out for now. With Numba, we have a speedup from 300 milliseconds to 122 milliseconds, and we use less memory. That's mainly because the intermediate memory is again not needed: we now process each single pixel from our input probabilities, doing all the calculations, getting the maximum probability, getting the index of the maximum, multiplying with the class color, in this innermost loop, which is now efficiently compiled by Numba, whereas before with NumPy we did all of these steps array-wise, which is less efficient overall.
OK, this is quite fast, this is very efficient. Can we do more? Yes, we can go more low level. There exists a library called pybind11, which allows you to write C++ extensions for Python, which is quite cool because it integrates very nicely with NumPy. Besides that, there are other advantages: you can use existing high-performance C++ code you have lying around and integrate that, and it gives you all the fine-grained control C++ has, so you can write your SIMD instructions. You can even do inline assembly, so there's basically no limit. An alternative is PyO3, where you can write your extensions in Rust. So what would a pybind11 extension look like for this example? Something like this. You can use NumPy types.
We return a NumPy array in the end, that's at the top left, and we take two NumPy arrays as arguments, the probabilities and the colors. In the first few lines, we convert these to raw pointers to get maximum efficiency.
And then again, we have our three for loops that are very similar to the Numba example. One difference is that here we do not do indexing, but work on the raw pointers and increment them one step at a time to keep the prefetcher happy. And then in the end, we have the loop unrolling again and write out into our colored image.
This is then returned back to Python land, and we can work with it as if it were just a NumPy array. The cool thing is that there's no copying if we use the right types: if we pass a float64 probability array to that function, there won't be any copies of the whole data, only of the structure itself. That's quite cool. And the question is, do we get any better than Numba? We do, a little bit. We still use the same amount of memory, and in this example the difference is not that large, but it's actually there.
So when we scale up the input, the relative improvement is still there. That's great. Let's wrap it up. So as you've seen, we did a deep dive here with you. And from this pipeline version, you can do more things.
You can have your inline assembly code if you want to; of course, there's no end to it. But we'll stop here for this talk and zoom out a bit from this example again. What we did is optimize our initial program, the Python for loops at 15 seconds and 800 megabytes of memory usage, down to our last version with pybind11, also combined with tiling again. There we get to 150 milliseconds and 80 megabytes for the memory usage, as we had with the tiling example before.
And for all of those insights, we use profiling. So if you have slow code and you have a hunch, maybe it could be faster, always turn on your profiler and have a look. Because just the gut feeling often is wrong. That's at least my experience. Maybe have a better gut.
Then the first technique we used was to look at IO. Often you can change your inputs if you have them in-house and have control over them; that is very valuable, and you can try to optimize this directly. Then we used vectorization with NumPy to change our Python for loops into NumPy array operations. Then we had different techniques to reduce memory usage: the NumPy tricks with the dtypes, deleting arrays by hand, and the broadcasting trick we did as well. Then we had tiling, or chunking, as a technique. And in the end, we went more low level with Numba and pybind11. This you can repeat until you say, okay, I don't have more time now, now it's actually time to maybe scale out or scale up: we have spent time optimizing our code and are confident that maybe the time we would spend now doesn't pay off anymore. Then we can look into threading and multiprocessing, and if you want to scale beyond one node, there's joblib, Spark, Dask, Ray; there was a great talk about that yesterday, and there are other talks about this. But still, one core is all you need to do profile, optimize, repeat, and in our opinion,
that's the first step you should do before scaling. But still, scaling is a good idea afterwards. Here are some resources that we find nice: my blog again, then pythonspeed.com has a great data science overview, also covering profiling and different speedup techniques, and calmcode.io has a nice overview of different libraries, also including Numba, for example, or pyinstrument as a profiler. So thank you very much. You can find our code and slides here on GitHub, this is also the QR code for it. We'd be very happy to take a selfie with you, and we look forward to your questions.
Thank you very much. Nice. Nice talk.
I mean, you managed to speed up even your talk time. So now we have 10 minutes left for Q&A. There's a microphone in the front, and I already see somebody there, so please go ahead. Thank you very much for the talk. Maybe just an idea: you showed in the profiling picture that after the first optimization, most of the time is actually spent in the system. My suggestion is mainly about memory allocation. You can prove that with a different profiler, for example Linux perf, which shows that what you're actually doing is mostly memory allocation. And maybe that's just because you use these temporary NumPy arrays and you only use functions, so you create them and destroy them on each function exit. Maybe just an idea to use a class to cache this array and reuse it every time you process the image. This would also reduce the memory allocation, and as I remember from your profile picture, that was the most time-consuming part. Just an idea; maybe you already used this trick.
Yeah. I mean, this example is made up just for this talk, so we also spent some time optimizing it but didn't go into all the details we could with all the different versions. But it's a very good idea. And since you mentioned perf: I think since Python 3.12 it's possible to use the Linux profiler perf on top of Python, which gives you nice insights into your Python code, plus the low-level invocations that you can now also trace nicely with perf on top of Python. Thank you. Thank you for the talk. Thank you.
Hi. Thank you for the talk. My question is about the profilers. You mentioned there were two types of profilers, the sampling one and then there was another type. Do you sometimes have to use both types to see if you get consistent results, or if one
of them missed something? Yeah, especially if you care about timing versus memory. In this specific case, we used sampling because it's useful for timing. Also, whenever you saw memory versus CPU plots, we profiled separately, one run only for the memory and one run only for the times, because if you do both at once, the memory profiling itself also takes some time, which skews your runtime results. In this case, the memory numbers are somewhat okay, but they vary slightly because the profiler is sampling-based, and as we said before, you might miss memory allocations.
For this example, it's good enough. But I would recommend for memory to use a deterministic instrumenting profiler usually, and for time a sampling-based, if you have enough time in sampling. Because for sampling, you just care about time, and it gives you a good representation
of where you spend time. That's just by the nature of sampling on the time axis. And for memory, you rather care about the exact amounts and exact allocations and the different points, but you shouldn't combine those two usually. Thank you.
Yeah. Just for this talk specifically, we also wanted to have a single output from one single profiler and not mix things up too much. That's why we stuck to Scalene here. Thank you for the talk. Do I understand correctly about the efficient IO: you basically converted your input format to some binary format so that you get some
benefits? And how did it help? Because when you convert it to some binary format, you need to deserialize it into memory, and that's probably taking a little bit more time. I don't know if I understood correctly why you wouldn't need more time with a binary format, because it's compressed, actually. For example, you mentioned some compressed binary formats: if it's plain JSON or XML, you just parse it, it's not compressed, but if it's Protobuf or other binary formats, when you parse it you first need to unpack it, uncompress it, and then load it into memory, and I assume that will take more time.
Yeah, true. If you compress, then you're absolutely right. In this case, we did not do any compression, but the data is still smaller if you do not use a text-based encoding of your floats. However, here again, you're trading decompression CPU cycles against access to disk or network or wherever your data is stored, which is way slower. So in many cases, if you have efficient compression, and especially efficient decompression, then loading smaller data and decompressing it can still be way faster than having the uncompressed version. There are nice examples of this. For Zarr, I know you can choose your compressor, for example, and play with an uncompressed version versus different compressors, and there are nice papers about exactly this question. Thank you. Maybe one example for JSON loading: I believe you can have libraries that expect a specific
schema for your JSON, and when parsing with this, you have more information about your input. It might not be a binary format on disk, but loading it with this schema information up front is still faster than just parsing it as usual.
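One library in that spirit is msgspec; a minimal, hedged sketch of schema-driven JSON decoding, where the schema itself is invented for illustration:

```python
import msgspec


class Prediction(msgspec.Struct):
    # the expected schema is declared up front, so decoding can specialize for it
    class_index: int
    probability: float


decoder = msgspec.json.Decoder(list[Prediction])
predictions = decoder.decode(b'[{"class_index": 2, "probability": 0.87}]')
```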
Hi, thank you for the talk. I have one question about the data that you're processing. Basically the example was like numeric data, and I'm wondering if you have any advice on like parsing and like processing strings, how to deal with this problem, speed up
this type of code. To be honest, we don't have much experience with strings ourselves. The only recommendations I know are basically to check the encoding: if you can use an efficient encoding, if you don't need full Unicode support, that helps up front, also to optimize your data format on disk. If you have a maximum length, those are typical techniques you can use; it should be a very conscious decision between arbitrary-length strings and fixed-length strings. For example, Zarr I think has support for both, if I'm not mistaken.
And there you can play with those parameters, but beyond that, I'm not an expert myself, to be honest. Thank you. Good question. Okay, so are there any more questions? We still have five minutes. Otherwise you will get five more minutes, grab another coffee or something, and give
it up again for Valentin and Jonathan.