
How we used vectorization for 1000x Python speedups (no C or Spark needed!)


Formal Metadata

Title: How we used vectorization for 1000x Python speedups (no C or Spark needed!)
Number of Parts: 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Want to make all your code faster? With matrices, library knowledge, and a sprinkle of creativity, you can consistently speed up multivariate Python functions by 1000x! Modal optimization requires simple axioms - arithmetic, checking a case, calling the right sklearn function, and so on. When that’s not sufficient, three core tricks - converting conditional logic to set theory, stacking vectors into a matrix, and shaping data to match library expectations - cover the vast majority of real world cases (90% of the ~400 functions we vectorized). At Bloomberg, ESG (Environmental, Social, and Governance) Scores require complex computations on large data sets. Time-series computations are fundamental for Governance - one UDF infers board support for a policy from prior cyclical votes and other time offset inputs. By rewriting the pandas backfill as a series of reductions on a 4-tensor, we reduced the runtime from 45 minutes to 10 milliseconds! Analogously, due to real world complexity, finance UDFs can end up with 100+ if/else branches in one function. With a mix of De Morgan’s laws and sparse matrix representations, we simplified the cases and achieved 1000x+ speedups. We’ll conclude with a quick overview of cutting-edge tools, and hope you’ll leave with a concrete strategy for vectorizing financial models!
Transcript: English (auto-generated)
I'll take this one. Hello, everyone. Great to see you. Quick intros. I am Justine Wiesner. I'm a team lead of the ESG Quant Engineering team at Bloomberg. ESG stands for Environmental, Social, and Governance. So we produce ESG scores for companies.
I'm Canadian, proud Canada stan, and I love math and running and other things. And I'm Jonathan Hollenbeck. My background's in CS and math. So I also love math, statistics, and just bringing things to production and making them fast.
All right, so here's a quick overview of what we're going to talk about today. First, I'm going to try to persuade you why you should pay attention for the remaining 40 minutes, why we think this is important. Then we're really going to jump into the building blocks. Our intent is for this to be not only motivational and inspirational, but also a very practical guide. So all of the code we've included in our slides
is complete working code. So for those of you in the room, maybe you don't need to read line by line, but our intent was that someone using these slides after the fact or the video on YouTube can really use this to implement it. So we'll go through some building blocks. Then Jonathan will take us below the surface and explain why this actually is so much faster,
how it works. Then we'll take a step back, big picture, and think about the mental model, how you should approach thinking about these problems in a vectorized manner. And then we'll come to a few more complex applications of the building blocks we shared, both for financial data and in our department in ESG.
All right, so first off, what is vectorization? In the simplest sense, vectorization is transforming row-based operations into column-based ones. So if you have tabular data, instead of operating on individual rows, you operate on whole columns at once, in a vectorized sense.
And for our talk, we're mostly leveraging the NumPy library and also some pandas in our applications. So next, why should you care about vectorized code? TLDR, it's going to make your code faster. Probably for most of you, I could just jump to the next slide.
But maybe there are some people who are thinking: you know what, my code is fast enough for my needs. Maybe you own some offline process that runs once a day or once a month, and waiting several hours or however long it takes is fine. Why should I spend any time making code that's already good enough even faster? I have three main reasons why you should consider this.
One, even if your offline process is maybe sufficiently fast for your production needs, having code that can run a lot faster hugely improves both the developer experience and your ability to test. So in our experience, we are able to make our code sufficiently fast
that we were able to test it on the whole universe on every PR. So it just hugely improves the ability to test, and we're not waiting minutes for CI/CD every time someone pushes a commit. It was really paradigm-shifting from a developer experience and testing perspective. We were even able to test on data that we hadn't seen,
so we ended up searching the input space. It's also more energy efficient. We work for ESG. Probably a lot of people here care about sustainability and green computing. So if you're using less resources than you need to, that's better for the environment, better for everyone. And third, from an accessibility perspective,
more efficient code is more accessible. We're at Bloomberg, a big company, and if our code isn't fast enough, we can just buy more cores, buy more machines, split it up, add more parallelism, throw money at the problem. But if you're contributing to open source,
you have to think about who's going to be using your code. Maybe they don't have those resources and so writing more efficient code is more accessible. Our results, we were able to vectorize the 241 functions in our model for an overall speed up of actually 12,000 times on about 32 million data points.
You may be thinking 12,000 times, like what does that mean, put that in context. That's like the difference between the speed of an airplane and a sloth. Or that's like running a full marathon in under two seconds versus like four hours. So yeah, really was paradigm shifting for us and hopefully can be for you too.
All right, now we're jumping into the building blocks. We'll start with something super easy: addition. You have some x, a NumPy array or pandas Series, and you want to add some constant inc to every element of your array. The slow way to do that would be to iterate through, loop through every single element, add inc to each one individually, and then return the output. Of course, most of you know the vectorized way to do this: both NumPy arrays and Series have this functionality built in. You can just use plus and it will add it to all the elements. On 10,000 records (we ran this 20 times), the slow way takes two hundredths of a second on a NumPy array. Interestingly, it's 10 times faster on a pandas Series to do it the slow way. But the vectorized way is four tenths of a millisecond. So that's a 60x speedup from the slowest to the fastest way.
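To make that concrete, here's a minimal sketch of this first building block (the variable names are ours, not from the slides):

```python
import numpy as np

def add_slow(x, inc):
    # Loop over every element and build the output one value at a time.
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = x[i] + inc
    return out

def add_fast(x, inc):
    # NumPy broadcasts the scalar across the whole array in one call.
    return x + inc

x = np.arange(10_000)
assert np.array_equal(add_slow(x, 5), add_fast(x, 5))
```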
All right, a slightly more complicated example. Here we want to take a log. The slow way would be to iterate through every element in your list and use the math.log function. But NumPy has logs built in, so there's a built-in function here that can operate on your whole array
or Series all at once. And this is 28 times faster on only 10,000 records.
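A minimal sketch of that building block (our variable names):

```python
import math
import numpy as np

def log_slow(x):
    # Call math.log once per element.
    return np.array([math.log(v) for v in x])

def log_fast(x):
    # np.log is a ufunc: one call handles the whole array or Series.
    return np.log(x)

x = np.arange(1, 10_001, dtype=float)
assert np.allclose(log_slow(x), log_fast(x))
```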
If-else. In this situation, x could be a row with a couple of columns. Here we're looking at order data with two columns: discount code and subtotal. People who have the subway discount code get a 15% discount, people with the Friday code get a 10% discount, and everyone else gets 5% off. And here's how we would apply it the slow way: big X is a whole DataFrame, and we're using pandas' .apply
with a lambda function where we apply this inner function on every row. The vectorized way to do this has a couple of steps. First, we want to select the indices for each of the different cases that we want to apply. So the indices that we want to apply the subway discount to are all of the rows
that have the discount code equal subway, similar for Friday. And then the indices, IDX none, that's like for the default case. So we're using the numpy.any function. So not any of subway or Friday means that you're going to fall into that third else case from the left. And then once we have set our indices,
we're now going to apply the appropriate discount to each of those index sets. What's important here is that you have the index both on the left side and the right side of the assignment, otherwise you're going to have mismatched indices. And this is 14.6 times faster on just 10,000 records.
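A sketch of the vectorized if-else, with hypothetical column names (the slides' exact code may differ):

```python
import numpy as np
import pandas as pd

# Hypothetical order data mirroring the discount example.
X = pd.DataFrame({
    "discount_code": np.random.choice(["subway", "friday", ""], size=10_000),
    "subtotal": np.random.uniform(10, 100, size=10_000),
})

# 1. Select the indices for each case.
idx_subway = X["discount_code"] == "subway"
idx_friday = X["discount_code"] == "friday"
idx_none = ~np.any([idx_subway, idx_friday], axis=0)  # the default/else case

# 2. Apply the right discount to each index set; the same mask appears
#    on both sides of the assignment, so the rows stay aligned.
X.loc[idx_subway, "total"] = X.loc[idx_subway, "subtotal"] * 0.85
X.loc[idx_friday, "total"] = X.loc[idx_friday, "subtotal"] * 0.90
X.loc[idx_none, "total"] = X.loc[idx_none, "subtotal"] * 0.95
```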
All right, so we looked at examples with 10,000 records for those first three building blocks, but you may be wondering, those speedups that we saw, is that constant? Like how does that scale with data size? So let's take a look. For the first example, for basic operations, when we scale up the data 10x from 10,000
to 100,000 records, our performance speedups go up by more than 5x. So on 100,000 records, you're approaching 400 times faster. For the math functions, not quite as impressive, but a 10x in the data is still nearly double,
like over one and a half times speedups. And similar for the if-else, 10x on the data size, and you're more than double on the performance. So it's pretty easy to see here. You're seeing numbers like 15. You're like, wait, I thought they said 1,000x, like as your data grows, so do your performance speedups.
Okay, so I want to jump into a few more building blocks. So for set membership, an example I would give is that you're doing data permissioning. So imagine you have some data products, and you need to check that for a given data point, a set of users is in a set of data products
that can actually access this data point. For this, you can use np.isin: I have this vector x and this list y, and I want to check, for each element of x, whether it's in the list. And this gives you pretty nice performance improvements. We tested this on 10 million rows, very nice.
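A minimal np.isin sketch (product ids and sizes are made up):

```python
import numpy as np

# Hypothetical permissioning check: which rows reference an allowed product?
x = np.random.randint(0, 1000, size=10_000_000)  # product id per data point
allowed = [1, 7, 42, 99, 512]                    # products this user can access

mask = np.isin(x, allowed)  # boolean array, one flag per element of x
print(mask.sum(), "rows are permissioned")
```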
Something that's interesting to note in this case is that if you use a set, you actually don't get much performance improvement, which is maybe counterintuitive to what you'd expect. So whenever you're vectorizing things, it's extremely important to play around in your IPython notebook, time it, and make sure that whatever change you're making is actually an improvement. The next building block I want to talk about is called broadcasting.
Broadcasting means that if you have two arrays that don't have the same shape, NumPy is able to broadcast the smaller array into the shape of the other one, and it does this for you automatically. So for example, if you multiplied a three-by-three array by a three-by-one array, like matrix by vector,
then NumPy will figure out that the three-by-one should be treated as three-by-three and apply them to each other, like an element-wise multiplication. So that's very cool, and important to know just to be able to read the code in the first place. But it also leads to performance improvements. You can see, again, we tested on a million rows; things are getting much, much faster.
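A tiny sketch of that three-by-three times three-by-one broadcast:

```python
import numpy as np

A = np.arange(9.0).reshape(3, 3)     # 3x3 matrix
v = np.array([[1.0], [2.0], [3.0]])  # 3x1 column vector

# NumPy replicates the 3x1 vector across the columns of A, so this is
# an element-wise multiply, not a matrix product.
B = A * v
print(B)
```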
And then finally, I want to talk about the most important building block, which is a library. In this case, it's a simple one, just scipy.interpolate, and that's going to map our vector from one range, zero to one, to another range, zero to 10. And if you call pretty much any scientific function in a for loop, doing one example at a time, and in some functions, in some cases,
that's the only thing you can do, especially if you're working with an old statistics library, for example, then it's going to be very slow, because there's a lot of overhead involved in actually invoking that function and doing all of the input validation required for that function to be confident before dropping into C. But if you use the vectorized version of it, and most of SciPy and tons of the Python libraries
you might use are already vectorized for you, and would even work on a matrix, not just a vector, then it only has to do that initialization and cleanup one time, and it'll be much, much faster. So generally speaking, your goal is to get things into a library and call that library, okay? And this is over a 1,000x performance improvement we saw here.
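A sketch of the loop-versus-vectorized library call. The talk doesn't name the exact function, so this uses scipy's interp1d to map [0, 1] onto [0, 10]:

```python
import numpy as np
from scipy import interpolate

f = interpolate.interp1d([0.0, 1.0], [0.0, 10.0])  # maps [0, 1] -> [0, 10]
x = np.random.rand(1_000_000)

# Slow: invoke the function once per element, paying the call overhead
# and input validation every single time (truncated to 1,000 elements
# here because the full loop crawls).
slow = np.array([f(v) for v in x[:1000]])

# Fast: one call validates the input once and drops into compiled code.
fast = f(x)
assert np.allclose(slow, fast[:1000])
```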
So you might be wondering, okay, how does all of this work under the hood? The answer is it's library stacked on top of library. Pandas is going to call NumPy, and then NumPy is going to call LAPACK or BLAS, and LAPACK itself is also mostly calling BLAS. LAPACK, the Linear Algebra PACKage, does things like factorizations, or might solve a small linear system
like minimizing a two-norm, and we'll come back to that two-norm example later. BLAS, which stands for Basic Linear Algebra Subprograms, actually does the very low-level things: matrix-matrix, matrix-vector, vector-vector operations. So for example, the broadcasting example I showed earlier would probably be a BLAS operation doing a matrix times a vector, three-by-three times three-by-one.
And I'll note that for a specific domain, you may want to look outside of this. For machine learning, like inference, you're getting into CUDA wrappers and such, and that's going to be 200 times faster than this even. Same thing for search: you probably want to look at a vector database or Elasticsearch or what have you. Okay, and then how do these work? There are optimization tricks, the most famous of which is SIMD,
that is, single instruction, multiple data. That means that on a single processor (by that I don't mean multiprocessing, I mean a single processor) you're able to apply one instruction to multiple pieces of data in parallel: one addition happening on four values at the same time on your CPU. And that of course gives you a speedup basically proportional to the number of values you process at once. And over here you can see a picture of before SIMD and after SIMD, from a cool blog post by Stefan Hodges. On the top, it's not vectorized and things are all piecemeal in a for loop; after it's vectorized, things are nicely stacked together and fit very well into how the computer
architecture actually works. The next trick is called blocking, which means you break things into smaller matrices before you do your operation. This helps you be more hardware-aligned, for example with the page size, and also helps you leverage caching: some block might be reused multiple times and can stay in L2 cache or something like that, and that generally gives you another 10x speedup.
Then finally there are threads, not too surprising, but if you want to re-implement any of this stuff, you have to worry about that too. The same author made a cool graph of what each of these things added, so you can see SIMD and blocking and threads all really matter, but they still end up being worse than the library that already exists,
so you don't really need to bother, but it's kind of cool to know what goes into this and how difficult it might be to replicate. All right, so Jonathan explained how this works for NumPy under the hood, but we wanted to give a non-NumPy example, just another creative way to think about what kind of algorithm or middleware could be implemented
to provide these sorts of speed ups. We decided to choose the Fourier transform, so of course the Fourier transform is fundamental for signal processing and the discrete Fourier transform has a ton of computer science applications, primarily in the processing of image and audio data.
The discrete Fourier transform in its raw form is this very large matrix, which unfortunately is O(N²), and image and audio data are really huge, so this becomes pretty problematic. The fast Fourier transform is an algorithm
that leverages the inherent symmetry in this original description and decomposes a Fourier transform of dimension 2N into a much more efficient product of two matrices. In the first one, the I's are identity matrices and the D's are diagonal matrices,
and the second one is composed of zero blocks and Fourier transforms of half the size, multiplied by a permutation matrix. And this algorithm is O(N log N).
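In matrix form, the radix-2 factorization being described is (our notation):

$$
F_{2N} \;=\; \begin{bmatrix} I_N & D_N \\ I_N & -D_N \end{bmatrix}
\begin{bmatrix} F_N & 0 \\ 0 & F_N \end{bmatrix} P,
$$

where $I_N$ is the identity, $D_N = \mathrm{diag}(1, \omega, \dots, \omega^{N-1})$ with $\omega = e^{-\pi i/N}$, and $P$ is the even-odd permutation matrix. Applying it recursively gives the O(N log N) cost.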
And so the fast Fourier transform was a huge speedup in the industry that really accelerated the processing of image and audio data. So this is another example of what that middle layer could look like. For NumPy, it's LAPACK and BLAS, but in other applications, it could be an algorithm like this. And since all of these are actually matrix operations, we thought, oh, I wonder how much faster
the fast Fourier transform algo is versus if we just use NumPy vectorization on the discrete Fourier transform. So we took a pretty small, like just 96 by 96 discrete Fourier transform, and the vectorized form is quite a bit faster,
so from 0.18 seconds to 0.025 seconds, but the fast Fourier transform is still nearly three times faster than that, or 21 times faster than the old-school unvectorized discrete Fourier transform.
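A sketch of that comparison (we check agreement rather than timing here):

```python
import numpy as np

N = 96
n = np.arange(N)
# Dense DFT matrix: O(N^2) to apply, but applied as one vectorized matmul.
F = np.exp(-2j * np.pi * np.outer(n, n) / N)

x = np.random.rand(N)
dft = F @ x          # vectorized discrete Fourier transform
fft = np.fft.fft(x)  # O(N log N) fast Fourier transform
assert np.allclose(dft, fft)
```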
Okay, so now we've told you what we did and how it works, but I want to talk a bit more about the why. So taking a step back, a lot of what we're showing you today, and what we will show you in the rest of the talk, is really: what do you do if the vectorized tool you're working with, like pandas or Polars, doesn't have exactly what you need? How do you do that last mile of glue code? I kind of compare it to
going down a really long hill in the rain: you might need to use your manual transmission to engine-brake. You're not doing that all the time, but in that situation it's really nice to know exactly what you need to do. Okay, so jumping into the mental model: set theory is the most important concept you have to think about. Set theory here means you have your if-else conditions and you're going to convert them into unions and intersections.
So we have pictured here De Morgan's law of intersection. Your goals here are to streamline the conditional logic, which means, first of all, you really want to get rid of all the parentheses. By that I mean that after you convert your if-else tree, you'll often have a very nested structure, and you want to flatten as much of it as you can. And second, you want to have one operation type, all ANDs or all ORs. This lets you evaluate everything together with one np.all or np.any call. To give a quick example (here 'not' is the NOT operator): if you had a condition like not (A or B) and C, you could apply De Morgan's law of intersection and turn it into (not A) and (not B) and C. Then you lose the parentheses, you have only ANDs, you make one np.all call, and you're super happy, super performant, super memory efficient. The only real drawback, and this is a major drawback, is that this is O(K·N), where K is the number of conditions, which means that if you do this 100 times it actually becomes slower and slower, and your original if-else tree might actually be better.
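A minimal sketch of that rewrite (the boolean arrays are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.random(1_000_000) < 0.5 for _ in range(3))

# not (A or B) and C, rewritten via De Morgan as (not A) and (not B) and C
nested = ~(A | B) & C
flat = np.all(np.stack([~A, ~B, C]), axis=0)  # one op type: all ANDs
assert np.array_equal(nested, flat)
```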
The next part of the mental model I want to talk about is how you actually do this in production. A lot of the time we don't get things as a NumPy array, which is very sad, and I wish the world weren't like that. A lot of the time we might get something from a database (although that can sometimes go straight into pandas), or JSON, something in a document, or even just over RPC. So in this case,
we may see the inefficiency come in at the serialization step, and we actually have to vectorize that manipulation of the dictionary. So the focus can often be on: okay, how do I get into NumPy? That can be where the slowness is. And similarly, how do I get out of NumPy? I've found that the orjson library is really, really great for this. It usually gives me exactly what I need
for a REST API or something like that. And often the simple way is fast enough, so definitely don't be afraid to try that; you really only need to dig into it if it becomes your bottleneck. Okay, and then for the rest of the mental model: once you get things into NumPy, into vectors, you want to stack them into a matrix and make sure it fits your library.
As a cool example that we actually ran into, one of our modules was stuck on Python 3.7 due to a dependency, so the function in SciPy we assumed we had wasn't actually there, and we had to re-implement it. So what did we do? We did all the calculations needed to get things back into a matrix. Here you see w, the weights, and x pictured.
And then we found the function that was there in 3.7, which is just a plain p-norm. We were trying to use a weighted p-norm, which is like a weighted measure of distance, versus the p-norm, which isn't weighted. So we did the calculations up front, plugged the result into the original, unweighted one, got what we wanted, moved on, and we're happy with our lives.
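The talk doesn't show the exact routine, but the general trick looks like this: fold the weights into the data, then call the unweighted norm.

```python
import numpy as np

def weighted_pnorm(x, w, p):
    # (sum_i w_i |x_i|^p)^(1/p) == || w^(1/p) * x ||_p,
    # so pre-scaling lets an unweighted p-norm routine do the work.
    return np.linalg.norm(w ** (1.0 / p) * x, ord=p)

x = np.array([3.0, -4.0, 1.0])
w = np.array([0.5, 0.25, 0.25])
assert np.isclose(weighted_pnorm(x, w, 2),
                  np.sqrt(np.sum(w * np.abs(x) ** 2)))
```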
Okay, and then I want to give an example putting all of this mental model together. To quickly recap: you apply set theory, then you serialize into vectors, then stack into a matrix, and then you call your library. So, calling back to earlier: you have a really big if-else tree, and this is really common in finance because of regulations and country-specific logic and things like that.
You have tons of rules. You're going to run into trouble because in-place assignment is slow (this is actually deprecated in pandas 3, if I understand correctly), and just calculating all of these different if conditions as a vector is pretty inefficient. So a nice solution to this is kind of like quantization, where you have a sparse matrix C. If you look up at the matrix there,
this matrix C is gonna represent your cases. So the rows would be like one case per row. So row zero might be a bad case, like a nan case. And then row one might be a great case. We got 10, we're super good. And then the third row would be the last case. We got a one, that's okay. It's not a nan, but it's not the best outcome. And then the columns would be like
the cases for each data point. So the first column would be which cases apply for the first data point. And the cool thing about this is that even if cases co-occur, so both one and three happen, no score allowed and also we got a one, we're able to apply waterfall logic with np.argmax to always select the first case that occurs.
The way that works is that np.argmax gives you the index, along the axis you're looking at, of the highest value. So if you have only ones and zeros in your scipy.sparse matrix, it's going to give you the first one, which is exactly what we want here. And this is very memory efficient, because our 100 cases all became booleans instead of integers or int64s or whatever
they were going to be by default. And it's also much, much faster, and it scales pretty well in all the cases I've tried.
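A dense sketch of the waterfall (the talk uses scipy.sparse for the real thing; the case names and scores here are made up):

```python
import numpy as np

# Case matrix C: one row per case in priority order, one column per
# data point; True where that case applies.
C = np.array([
    [True,  False, False, False],  # case 0: bad case (NaN, no score allowed)
    [False, True,  False, True],   # case 1: great case, scored a 10
    [True,  True,  True,  True],   # case 2: fallback case, scored a 1
])

# np.argmax returns the FIRST maximum along the axis, so with 0/1 data
# it selects the highest-priority case that applies to each column.
winning_case = np.argmax(C, axis=0)   # -> [0 1 2 1]

# One vectorized lookup assigns every data point its score.
scores = np.array([np.nan, 10.0, 1.0])[winning_case]
```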
All right, now to another example: time series, which are prolific in financial data, so we interact with this a lot. Here's a little snippet at the top of what the data looks like, so you can visualize it. We have three columns: ID, which could be a company or a country or an identity; our time, which is in the year column; and then some value. And we want to get a rolling sum over some number n of years, and actually we don't just want,
like if n is five, we don't just want the five-year average, but we want the two, three, four, and five-year average for all integers like up to our n. So maybe the most intuitive and slowest way, well maybe not slowest, but the slow way to do this would be with a bunch of nested loops.
So first we loop over the window sizes, so n is two, three, four, five if we're starting with a trailing window of five. Then we have to loop over all the companies, because we don't want to mix data from different companies. Then, for each of the years within a given company, we're going to loop over those, and then for each of those, we need to go back
like from 2000, well if we're doing five, like from 2005, we go back to 2000, plus 2001, plus 2002, et cetera, to produce our final array. So we have four nested for loops here, and as your data grows, of course, this is gonna be really, really slow. The vectorized way to do this
is going to leverage two NumPy functions in particular: np.roll and np.nansum. So first I'm going to pause and look at what those functions do, and then we'll go back and see how we use them. np.roll is like musical chairs with your data. So here we have a two-dimensional array,
so we have like array with an array. So each inner array is the data for a single company, and then like the next company is underneath it. So you can choose over what axis you do this. We wanna do it over the zeroth axis, so within a single company. And it's musical chairs, so if n is two,
then this zeroth data point goes to the second index. But it also means that the data from the end loops back to the beginning. This is really important to understand because this is probably not what you want if you're doing financial data and time series. You don't want your 1998 average to be using data from 2024 and 2023.
So it's important to understand we have a step where we drop those early years. Of course we can't compute like the rolling average or the moving sum for the first couple years. So this is what our data looks like after we've rolled it.
And then we're going to apply nan sum. Same idea, but instead of musical chairs, we are summing over a specific axis. And in our application, so we have these, when n equals three, we have basically three different rolled tables,
like 2D arrays with our data set. So one is shifted by one, one is shifted by two, one is shifted by three. And then for each individual time period, we sum over this third axis. And it's nan sum because the nans are treated as zeros. That's where the nan comes in the name.
Okay, so how did we use this? So we've taken these four nested for loops and converted them just into one for loop. And again, that one for loop is not over the dimension of our data, it's just over the dimension of n. And that's only because we want to compute this like t minus two, three, four, five. If we didn't care about that, we'd have no for loops here on the right.
So that's essentially what we do. We roll it, we stack it, then we apply the nan sum. The next couple lines are some special edge case handling for how we wanted to handle our nans. Nan sum treats nans as zeros, but in our application, if everything was nan, we wanted the result to be nans, we apply that.
We reassign it to our original data with the column that we want. And then again, there are those years to drop, dropping garbage results from the early years. The average assigned to 1999 and 1998 would be using data from the future, which in our application, and probably your application, is not what you want.
Huge performance gains here. On 1,200 records, over 100x speedup, the right versus the left. And on 12,000 records, so a 10x data size, we're approaching 500 times faster. So again, on really big data sets, it becomes easy to see how we're getting improvements into the thousands.
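Here's a small sketch of the np.roll/np.nansum idea, not Bloomberg's actual code; we assume one row per company and one column per year:

```python
import numpy as np

def trailing_sums(values, n):
    # Stack n shifted copies of the data: shift by 0, 1, ..., n-1 years.
    shifted = np.stack([np.roll(values, k, axis=1) for k in range(n)])
    out = np.nansum(shifted, axis=0)   # NaNs are treated as zeros
    # If every contributing value was NaN, we want NaN, not 0.
    out[np.all(np.isnan(shifted), axis=0)] = np.nan
    # Drop the early years: np.roll wraps data from the end back to the
    # beginning, which would leak the future into the past.
    out[:, :n - 1] = np.nan
    return out

vals = np.array([[1.0, 2.0, np.nan, 4.0],
                 [10.0, 20.0, 30.0, 40.0]])
print(trailing_sums(vals, 3))  # 3-year trailing sums per company
```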
Okay, another application. We have a lot of this kind of thing in ESG. I couldn't share real code, of course, but this is a generalization of the kind of thing we do a lot. So we have some field that we want to score, but it's only relevant for certain countries.
And so if the country is in some set of countries, then we're going to try to score it. And we have some waterfall logic. First, we're going to try to score field one, and if that's not NaN, then we'll apply this SciPy function, scipy.stats.norm.cdf, with some parameters.
But if field one is NaN, then we're going to waterfall to field two. And if the country isn't in the list of countries to be scored, then the row just gets NaN. So, similar to the example with the order data and the subway and Friday discount codes, this is the inner function, which we would then apply to the DataFrame.
In this application I showed how you can just loop through the rows, or you could use .apply like before. Okay, so for the vectorized way to do this, we're going to use three building blocks. First, we use set theory to identify the indices.
So even though we had three different cases (NaN, field one, field two), we only need two different idx conditions, because we can use set theory and NOTs to generate the third one. Second, leverage the vectorized version of a library function. Here, this SciPy function, scipy.stats.norm.cdf,
already works out of the box on vectors, so there's no need to loop through these elements individually. And third, we apply the logic to the selected indices. So again, very important, whatever your indices are is on the left and on the right, so what you're selecting and what you're assigning to, so you don't get mismatches.
And on 10,000 records, this is also approaching a 5,000x speedup. And these are not even very complicated applications, right? We have a couple of nested if-elses, and already we're getting huge, huge speedups on not even that much data.
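A sketch of that generalization, with hypothetical field and country names:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

df = pd.DataFrame({
    "country": np.random.choice(["US", "DE", "JP", "BR"], size=10_000),
    "field1": np.where(np.random.rand(10_000) < 0.3, np.nan,
                       np.random.randn(10_000)),
    "field2": np.random.randn(10_000),
})
scored_countries = ["US", "DE"]

# 1. Set theory: two masks suffice; NOTs generate the third case.
idx_country = df["country"].isin(scored_countries).to_numpy()
idx_f1 = idx_country & df["field1"].notna().to_numpy()
idx_f2 = idx_country & ~idx_f1

# 2. The library function is already vectorized over whole arrays.
score = np.full(len(df), np.nan)  # default case: not scored -> NaN
score[idx_f1] = norm.cdf(df.loc[idx_f1, "field1"], loc=0.0, scale=1.0)
score[idx_f2] = norm.cdf(df.loc[idx_f2, "field2"], loc=0.0, scale=1.0)

# 3. The same mask sits on both sides of each assignment, so nothing
#    gets misaligned.
df["score"] = score
```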
Okay, so I want to talk about a couple more building blocks; I put these at the end just so as not to drop too many on you at once. This one is my favorite: it's called hash lookup. One use case you may have experienced is a web app, where there are some OK responses, some 404s, and other types of things.
And these all cause actions to occur, and each of these actions has associated logs and maybe other associated events. And after all this happens, maybe you want to read these logs and do an audit that is decoupled from the application code to make sure that certain things that are important are being handled correctly. Maybe you're doing event sourcing
and you want to do a playback that's quick and rapidly fills out the tables the way you would expect them to have happened on the real data. So if you're in this situation, often hashing is very important. You're doing a lot of: okay, look up this thing, and then this applies to this code. In that case, you can use this hash lookup trick,
and it always works as long as your hash is bounded. So that means like if you have numbers in your hash and they don't exceed some threshold, and you know they don't exceed some threshold, so there's like a max and a min, then you're good. If you have emails, it doesn't work, sorry. So you can convert that hash into an array, and this array would look something like in our example,
if two maps to four, then the index at two would have the value four, and this is like our map, and then you'd have your data in a matrix as normal or like in an array, and then you actually apply the array as the index of the map, and that's the opposite of what we usually do where we're indexing to the data. Just want to be really clear about that.
We actually index into the map with our data, and as you can see in this picture, that essentially applies the hash and gets the correct result. This is super fast, over a 100x performance gain on something that's already extremely optimized in Python, which I find pretty exciting. I've definitely used this a lot for log analysis.
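A minimal sketch of the trick (the status codes and action ids are made up):

```python
import numpy as np

# Hypothetical bounded "hash": HTTP status code -> action id.
status_to_action = {200: 0, 301: 1, 404: 2, 500: 3}

# Convert the dict into a lookup array indexed by key. This works
# because the keys are bounded integers with a known max.
lookup = np.full(max(status_to_action) + 1, -1)  # -1 = unknown status
for status, action in status_to_action.items():
    lookup[status] = action

codes = np.random.choice([200, 301, 404, 500], size=10_000_000)

# Index into the MAP with the DATA, the reverse of the usual direction;
# this applies the whole hash lookup as one vectorized operation.
actions = lookup[codes]
```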
The next building block I want to talk about is sort. Now, you can just sort, totally fine; NumPy has sort. But I want to talk to you about argsort. There's something called a MultiIndex in pandas that is often used because it's helpful to keep all of your indexes together. If you call some function, it's generally going to know:
oh, there's a MultiIndex, so I don't want that. Let me drop it, remember it, go into NumPy, do a bunch of things for you automatically to make sure the index stays together, get my result, and then stitch that back into the MultiIndex and return it. Delegating all that can be very expensive, often a 10 to 15 times performance decrease depending on what you're doing.
An alternative is to use argsort, and this lets you have no MultiIndex. You just call np.argsort on your array, and you'll get back an index. You can use that index into your data to get a sorted copy of the data, and then you can apply a function to that sorted data. In this picture,
it's all the circles turning into stars, but now they're in the sorted order. And then once you're done, you actually need to return the result to someone in the original order. So you can do argsort on the argsort index and apply that to the sorted data, and it will recover the original order. It's argsort of argsort, and you get back to where you started. I think this trick is super cool, and it helps you avoid using the pandas MultiIndex,
or gives you very strong control over how you're going to drop into a C function that requires data to be sorted for you to use it directly. It's also really helpful if you're using Polars, don't have access to that, and are responsible for your own index. And again, the performance improvement here is really good.
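The round trip in a few lines (the running-min stand-in is our choice of a function that needs sorted input):

```python
import numpy as np

x = np.array([30.0, 10.0, 20.0, 40.0])

order = np.argsort(x)     # index that sorts the data
sorted_x = x[order]       # data in sorted order

# Apply some function that requires sorted input.
result_sorted = np.minimum.accumulate(sorted_x)

# argsort of the argsort is the inverse permutation:
inverse = np.argsort(order)
result = result_sorted[inverse]   # back in the original row order
```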
So, okay, in conclusion, I want to circle back on everything and quickly recap. The first thing you need to do is know about the building blocks. For pretty much every CS concept you've heard of before, there's some kind of building block: the simple operations, the pluses and minuses, the logs, the if-else. There's np.all and np.any. There's np.roll for time series. For hashing, there's this trick where you index into a map with your data. For set membership, you have np.isin. For sorting, you have argsort and sort. And then there are, of course, all these amazing Python libraries, like CVXPY and SciPy, that will support pretty much whatever you could want. And the way that this works
is that NumPy is using libraries like LAPACK, the Linear Algebra PACKage, and BLAS, Basic Linear Algebra Subprograms, which implement all of these things for you. That's one of the things that's so amazing about NumPy: over decades, experts have optimized these things for you, thought about all the ways they could be used and all the edge cases and bugs, and delivered something that is very performant.
Then your mental model is going to be, first of all, set theory. Just to recap: we want to get rid of parentheses, and we want to break things down into all ANDs or all ORs wherever we can. And then we can also use this conditional waterfall trick to completely get rid of parts of the branch,
because we can rely on argmax to filter out the bottom of the waterfall for us. And then the next part of the mental model is that when we get data from somewhere, if it's not already in a NumPy array, we've got to serialize it to a vector, then stack the vectors, and make sure that what we're doing fits our library. Getting that serialization step right is really important; not doing so is what causes
a lot of applications I see to be IO-bound when they don't really have to be. And then finally, for finance applications specifically, you saw that we have this conditional logic, and also this cool time-series case with np.roll, where you can implement a very complex function that maybe seems like it would be difficult to vectorize
and get a huge performance benefit. So putting all this stuff together, we were able to get over a 12,000x speedup in our use case, which we're very excited about. And I hope that something in this talk has excited you, or made you think of some code that you could vectorize, or something that wasn't in Polars that you can now optimize,
and that some of what you've seen today will help you to navigate through the winding mountains of serialization. All right, so that's it from us. If you're interested in learning more about Tech at Bloomberg, there's a couple of links there. If you're interested in more about what we do on the ESG team,
Bloomberg.com slash ESG has more information or you can come find one of us to chat. Thank you. Thank you very much. We do have some time for some questions. So if you'd like to ask a question,
please come to the microphone. I also have a little bit of a thank you for you. I know we always have a no cookie policy, but here's a cookie for you. Oh, thank you very much. Thank you. Here's a cookie for you. Any questions? We have a question over here. First, thanks for the talk and for the insights. I'm curious that you mentioned the speed up factors.
Did you also research how they develop as you use more data? Sometimes you compared two amounts of data, say 10,000 items and 1 million, and normally the speedup factor was even bigger, but did you research how that develops, linearly or exponentially or anything like that?
In these specific examples, we just kind of took two points. We didn't dig more, but you absolutely could. Like, I think it depends case by case. Some are clearly like much more than linear, some sub-linear. We didn't do more research on that, but it would be very interesting. I think, yeah, a good rule of thumb
is that vectorization gives you benefits up to about one gigabyte. That's kind of what I usually look at, but you can always test for yourself. Thanks. All right, we have another question. Thanks for the talk. The speed ups you mentioned are really impressive. Something I'm, I don't know if it's a concern,
but it's more a question to see if you've seen this as a problem. When you apply all of these clever transformations to make your code more vectorizable, I get the feeling that the final code loses the initial intent or the programmer intent. It's more readable for NumPy and for the CPU,
but less readable for other engineers. So have you found any issues, or lack of issues, maintaining and modifying this kind of code? So there definitely can be a readability issue when you convert the code. That's one of the reasons that breaking things down with set theory and having clear documentation
for how things work now is so important. And I think that as people gain experience, I find the code ends up looking pretty similar and pretty readable. But it's definitely a fair concern and something that is fair to worry about in your code. Yeah, yeah, it's definitely something that came up when we were talking about vectorizing this code base.
I don't know that it's less readable inherently, but because most people are trained in the old way of thinking, until you become fluent in the new way it can read as less clear. In our situation, actually, it was very helpful because it was so much faster. It enabled us to run very extensive accuracy testing,
and so that's what we did. We had our whole universe of data, and every time we vectorized one of our 240-odd functions, we ran old versus new to ensure that we hadn't introduced any new behavior. And this is really important, because we definitely didn't get all the edge cases right on the first run. It was iterative.
So yeah, always need to be mindful. Anytime you're doing a refactor, but the benefit of this is because it's so much faster, it's easy to do testing. Thank you so much. I'll also say on the readability, we did this last year on one of our models and then used it as an intern project earlier this year
to apply to one of our other models, and our intern was hugely successful. So it might seem mathy and scary, but even an intern can do it. No, our interns were great. They were really bright, but it's not so advanced that you need to have an advanced math degree to do this. Also, we have cool internships.
Question from the other microphone. Thank you. Thanks for the talk. I was wondering whether you compared against compiling your code, with Numba or JAX; whether there are any observations. That's a great question. Actually, there are a lot of cases with if-else where Numba will be faster,
but there were only a couple cases where it helped, so we were a little concerned to introduce something new into the ecosystem. We have time for another question here. I was just curious to know that since you moved to vectorization, there will be problems with in-place updates, right?
Since now the whole row needs, now you're kind of processing in columns rather than rows. Did you have a problem with data frames not able to fit in memory and now you have to update the whole table together? For our use case, not really. We have a nice machine with a lot of memory.
But it could definitely be a concern. Yeah, and in our application, even in our old application, everything was being loaded in. For us, it would have been a problem regardless, but it's certainly something you may have to consider. I think usually the best way to handle that situation when it comes up is first to block things, so try to divide up your problem
into maybe one-gigabyte blocks and run it that way. Cool, thanks. Well, thank you very much. I don't think there are any more questions, so can I have one more round of applause?