How we sped up NumPy’s string operations for NumPy 2.0
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 131 | |
Author | ||
Contributors | ||
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this | |
Identifiers | 10.5446/69458 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
EuroPython 202476 / 131
1
10
12
13
16
19
22
33
48
51
54
56
70
71
84
92
93
95
99
107
111
117
123
00:00
Operations researchString (computer science)DataflowMathematicsResultantObject (grammar)Function (mathematics)System callData bufferData conversionShape (magazine)Element (mathematics)Dimensional analysisTrigonometric functionsClique-widthQuantum stateOperator (mathematics)CodeSequenceString (computer science)NumberProjective planeObject-oriented programmingFunctional (mathematics)Element (mathematics)Shape (magazine)Line (geometry)Order (biology)Software developerControl flowMultiplication signDynamical systemObject (grammar)Data bufferDimensional analysisQuicksortInformationPresentation of a groupData conversionData structureRaw image formatMechanism designOperating systemTwitterRow (database)Semiconductor memoryQuantum stateArray data structurePoint (geometry)Two-dimensional spaceMereologyMathematicsType theoryTupleCore dump2 (number)Analytic continuationComputer animation
07:29
Pointer (computer programming)Array data structureFunction (mathematics)Data bufferoutputModel checkingPrinciple of relativityDesign of experimentsHand fanCone penetration testGEDCOMData Encryption StandardError messageInstance (computer science)InformationLine (geometry)Function (mathematics)Buffer solutionInformationLengthPointer (computer programming)ResultantFunctional (mathematics)outputString (computer science)Element (mathematics)Order (biology)Right angleLine (geometry)SummierbarkeitLatent heatSubject indexingIntegerArray data structureOperator (mathematics)Parameter (computer programming)NumberDimensional analysisData typeSequenceAlpha (investment)Boolean algebraLoop (music)Intelligent NetworkType theoryDescriptive statisticsResolvent formalismError messageMechanism designData bufferTheoryComplete metric spaceMessage passingSystem callClique-widthView (database)Interior (topology)BitComputer animation
14:58
Design of experimentsCloud computingType theoryOperations support systemPermianBinary filePhysical lawRootWrapper (data mining)Discrete element methodMaizeLaceCodeEmulationAdvanced Encryption StandardRun-time systemDecision tree learningCASE <Informatik>String (computer science)Element (mathematics)Alpha (investment)Clique-widthVariable (mathematics)Interior (topology)Semiconductor memoryResolvent formalismParameter (computer programming)String (computer science)Message passingArray data structureNamespaceElement (mathematics)Exception handlingAlpha (investment)ImplementationBitLoop (music)Order of magnitudeResultantType theoryMultiplication signData miningOrder (biology)Different (Kate Ryan album)Object (grammar)Overhead (computing)InformationNumberData typeBuffer solutionFunctional (mathematics)Function (mathematics)Amsterdam Ordnance DatumRight angleWrapper (data mining)Casting (performing arts)CASE <Informatik>Error messageSpacetimeView (database)Figurate numberUltraviolet photoelectron spectroscopyVariable (mathematics)Data bufferRaw image formatClique-widthComputer animationSource code
22:27
Addressing modeRouter (computing)Object-oriented analysis and designSatelliteState of matterQuicksortWrapper (data mining)Projective planeDifferent (Kate Ryan album)Line (geometry)String (computer science)CodeParameter (computer programming)Default (computer science)Multiplication signUnit testingExecution unitSimilarity (geometry)Point (geometry)Link (knot theory)Error messageTraffic reportingBitBuffer solutionCasting (performing arts)ImplementationType theoryCASE <Informatik>Software bugComplex (psychology)Functional (mathematics)Software testingVariable (mathematics)Meeting/InterviewLecture/ConferenceComputer animation
Transcript: English(auto-generated)
00:04
I'm going to be talking about NumPy 2.0. Recently, I spent some time on string ufuncs and doing string operations in NumPy much, much faster than what was being done before.
00:21
My name is Lisanros, this is EuroPython 2024, we're in Prague, and it's the 10th of July, 2024. Most of my time I spend on Python stuff. I work for QuantSight Labs, I'm a CPython core dev, I work on NumPy a bit, and I also co-organize PyGreece
00:42
and PyBerlin, which are two meetups, one in Greece and Athens, and one in Berlin where I live. So NumPy 2.0, it was released about a month ago. Lots and lots of exciting stuff in it, changes and improvements in the C API,
01:00
the Python API, docs improvements, everything that we worked. Yes, with it came deprecations, came removals as well that might break code, but it's an awesome development. And actually, it came out about 18 years after NumPy 1, so NumPy 1 is now old enough to drink.
01:24
And yeah, it was time for 2.0. One of the things that came out in 2.0 was fast string operations, and this is what we're going to be talking about today. Yes, so fast string operations,
01:43
it began as a project by looking at this. And this is C code, but it's very, very easy, what it does. Basically, it's how string operations used to work before NumPy, what it was doing is the first line there,
02:01
it gets the string method, the Python string method, that is, no C function there, it just gets a Python function. Then it converts the NumPy array item to a Py object, that's an internal structure we have in C Python,
02:24
then calls that Python function with that Py object, and then does the conversion back to a NumPy array item. So that means, as you may imagine, this is very, very inefficient. For every item in an array,
02:42
if an array is 10,000 items long, then for all the 10,000 of those items, we need to do this. So to recap, we have to convert the C data buffer to a Python object, we need to call a Python function with a Python object, and then the resulting object
03:04
needs to be converted back to our presentation that NumPy can understand. So the idea was, okay, why do we have to do this? Let's just do it all with a raw data buffer that we have in C. And so our talk today is going to be about the journey
03:24
of converting all of that to working with a raw data buffer. In doing that, we first need to examine what NumPy's ND array is. An ND array, if we have this array, for example, that's a NumPy array, what that is,
03:45
if we assume that it's an int64 array, we have a C data buffer up top, and then we have some information about the data buffer and how to interpret that. NumPy keeps all sort of information like ND-ness of the operating system,
04:04
or whether it's a C or a Fortran contiguous array, that means whether it's row major or column major, the way that it's stored in memory. But the most basic of information is this. What's the element's size?
04:20
For example, in an int64 array, it's eight. What are the dimensions? Or better expressed, how many dimensions do we have? Here we have rows and columns, that's two dimensions. The shape, that's two comma three. And the strides, the strides are very, very interesting.
04:43
Mechanism to describe how can I go to the next element. Because remember, we have a data buffer that's just a sequence of bytes. It doesn't have any information about how many bytes do I need to jump in order to go to the next element.
05:00
So the strides is basically a tuple that says, it's as big as the number of dimensions, and the 24 represents the number of bytes I need to jump in order to go to the next element in the first dimension. The eight is how many bytes do I need to jump forward
05:20
in order to go to the next element in the second dimension. Because this is a two by three array, three times eight is 24, and that means I need to jump 24 bytes in order to go to the first element in the second row, basically. If it's an in32, it's basically the exact same thing,
05:42
only the information changes. The data buffer only changes very, very slightly, it's just the information on how to interpret that data buffer that changes. So yeah, with that, we need to ask ourselves, okay, if it's only the information that changes,
06:02
then what happens with arrays, right? With strings in it. Is the memory layout like this, where we have high prog exclamation point all in a continuous part of memory?
06:20
And if so, what's the item size? Remember that the item size is static, it's not dynamic, so we can say the different items have different sizes, it needs to be one number for an array, right? So what NumPy does is something like this. It appends zero bytes to the end of the items
06:43
that are smaller than the biggest one, and then sets the biggest item's size to the size of the array itself. So for example, for this array, because proc is six characters long, and obviously here we assume that this is a byte array,
07:00
so each character is just one byte, because proc is six characters long, the basic size is six bytes, the dimensions, the number of dimensions is one, the shape is three, and the strides is six. And so we have those zero bytes,
07:21
and that's why, there you go, whoops. That's why we call these D types fixed width string D types. The fixed comes from the fact that all of the array items have the same size, and zero bytes are just appended to the end,
07:42
so that we can keep that same size. So do you remember the problem from before, which said let's do it all in C? So what we can do is basically just this, call a C function operating on the C buffer. And the question that arises is, does NumPy have a mechanism for that,
08:01
and the answer to that is yes. NumPy has a mechanism called ufuncs, which is just that. You can create a C function that operates on the data buffer of the array, and in order to do that you have to do some things.
08:21
The documentation of NumPy includes this quote, it's too much theory and not enough information about how to actually use it, so I just put it there for completeness, but let's just see how to use it. In order to do a ufunc you have to write one function
08:41
that's called whatever you want to call it, but it takes four arguments. The args, the dimensions, the steps, and the data. And NumPy itself calls these functions and provides all the arguments, so you just need to write this and register it,
09:02
and that's all. And the arguments here, args is an array of pointers to the data buffer, so if you have, let's say, a ufunc that operates on two arguments, then args will give you pointers to the data buffers of the two arguments, and it'll give you a pointer to the data buffer of the output.
09:22
Dimensions is a pointer to the dimensions. Steps is the strides we talked about that tell you how many bytes do I need to jump in order to go to the next element, and then data is irrelevant. So when we started doing this, we chose isalpha as the first thing to do,
09:42
because isalpha is the easiest thing to do, so I just wanted something very, very easy. And this is what that ufunc looked like for isalpha. We have isalpha takes just one argument and produces one output, so we get the data buffers for in and out,
10:03
we get the number of dimensions, which is the number of steps we need to do, we take in, which is the data buffer, we pass that to a string isalpha function, they'll do the right thing, no need to care about this here, and then this returns a bool, we cast the out data buffer to a bool star,
10:25
and then write to that the result, and then the steps basically give us the number of bytes we need to jump. And remember how I told you that the data buffers are just sequence of bytes, that's signified here by in and out
10:40
both being just car stars, they don't have information about whether they're storing ints or whatever it is, we do that all by casting to the appropriate type. This is how to write a ufunc, it's very, very easy, it's just a one-dimensional loop,
11:01
and you just jump forward to go to the next element. So the next function was add, and add is a bit more complex in that it also returns a string, remember isalpha returned a boolean, so boolean's really easy to return
11:20
because it's a static thing, we know that a boolean is basically one byte in numpy so we can cast to a bool, and that was it. But if you tried to do the same thing and just write the one-dimensional loop for add, you would get this error, and you don't need to read it all, it just says that you need to provide
11:42
a result descriptors function. What is that? The output descriptors are basically information about what the ufunc returns. Remember how we said that strings are parametric in that the element size can change.
12:02
If the string is six characters long, then the item size is six, if it's more than that, then it can be 10 or 15 or whatever it is, right? But when you are defining a ufunc and you need to return a string, you need to tell numpy how big of a string do I return from this.
12:20
And the way to do that is with this result descriptors function, which basically the two arguments that are important here is given descriptors and loop descriptors, given is what the descriptors are right now, and loop descriptors is something you have to write to and say these are the descriptors
12:41
that I want to have in the ufunc. And if you check this, you'll see that the np.add result does a unicode six, so that means that the first string is abc, three characters long, the second one is def, three characters long as well,
13:02
and then the resulting string is six characters long, so we need basically to add the lengths of the two buffers. And in order to do that, this is how we do that. This is really not important, it just says use whatever you have now for the two input descriptors.
13:23
And for the output descriptor, we can just, you'll see that the important line is where we increment lsize by the lsize of the second element. So what that does is basically says that the output array's length
13:43
will be the input array's length sum. And with that, add worked, and then we had to go to the next functions, and the next functions were find and rfind. For find and rfind, there was nothing really new,
14:03
but there was one thing that was kind of tricky, and that was that find takes actually a start and an end index that you want to search in. So you can use find with just strings, but you can just pass two integers there
14:22
and say I want to search between the first and the fourth index, or something like that. And if you, remember that view funcs operate on specific data types? So we write a view func to operate, let's say, on int64s, but we don't really want to operate
14:41
to write a view func for int32, int16, int8, and all of these other data types. But if we were to pass Python ints or some other data type other than the loop we've written, then we get this, that basically we don't have implemented a loop that will operate
15:00
on the correct data types. And in order to circumvent that, there's something called a promoter, and the promoter is basically information about how to promote certain data types to other data types. Let's say someone passes in an int8 array to a view func,
15:24
we want to cast that to an int64 and then call the loop that implements the int64 view func. And to do that, promoter gives us, again, two arguments that are important,
15:40
OPD types and new OPD types. Similar to what we saw before, OPD types are the D types that the user passes, new OPD types is something we need to write to and say, okay, cast these data types that you took to these other data types I have implemented a loop for, and NumPy will do the right thing,
16:01
cast and then call the correct loop for it. And the way to do this is basically here we say, okay, whatever data type you took, just use that, so no casting occurs, and here we basically say very, very strictly for the third and the fourth arguments,
16:21
the data type needs to be an npy int64, and that's an int64 data type, and so NumPy will try to cast whatever it has to an int64 before calling the U func. For the output, it's not really important here, but you can set it to something else as well.
16:42
So after installing the promoter, we can pass int8, int16, int32, whatever ints we might have, we can pass to find, and it'll do the correct thing, it'll cast to 64, and then because we've implemented a 64-bit loop,
17:02
this is what will be called. So the last thing I'm going to be presenting today is replace, because replace has many, many challenges. The main challenge is that, remember how in add we had the size and we knew about the size upfront?
17:21
We knew it's going, if we're adding a three character string and a four character string, it would be seven characters long. When we're doing replace, we don't really know how many replacements we'll do, or stuff like that. And so the problem is that the output size is not known upfront,
17:41
and thus we really can't implement that resolve descriptors function that we saw before. And in order to circumvent that, we came up with this solution. This is a Python wrapper, no need to read this, but basically what it does is that it counts the number of occurrences
18:01
of the thing that we want to replace, and then it computes the difference in characters between what we're replacing and what we're substituting in, and then basically takes and allocates enough space
18:22
for all of those things to fit in memory. And then if you see here, it calls the loop with an out keyword argument and the out keyword argument is, you don't need to figure out, you don't need to allocate memory
18:41
to store the output of the U func, I'll give you the memory where you need to store this. And we need to slightly change the resolve descriptors func to say, if you've not given me an out,
19:01
then I'm gonna raise an exception, because we necessarily need the memory that where we're going to be storing things, and if we don't have that, then we can't do it. So this was replaced, it wasn't that hard, but we needed a Python wrapper for that,
19:21
and you'll see here that if we call the U func with without the out, then it raises a value error and says the out should be given, and if we call it with the out, then it does the correct thing. The Python wrapper is in a new namespace called np.strings,
19:41
and that's what you should always be using, because you don't need to pass out or whatever else, it basically does everything automatically, but has a tiny bit of Python in front of all of the C stuff. Now about the results, was it any faster after all?
20:02
And the answer to that is yes, it was much, much faster. For example, add was the one that had the most significant speed up, it was about two orders of magnitude faster. In case you can't read it, that's 492 times faster,
20:24
that's not 4.92, that's actually 492 times faster, and that's only for a thousand element array, for even bigger arrays it will be even more of a speed up, and that's that overhead of having to create
20:42
Python objects back and forth and calling Python functions, no need to do all that, just operate on the raw data buffer. So yeah, the speed up was amazing, and even for the most basic of functions, like is alpha, we had speed ups upwards of 100x for relatively small arrays like thousand element arrays.
21:04
But that's not all, NumPy 2.0 has another very, very nice update, and that's variable with string D types. You can imagine that if you have arrays with thousands of elements, and you have fixed width
21:23
elements, then that means you're wasting bytes adding zeros all over the place, and that's not memory efficient, not at all. And while all of this was brewing, a colleague of mine was working on variable with string D types, and what this does
21:43
is basically gets rid of all of those zero bytes, and you can have string arrays where the underlying buffer only really has the data that you have in the array, and no zero bytes, nothing like that.
22:01
There are some caveats to that, so if you want to read about it, I think the NEP, the NumPy Enhancement Proposal is 55, NEP 55, if you want to read more about that. So yeah, that's it on my side. I'll take any questions you might have.
22:30
Thank you, so we have five minutes for questions. I think we can take about two or three questions. Okay, one thing I'm wondering is,
22:41
as you're writing these ufuncs with complex casting and buffers and stuff like that, how do you test your code to make sure that there's no possible complex error cases created? We're testing extensively with just having unit tests
23:04
for all of the Python wrappers, so for every ufunc we write, we basically have a Python wrapper that does things like default arguments and stuff like that, and then we call that, we have unit tests for those Python wrappers.
23:24
That's not to say that there's no bugs, there is bugs. In fact, this morning there was a bug report about strip not doing the right thing now in NumPy. That's been fixed though, and yeah, I mean, there's always going to be bugs, we'll fix them, and I don't think there'll be too many.
23:46
So if you don't mind, I wasn't accusing you of writing bugs. I was more saying, what advice would you give me so that I can stop writing bugs? I mean, the approach in the NumPy code base
24:02
is to just write as many tests as possible, trying to catch as many edge cases as possible, and then hope for the best, I guess. Probably, if I knew more about NumPy, I'd probably already know the answer to this.
24:20
In coding, so all the examples are ASCII. The links fixed byte sizes, or presumably must be rather than characters. Is it UTF-8 internally or what? So fixed with D types are only bytes or Unicode, but that's UTF-32, and all of that
24:43
was for both of those D types. The new string D type has support for UTF-8 as well. I know that Pollers, prior to their 1.0 release, did a big internal rewrite of how they handle strings, mainly for performance.
25:01
Are there similarities between what they're doing and what you're doing, or are the two tools just aiming for different things? I didn't get what project you were talking about. Pollers, sort of one of the replacements for Pandas. Yeah, I'm not sure about the string D type because I think that Pandas has had a D type
25:22
for variable with strings for a long time now, and Pollers might have adopted some of that functionality. I'm not sure I can say much more than that. Sure, I think Pollers have diverged to a new implementation, German-style string implementation. I don't know all the full details, but in a bit of a speed, but maybe it wants to take offline, thank you.
25:43
Thank you.