Addressing multithreading and multiprocessing in transparent and Pythonic methods
Formal Metadata

Title: Addressing multithreading and multiprocessing in transparent and Pythonic methods
Title of Series: EuroPython 2018 (talk 36 of 132)
Author: David Liu
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifier: 10.5446/44969 (DOI)
Language: English
Transcript: English (auto-generated)
00:06
Thank you so much. Good morning. My name is David Liu. I'm a Python technical consulting engineer for Intel, and today I'll be talking about addressing multithreading and multiprocessing in transparent and Pythonic methods.
00:21
As a general overview of this talk: I'm going to state what the current state of concurrency and parallelism is in the industry; talk a little bit about nested parallelism and oversubscription and what those problems are; talk about composable methods and thread control; and then cover how some of the packages that address these issues work under the hood, what it means
00:45
to have a Pythonic style, and the future of Pythonic style for parallelism. One of the things I like to point out is that the Python language itself has had a lot of luck in attracting good talent, many of the best people for addressing
01:07
concurrency and multiprocessing, and we can see that listed here. We're one of the few languages that encompass all of these frameworks. Over the years, you can see the progression of the frameworks, which has included general threading, multiprocessing, and task-parallel types of workflows,
01:28
and you can see that the large number of packages now in this space helps fill out the Python ecosystem, such that we have a lot of options when we choose to go for parallelism.
01:41
And we're one of the few languages to have that. From 2008 to 2017, you can see the large number of packages that we have, and a few of those packages have actually been talked about at this conference. So again, the options in this space are very good compared to other ecosystems,
02:01
and the majority of them do a very good job playing nicely with the global interpreter lock. If you were expecting this talk to get rid of the global interpreter lock, this is probably not the talk for you; but we do a very good job with distributed and vectorization techniques, or by working nicely with the GIL, and that's true of many of the
02:24
packages that are included in this ecosystem. For the more domain-specific areas, one can rely on high-performance C libraries to do that type of work for you, to harness parallelism and threading. SciPy and NumPy do a great job of this: when you make a NumPy call, under the hood it's calling a C library,
02:42
which is doing the majority of the data-parallelism work that is required to get the job done quickly. With that being said, one of the recent trends in the industry is increasing core and thread counts, which are becoming more commonplace in the server space and even in the laptop you have.
03:02
You're seeing an increasing number of cores and threads becoming available. And because of that, nested parallelism and oversubscription are now quite possible in the kernels that you're running. Some of you may be asking, well, what exactly is that? We'll go into that in a little bit.
03:20
But let's first talk a little bit about the GIL, because this topic gets talked about a lot. The GIL has been complained about by many people in this space, and many efforts have been made to remove it; there have been quite a few talks at PyCon in the last few years trying to do it. There have been a lot of efforts, including some very valiant
03:41
attempts to remove it, too. But as it stands, what the GIL provides us is relatively important, and it's hard to ignore: the read-write safety of Python objects, predictable behavior. The language really wasn't written to be thread-safe, and the guarantees that you get with types and everything come from the guarantee
04:00
that the GIL provides you. In addition, when you're developing your own modules, extensions, and other things, that type of expectation on the developer is a very hard one: if I'm developing a framework, I now have to assume that
04:21
it's not going to be single-threaded, that other people may be accessing my objects. That's extremely hard to test, so passing that burden onto the developers is also not that great an idea. And again, that's why the GIL provides something that allows you to easily work on and create extensions for Python.
04:42
And because the GIL provides that safety, and we have so many good frameworks, it's kind of a non-issue today. Many frameworks have found a way to cleanly step around the GIL, and SciPy and NumPy are great examples of this: you basically send a command for numpy.dot or something similar.
05:04
It gets dispatched to your BLAS API, and you're then using Intel's Math Kernel Library (MKL) or OpenBLAS, depending on your implementation. That gets vectorized and parallelized inside the CPU and gets dispatched completely
05:21
transparently to you. So NumPy and SciPy do an amazing job of this, and that's one example of cleanly stepping around the GIL by understanding what the data flow is. And there are a lot of other frameworks that utilize this type of vectorization: Numba, numexpr, and Cython all do this type of vectorization work
05:41
for you while allowing you to stay within the Python layer. Multiprocessing frameworks, now included in Python's standard library, have great ways of escaping via separate processes, not just stepping around along the vectorization line; but you can also have separate threads within those processes,
06:03
and that's where some of the oversubscription problems can happen. Generally, exiting the GIL with a C library is the most Pythonic way of doing things, and this has been said by a lot of people in the numeric space: if you understand the abstraction
06:22
of your computational flow, you can write a library that does this type of work, wrap it in Python, and that essentially is the most Pythonic way of operating. This composition of abstracted flows, which you can also achieve by splitting off into multiple processes, can be a cleaner way of escaping the GIL.
06:44
And it's very rare to absolutely necessitate the language being thread-safe. There are very few instances where we would ever really need that, and I think the loss of Python's advantages would probably be
07:01
the main detractor if we started doing that. So let's break the space up into three main areas. With this Venn diagram, if we look at application-level parallelism, single-threaded concurrency, and data-parallelism focus, we can split up the majority of the frameworks in this space and
07:22
categorize them and see what areas overlap. You see the area that has been talked about a lot, with Trio and Tornado or Celery, or anything really that lies within concurrent.futures. That's another big thing: when Python came down to say that concurrent.futures is the API we want
07:41
to really support, that was huge, because now a lot of these frameworks are designing towards it. You can see that that area is more like single-threaded concurrency; and when most people think they need parallelism, they most likely just need concurrency. Then when you get to application-level parallelism,
08:00
you're seeing multiprocessing or joblib or similar frameworks in that space; Dask also encompasses part of it. When you get to the data-parallelism focus, you can see that the packages we talked about in the numerical space, NumPy, SciPy, Numba, Cython, and numexpr, all sit in that area, because they understand
08:21
that data parallelism is the area they want to focus on. By abstracting that call, you can exit the GIL and do that data-parallel type of work while being able to return it all back into the Python layer. Then, when you get into areas where you need both single-threaded concurrency and data parallelism, you can get mpi4py
08:42
or some really unusual combinations of concurrency and data-parallelism focus that lie in that area. And the center area, obviously, maybe it's a unicorn, maybe it's mpi4py, which is obviously also a little harder to work with. So this hopefully gives you an understanding of what the different areas encompass.
09:03
What I want to do with this, though, is to focus down and talk about two specific areas today. If we take a look at application-level parallelism and data-parallelism focus, this is where a lot of the final frontier has been sitting. So if we expand that into three areas,
09:22
we have Python multiprocessing, Python multithreading, and then data-parallelism focus. Now you can see where some of these frameworks lie, and Dask is clear in the middle of it; it's one of those actually interesting types of frameworks. If we were in the US and Matthew Rocklin were here, he'd be very happy, as he's one of the main maintainers of Dask.
09:45
That being said, now that we understand what space I want to talk about today, this area at the intersection of them is where nested parallelism and oversubscription can occur. When you start mixing these different libraries, multiprocessing with NumPy, or Numba with
10:03
other elements composed on top, or when you start getting into multithreading, this is the area where oversubscription and nested parallelism can occur. So you may ask, what does that actually look like? The answer is that it can look like relatively benign code.
10:20
For many of you, this may look like a very simple thing that you would run into if you were just developing with NumPy, or just trying to scale a little bit. Here we have a NumPy call with random, we create a thread pool from multiprocessing's pools, and then we do a pool.map on a NumPy call.
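A minimal sketch of the kind of code being described, reconstructed from this description and from the demo later in the talk (the array shape, pool size, and the choice of QR decomposition are illustrative assumptions, not the speaker's exact code):

    import time
    import numpy as np
    from multiprocessing.pool import ThreadPool

    # Some random input data: 100 independent 256x256 matrices.
    data = np.random.random((100, 256, 256))

    # A pool of Python-level worker threads.
    pool = ThreadPool(10)

    for _ in range(3):
        start = time.time()
        # Each np.linalg.qr call may spawn its own OpenMP/BLAS thread team,
        # so this map quietly nests BLAS threads inside the Python thread pool.
        pool.map(np.linalg.qr, data)
        print(f"iteration took {time.time() - start:.2f}s")

On a laptop with a few cores this looks harmless; on a large server, every worker thread can fan out into a full BLAS thread team, which is exactly the nesting described next.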
10:42
Well, what exactly have you done here? The problem is that you've now created a composable type of nested parallelism without even knowing it. And that's where it can get really scary, because now you can have threads being spawned in a nested parallel process.
11:01
If you start putting this on a larger compute system, it can go out of control. You go from P Python threads to the threads inside NumPy, and that nesting can create nearly double, or even a quadratic number of, threads if you're not careful, depending on what's available in the system.
11:23
So you go from a relatively known set of threads, where you think, okay, this one calls NumPy, I know what's going on. Well, now if I call that with multiprocessing on top of it, I've just created a mess and tangle of threads, because each one can spawn a bunch of others, relatively uncapped, since no rules are being passed down.
11:44
So what are the problems with that? You essentially get oversubscription: you'll have far more software threads than are actually mapped to the CPU, with no rules controlling them. With that many threads, you'll have direct OS overhead
12:02
for switching out threads, and the CPU cache becomes cold. You're going to take a performance hit and say, well, this actually ran faster on my laptop; how did that happen? It's kind of an invisible impact if you're not used to it. And threads end up waiting on each other to return; the workload simply
12:21
has way too many threads for the actual logical cores. Now, a lot of the popular frameworks that have this problem have solved it in a relatively simple-ish way. It's not the cleanest way: what they do is lock the number of threads to one for that specific process, which is an okay-ish solution,
12:43
but it doesn't scale well. They'll set OMP_NUM_THREADS, they'll set the block time lower, but it's not always the cleanest way. Scikit-learn definitely has this: if you use grid search, you'll see it. PyTorch and TensorFlow exhibit this problem as well.
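For reference, that workaround usually looks something like the following sketch (the exact variables depend on your BLAS and OpenMP runtime; KMP_BLOCKTIME is the Intel OpenMP spelling of the "block time" mentioned here):

    import os

    # Pin every nested BLAS/OpenMP region to a single thread so that only
    # the outer (process or thread pool) level of parallelism uses the cores.
    # These must be set before NumPy/MKL is first imported to take effect.
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"
    os.environ["KMP_BLOCKTIME"] = "0"  # idle OpenMP threads yield immediately

    import numpy as np  # imported only after the environment is set

It works, but it throws away the inner level of parallelism entirely rather than coordinating the two levels.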
13:00
That's because of the type of composable parallelism they use to deliver machine learning or other forms of work; this is one of the issues they run into. I'll talk a little more about SMP when we get to it, but SMP is one of the packages that addresses this. So now let's talk about the composability modules
13:22
that help address this space. One of them is tbb4py, which is included with our Intel Distribution for Python, and it's free. It's actually a Python C extension for managing nested parallelism using a dynamic task scheduler.
13:42
If you use the version of scikit-learn in our Intel Distribution for Python, we actually use TBB under the hood for some of those operations. But when you start looking at what it's providing, and this is the focus today, what exactly is it providing as the dynamic task scheduler? If you have dynamically mapped tasks,
14:01
and some of them occasionally finish a lot faster than others, it's able to put those threads back into the pool and spawn new tasks, even if the work is unbalanced. So it handles unbalanced work relatively well. It instantiates via monkey patching of Python's pools, enabling the TBB threading layer
14:21
to be interchanged with the MKL one. No code changes are required on your part, because of that monkey-patching capability. Another one that we use in this space is static multiprocessing, or SMP. It's a pure Python package that manages nested parallelism through coarse-grained static settings.
14:41
What that means is that it tries to augment your parallelism by taking the rules that have been defined for your parallelism, and the relevant environment variables, and passing them down to the inherited processes, controlling oversubscription via that method.
15:01
So it handles workloads that are a little more structured. It again instantiates via monkey patching, and it uses affinity masks and OpenMP settings to statically define and allocate resources, avoiding those excessive threads. So now if we return to this example,
15:21
you can see that these two packages can address the issue we have here. So with the nested parallelism, how does that actually work? TBB approaches it like this: you have your application, you have your OpenMP threading, and you have separate but uncoordinated
15:42
OpenMP parallel regions. What happens is that too many software threads get mapped and compete for all the logical processors. Running under the TBB module essentially says: this is the pool that's defined, and it can dynamically allocate
16:02
and release threads to operate within that pool. So it tries to keep them mapped to the logical processors while keeping oversubscription in check. Because if one of these starts spawning five or ten threads
16:20
while the other one spawns one, you can start to see where the problem occurs. Whereas here, if one wants to spawn, it will still be pulling from the same pool and still be mapped to an actual logical processor. Now, SMP does it in a completely different way. For the same problem we had before, it takes the thread pool implementation
16:40
and propagates the mask and settings down to each of the individual spawned processes. So you're essentially augmenting your MKL or BLAS threading so that those augmented settings are passed down to each of the threads created from those processes.
17:01
One of the advantages here is that you can actually mix the types of threading; you can handle both types of OpenMP threading in this case, which is relatively powerful. So one of the things I'm going to do here is show you a small demo of what this looks like when you start having oversubscription.
17:23
And I'll be running this on one of our relatively large two-socket servers, to show you what that looks like, and then show you how these frameworks address the problem. So right here, I'll show you
17:41
what type of setup this is. With hyperthreading, it has about 88 logical cores; it's a two-socket Xeon system. So one of the things I'll do here will take a bit of a while, but let's see, let's hope the SSH tunnel works today.
18:13
So this will take a little bit of time. What this code is actually running, and I'll show you here, is a relatively benign piece of code.
18:21
Again, we have a for loop, we have a thread-pool map from multiprocessing, and we have a NumPy call inside of it. One of the things you can see from this is that it's a relatively small amount of code that we might have written ourselves
18:42
that could actually cause this problem. If I were to run this on my local laptop, it wouldn't be too bad. But running it on a system with that many cores and that many threads, it's going to take a while. This example will repeat three times
19:02
and display the time it took to complete. The first one took 39 seconds, so now I have to burn off another two times 39 seconds while I'm talking here to let this complete.
19:21
While we're letting that run: essentially, we have our data, which is created by NumPy's random; we have our thread pool, created through the multiprocessing pool; we have a loop of three, which loops three times and records the time. It's a relatively simple call here,
19:41
and then a QR decomposition over the data in that range, essentially the code sketched earlier. Okay, so we've hit the second one now, so we just have to burn off another 39 seconds. Again, this is because, as the last slide showed, you're essentially hitting oversubscription:
20:02
it's saying, oh, I have all these threads that I can address, and multiprocessing then maps onto those threads: look at all the threads I can create. It's just going to create as many as it can. And that's where you can get in trouble: you've written your application, it works great on your laptop, you scale it to your server,
20:23
you're on your production machine, and, oh, why is it slower? Why is it so much slower? So one of the things you can do now is run it like this. What TBB is going to do is set its pool size.
20:42
There are some defaults, and you can look at what those are by running the tbb module with --help; it'll show you what the default sizes are, and you can set that dynamic pool size. So if I start running this, I'm probably not going to have enough time to finish my discussion here before it just decides to clean itself up.
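As a sketch of the invocation being demonstrated (the script name is a placeholder; check python -m tbb --help for the actual options in your installed version):

    # Run the unmodified script under TBB's dynamic task scheduler.
    python -m tbb demo_oversubscription.py

    # Inspect the defaults, including the dynamic pool size.
    python -m tbb --help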
21:01
But yes, there you go. When we talk about combating oversubscription, the cost of that problem and of nested parallelism is very evident now. Something as simple as the demo I just showed you, TBB can just handle,
21:21
and that's relatively Pythonic. You can call your script under the tbb module having made no code changes; I made zero code changes to this thing, and it actually did that. SMP handles it a slightly different way. So now if we call it under SMP and run it,
21:43
then again, it takes those settings, that augmented style of parallelism, and now it completes relatively quickly. So here you can also see that SMP accomplishes the handling of nested parallelism and oversubscription in a completely different way.
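The SMP invocation mirrors the TBB one (again a sketch; any flags for tuning the affinity settings are version-dependent, so consult the package's help):

    # Run the same unmodified script, with SMP propagating static
    # affinity and OpenMP settings down to the spawned pools.
    python -m smp demo_oversubscription.py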
22:01
But it still addresses the problem, and handles it in a relatively simple way, by letting you run under SMP without making any code changes. Okay. So now that we've seen this demo, I think it's time to bring it back a little bit
22:21
and talk about the industry again. In Python's ecosystem of concurrency and parallelism, the concurrency and async areas are very rich with packages; there are a lot of packages in that space. We've done a lot of work with concurrent.futures, and it helps solve the needs of the majority of Python users. But when we look at the areas
22:41
of true parallelism and data parallelism, it's a strong area, but its focus has been relatively small in comparison to the concurrency and async offerings. That's why, when we look at the packages in this space, not much has been shown in the area, and we're trying to make headway now
23:01
into what is one of the final frontiers of parallelism in Python. Most of the ways of achieving parallelism in this area rely on vectorization frameworks, or on multiprocessing or distributed methods. So I think that raises the question of how you do it in a semi-Pythonic way. I'm going to introduce this somewhat silly idea
23:22
of Pythonic-ish. I'm not saying it's truly Pythonic, because that's a whole different discussion, but let's just talk about Pythonic-ish. What makes something relatively Pythonic-ish? Relatively few code changes: you might have a small number of code changes, maybe modifying the current behavior
23:41
of one's framework to fit your needs, so as to prevent a massive rewrite. So that's one of the criteria. Is it directly in the Python standard library? Is it writable from the Python layer, or do I have to drop into a lower-level language like C to be able to use it?
24:00
Is the interface easy to understand, and does it keep you in the Python layer rather than dropping to an intermediate representation? So I think that poses a question: how close can we get? If we look under the lens of tbb4py, it meets quite a few of these, but two of them aren't met:
24:20
it's not directly in the Python standard library, and it's not writable from the Python layer. But on the other hand, you don't have very many code changes; you're not modifying a lot of the current behavior of that framework to make it work for you. It's a relatively easy interface: you saw that I just called it under the tbb module,
24:40
ran the script, and made no code changes, so it's a relatively easy interface to understand. You can set options with some command arguments if you need to, and it keeps you in the Python layer and doesn't drop to an intermediate representation. Looking under the lens of SMP: relatively few code changes, and it doesn't modify any current behavior or framework.
25:02
Now, one of the interesting parts is that it is somewhat writable from the Python layer, because it does have an API you can use; but you can also use it without that, running it just like I did, under the smp module, letting it pass down the settings. It's relatively easy to understand, and it keeps you in the Python layer.
25:22
The other thing to add here is that SMP is completely in Python, so you can look at it on our GitHub; it's a pure Python package. So from the standpoint of being able to integrate it into a solution or into other people's frameworks, it's relatively simple. It's still, again, not in the standard library,
25:41
but it's maybe a little closer to it, while accomplishing things a different way. So I think this poses these final four questions. First: how realistic is it to have a firm requirement for a pure Python implementation? TBB is not a pure Python implementation, but SMP is,
26:00
and again, we're talking in the light of addressing nested parallelism and oversubscription. The second question: what is the best way to modify your Python code? Is it monkey patching? Is it a different framework? How do we want to address that space when we want to modify our Python code to operate under that augmented threading?
26:25
And at what level should the parallelism be controlled? Should we be controlling it at the module call level? Should we be controlling it when we're calling it from our own source code? Where should we be doing that? And can an interface be agreed upon
26:40
to operate on that parallelism? concurrent.futures did that relatively well; can we do the same? So let's answer the first two questions, now that the demos have been shown for tbb4py and SMP. How realistic is it to have a firm requirement for a pure Python implementation?
27:00
I would say it's not required, but it's highly recommended. We can see, with the uptake of the packages we've released, that people are trending more towards the pure Python variants. There are also limits to what you can do from the pure Python layer, but maybe that's something that vendors can work out with the PSF
27:22
and the core developers. And what's the best way to modify your Python code? Is it through monkey patching on your framework? It seems like monkey patching is the new normal in this space; we're seeing a lot of examples where monkey patching is becoming the de facto standard when making packages that augment other packages' behavior.
27:44
We see that in scikit-learn, we see that in other places, so that seems to be the new normal, and that seems to be okay. Then you have the question of at what level this parallelism should be controlled. Should it be controlled at the Python layer, maybe?
28:01
I think the answer is that it sort of can be controlled from that area. The challenge you'll start finding is that it needs directives for how additional layers can compose it, and maybe some type of composing directive would be useful in that space.
28:21
Can an interface be agreed upon to operate on that parallelism? I think the jury's still out on that one, because with every iteration we make of these packages, we learn something new; we learn what works and what doesn't. It seems like the Python community is still in that space, and I urge you, if you're in this space, to continue pushing and seeing what makes sense.
28:44
We're still very young in this space in terms of knowing what is Pythonic, what's the best way to operate on and augment your threading behavior, and how to keep that scaling when you actually deploy to your production cluster or something similar.
29:03
But with SMP, we do get a slightly clearer picture of what it could look like. So now that I've talked about all of this, I think you can see that tbb4py and SMP attempt to address the Pythonic-ish criteria I've set out and augment the way you do multithreading and multiprocessing,
29:23
and to do it in a way that means you don't have to modify a lot of your code. I would still say it's best to leave the two forms, multiprocessing and multithreading, at their same levels, and not change too much of how we interact with them; at least from the Python and C levels,
29:41
try to keep them at their respective levels. And multithreading is domain-specific: when you do something data-parallel, you typically know the domain you're operating in, and that seems to be the best guide. You have a lot of options for staying in Python, or you can drop down to C if you need to.
30:01
NumPy has decided that it wants to be in C, and other frameworks allow you to stay within the Python layer and still not have to build against some type of C-based library; Numba, numexpr, and Cython are great examples of this. And one of the thoughts is: what if you actually had some type of directive
30:23
to say, okay, at this point I want only maybe 20% of the threads to be spawnable during this one section? That might be better. You could leave that in the comments, but doesn't that just sound like #pragma omp? So I think that raises the question of what is Pythonic-ish at that point.
30:42
If we're leaving things that literally look like C, is that really that useful? Are we complicating the language? Is that maybe the way we're going to achieve composable parallelism when we start combining them? I think that poses a great question. Augmenting the threading behavior
31:01
seems to be the more useful approach, based upon the experiments we've run, but that also means putting the bulk of the responsibility on the users themselves. If you're a framework designer, how you choose to do your threading is really your choice,
31:21
and I think that is a relatively heavy responsibility; not as heavy as expecting your code to always be completely thread-safe, but still a high requirement. And threading in general, for numerical work, has a lot of known frameworks. I think the thing is,
31:41
if you're going to try to remove the GIL, or do anything similar, you're going to be removing the ability to use just a plain Python object, and then you'll need stricter typing. So that poses the question: well, why are you actually using Python in that instance?
32:00
So to summarize everything and end it: the Python ecosystem has a critical mass of good frameworks, which we walked through today, that look to address multithreading and multiprocessing. For those of you who are working on this, keep on pushing and seeing what the limits are. With today's demonstration, we're showing what we're trying to do in our space, and we encourage you to either contribute
32:22
or find and propose other ways of doing so. So thank you, and with that, I'm open for Q&A. Thanks very much. Does anyone have any questions?
32:47
Hey, great talk, thanks. You were asking about a good interface for integrating this into systems that require parallelism in Python.
33:01
You're probably aware of joblib. Yes. So how does that interact? Is your stuff running below joblib, or do you directly integrate with it somehow? So that's a great question. If we step back a little and look at where joblib sits within this picture: when you have joblib,
33:21
and then you're calling things from joblib, those are two different layers of parallelism. And that's what I was talking about: because there's no real communication between what the joblib requirements are and the NumPy call that you put into joblib,
33:40
that's where that oversubscription can occur. I think joblib does a very good, well, good job, no pun intended, of separating those tasks out and easily defining a way to compose the jobs you need to do in a task-parallel format.
34:02
Its biggest comparison, I think, would be Dask, and both of them do a very good job in that space; but I think we still run into the composable-parallelism problem. We have something that's clearly application-level parallelism, and something that's clearly data or task parallelism,
34:24
and that link, from the way we've defined Python, either needs to be kept separate, or we need a way of interlinking the two without breaking the APIs they were defined with. I think we'd lose the abstraction capability
34:42
if we tried to bring that layer down too much. Thanks. Anyone else? So, you mentioned CPU-bound parallelism,
35:02
and that it's very clear when you can know about the oversubscription; but there's also the async-await pattern, which is basically for IO-bound processes. In that case it's very hard to determine the better approach, because you don't know if the CPU is starving
35:23
while you are doing all the IO-wait processing. How would you approach that? So, one of the things I do as a consultant is work with customers that have those styles of problems, and to determine whether we have a CPU-bound or IO-bound problem, and what's actually the issue,
35:41
we typically use a Python profiler. One of the products we use is Intel VTune Amplifier, so we'll take a look at what's going on from a code-profiling perspective, and then try to look at the behavior of the code in that space. Sometimes it takes a little static analysis
36:01
to determine that, or looking at IO saturation with the tools; but it's actually a very hard thing to detect, and you're right, it is extremely hard to know. Given no tools, even with the open-source profilers in the Python space, it's very hard to know what's going on. Okay, thank you.
36:21
Hello, good talk, thanks. Two small questions. Is SMP developed by Intel as well? Yes. Okay, and which one was the first to be developed? TBB was the first: Threading Building Blocks has been around for quite some time. I think it became open source recently,
36:42
but it's the longer-legacy one. SMP was developed, I think, about a year ago, to address this space, because one of the systems we used had 63-plus cores per socket, and then you scale that across a lot of them.
37:02
We started seeing that this problem existed when you scale out like that, so that's why we developed it a little later in the game. Okay, and what is tbb4py doing in C? It's operating with the TBB runtime, actually.
37:23
So one of the things we do is ship it as one of our libraries, so it's actually operating directly with that dynamic library. Basically, if you download any of our packages that utilize it, like NumPy or scikit-learn, it'll download the runtime and interact with that library at runtime.
37:43
Yeah, thank you very much. Thanks. Anybody else with a question? Yeah. Hi there. Great talk.
38:01
I'd like to ask: what if I would like to run a few programs under TBB or SMP? What would happen? Will they understand that oversubscription could happen between them? Or should I concentrate everything
38:21
in one program to avoid that? That is a great question. The way that both of these packages work, and the way most tools that accomplish that type of control in this space have to work, is that you have to start from a single Python process in order for it to work.
38:41
If you think about how joblib and Dask do it, they start from multiprocessing down to threading based upon those processes; but it all starts from the same one. If you start them as different Python processes, not started from the same one, that's problematic, because they don't see each other, so they're going to have different pools.
39:01
Whereas if you start from the same one, they'll share the same pool, and it'll be able to handle oversubscription better in that manner. So it's better to start from a single Python process if possible. I don't know if I got the difference between TBB
39:22
and SMP. What are the use cases, one over the other? So, with TBB, it handles dynamic types of threading better. Say you have something that usually returns within 10 seconds, but has a chance of returning in a second.
39:43
With TBB, it'll be able to say: okay, this one ended quickly, we're going to put this one back in the pool and let it come back out. SMP handles it a different way, which is to pass down the settings for the number of threads that can
40:01
be spawned from each process. So say I have OMP_NUM_THREADS equal to 1 or 2: it's going to say, okay, for this process it's going to be 2, for this one it's going to be 1. It's passing those settings down, and that's how it controls it, by not letting anything go outside of them.
40:21
But it's better for structured work. Because if you think about it, if the work is structured and those settings are passed down, the threads will stay essentially semi-pinned to the processors and not jump between processors all the time; you'd have cache issues if they did. So symmetric work is generally better for SMP, and more dynamic types of parallelism
40:44
that have the chance of returning a little earlier, or unbalanced work, are better for TBB. Thank you. Thanks. Any more? Can we use these packages without the Intel version
41:04
of NumPy, scikit-learn, and so on? You can download the packages themselves: you can download tbb4py as a standalone and just run it with your own packages; again, this is part of the Pythonic-ish discussion. You can download both of these packages independently of our distribution.
41:23
They're on our conda channel: you can just use the -c intel channel, if you're using conda, to look up these packages.
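As a sketch, the install commands implied here would look something like this (package names as given in the talk; check the channel for current availability):

    # Install the composability packages from Intel's conda channel.
    conda install -c intel tbb4py
    conda install -c intel smp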
41:41
Thank you. One more. Hi, can you say something about platform compatibility? I guess it runs on Linux, but what about BSD, Windows, OpenSolaris, and so on? Great question. TBB runs on all platforms right now: we have it for Mac, Windows, and Linux, including the majority of Linux flavors.
42:02
SMP right now is Linux-only, because of some of the items we're using, but we're looking to see what other options we have in that space. It's currently only on Linux for the time being. Any other questions? Is that one back there?
42:21
No? All right. Well, we'll thank our speaker, David, again. Thanks.