
Tricks for Efficient Multicore Computing


Formal Metadata

Title: Tricks for Efficient Multicore Computing
Title of Series
Number of Parts: 43
Author
Contributors
License: CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Place: Erlangen, Germany

Content Metadata

Subject Area
Genre
Abstract
In this presentation you will learn about:
- The concurrent.futures API of Python 3
- Threads vs forked processes vs spawned processes (pros and cons)
- The impact of the GIL on CPU-bound Python programs
- Bad interactions of fork-based multiprocessing and OpenMP runtimes
- BLAS-based parallelism (e.g. MKL, OpenBLAS)
- Memory bandwidth bound operations, arithmetic intensity and the roofline model
Transcript: English (auto-generated)
So, good afternoon, I hope you have enjoyed your lunch. Now we are going to continue with the talks. The first one is by Olivier Grisel, and he will be talking about tricks for efficient multi-core computing. Please.
Thank you. Hi everyone. My name is Olivier, I'm a software engineer at Inria and I work, among other things, on scikit-learn. Today I'm going to talk about how to use the multiple CPUs of a single machine to run compute-intensive workloads. Is the volume good enough in the back? Yeah, all right. This is joint work with Thomas Moreau from CMLA and ENS, and the slides are online, so if this is a bit too small on the screen you can check them out online. Today I'm going to talk about the concurrent.futures API from the standard library. This API is quite flexible and it allows you to do thread-based or process-based multi-core computing, and we'll discuss the pros and cons of the two options, some limitations, and some extensions that we provide in a third-party library. Finally, I would also like to discuss a bit of thread-based parallelism for array operations using BLAS via NumPy, for instance, and how to better understand the performance profile of those operations.
So let's start with embarrassingly parallel computations. In the standard library there are two options. The oldest one is the multiprocessing Pool class, which is the first implementation of a pool of workers where you can dispatch function calls to be executed in parallel on different cores. More recently, in Python 3, there is a new API called concurrent.futures, which is also part of the standard library (on Python 2 there is a backport package if you want). It reuses the under-the-hood building blocks of multiprocessing, but it provides a new front end that is more stable, and the API is more flexible and nicer. Finally, we'll talk a bit about how to make it even better using a third-party project that we call loky.
So, concurrent.futures. In concurrent.futures we have this notion of a future. A future is an object that is a reference to the result of some asynchronous computation that has been dispatched to a worker running on a separate core, outside the main thread of the main program. The future itself can be in four states: not started, running, cancelled or done, and it's a Python object that you can introspect without blocking, using the running(), cancelled() and done() methods. You can also call blocking methods on that object to wait for and fetch the result. If the function call raised an exception and you call result(), the exception is re-raised; but you can also inspect whether or not there is an exception and fetch it that way. So let's see an example. Assume you want to do a heavy computation, for instance fitting a machine learning model with some hyperparameters on a dataset.
We have this function fit_model that just returns a model, but assume it takes at least several minutes or maybe several hours to complete. What concurrent.futures provides you with is an executor class, for instance the ThreadPoolExecutor class, but you could also import the ProcessPoolExecutor class. When you use it as a context manager with a with block, as soon as you enter that block it creates the workers, which can be threads or processes. While you are inside that context, you can submit function calls using the executor object. The function fit_model has been defined in the main program, but you can submit its execution: it ships the function and its arguments to some worker, and what you get back in the main program is a future object that you can use to asynchronously track what's happening on the worker. You can go on and submit additional tasks, for instance another function call on another worker with different parameters, and in the meantime you can do other stuff in the main program, like monitoring what's happening or updating the user interface. The execution happens under the hood on those workers, and you can fetch the results afterwards. To fetch a result you just call result() on the future object and it transfers the result of the computation back to the main program. We can do the same for the other one, which is already done, so we don't have to wait. As soon as we exit from this context-manager with block, it destroys all the workers, releasing all the system resources that were needed for the computation, and you are left with the results you fetched previously.
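A minimal sketch of the pattern just described; fit_model, its parameters and the dataset are placeholders rather than the speaker's actual code:

from concurrent.futures import ThreadPoolExecutor  # or ProcessPoolExecutor

def fit_model(params, data):
    # placeholder for a long-running, compute-heavy function
    ...

data = ...  # placeholder dataset

with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() returns immediately with a Future object
    future_a = executor.submit(fit_model, {"alpha": 0.1}, data)
    future_b = executor.submit(fit_model, {"alpha": 1.0}, data)

    # non-blocking introspection of the future's state
    print(future_a.running(), future_a.done())

    # result() blocks until the worker is finished and re-raises
    # any exception that was raised inside fit_model
    model_a = future_a.result()
    model_b = future_b.result()
# leaving the with-block shuts the workers down and frees their resources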
So we can choose between two kinds of workers: one is thread-based and the other is process-based. A thread worker is a real system thread, so under Unix it's a pthread and on Windows a Windows thread, but all the threads share the same Python interpreter: there is only the main Python interpreter running in the main program, and it is reused by the worker threads. This has many advantages: it's very fast to start and stop a thread, there is a memory overhead but it's very low, just the thread state itself, and there is no copy of the objects that you pass to the functions. More importantly, there is no communication overhead between the main process and the workers. But we are then sharing Python objects, and because we are sharing them we need to make sure that the interpreter state and the object states are not corrupted by concurrent execution. To do that, the CPython implementation uses a global interpreter lock.
This introduces some challenges. In particular, the global interpreter lock is acquired every time you access an attribute, call a function or a method of a Python object, or mutate the state of a Python object. It was not originally designed for efficient multi-core computing but for simplicity, to make sure the interpreter stays safe when you run on separate threads. However, this lock can be released when you're doing long-running I/O operations. This is why it's very efficient to use a thread pool to talk to a database or to connect to multiple websites: you can do other stuff in Python while waiting for the result of a download, for instance.
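A minimal sketch of that I/O-bound pattern (the URLs are placeholders):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = ["https://www.python.org", "https://pypi.org"]  # placeholder URLs

def fetch(url):
    # the GIL is released while the thread waits on the network
    with urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

with ThreadPoolExecutor(max_workers=8) as ex:
    for url, size in ex.map(fetch, URLS):
        print(url, size, "bytes")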
The GIL can also be released explicitly by compiled extensions in third-party libraries, for instance NumPy, pandas, scikit-learn, pretty much any library written with Cython that explicitly releases the GIL. Then it's possible to have Python code that runs in parallel with NumPy in a different thread, for instance on different cores. So here is a time diagram, inspired by David Beazley's talks on the GIL.
You can see different Python threads running in parallel, and the arrows are pure Python code running. You can see that at any point in time the GIL is held by only one specific thread, so only that thread can run Python code and the others have to wait. Which means that if you use threads to run pure Python code, it's not quicker than sequential execution with one thread running everything, even if you're on a multi-core machine. However, if you do I/O-intensive stuff like talking to a database, whenever you're waiting on the database you release the GIL, so another thread can run some Python function. The same holds if you are calling into a native library that releases the GIL: for instance, if you do a large matrix-matrix multiplication in NumPy, the GIL is released during that time, so you can do other stuff in the meantime.
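A hedged sketch of this effect (sizes and worker counts are arbitrary; the overlap you actually observe also depends on how many threads the underlying BLAS library is itself using):

import numpy as np
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

def matmul(x, y):
    # NumPy releases the GIL inside the BLAS call, so two of these
    # can genuinely make progress at the same time in two threads
    return x @ y

with ThreadPoolExecutor(max_workers=2) as ex:
    tic = perf_counter()
    futures = [ex.submit(matmul, a, b) for _ in range(2)]
    results = [f.result() for f in futures]
    print(f"two overlapping matmuls: {perf_counter() - tic:.2f}s")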
The other way to do processing is to use a process worker instead of a thread. In that case you create a new Python interpreter for this specific worker, so the workers are completely independent of one another. This has some inconvenience: starting a new Python process can take more time, not only the system process itself but also the interpreter, and there is a higher memory overhead because the interpreter itself takes some space in memory. There is also a higher communication overhead, because you have to pickle and unpickle the arguments of the function calls and transfer them using system pipes or local Unix domain sockets. But then you don't have the GIL contention issue anymore, because the workers are completely independent of one another.
So if your code is embarrassingly parallel, you can really benefit from all the cores on your machine. However, you have to know that there are two ways to start a new interpreter for a worker, on POSIX at least (POSIX meaning Unix or macOS): it can be just a fork, or a spawn, where a spawn is a fork plus an exec. If you just use fork, the advantage, and it is the default behavior of multiprocessing, is that the spawning overhead is low, it's quite quick to do, and the interpreter is already warm because it already has a copy of all the modules that were previously imported in the main program. But there is an inconvenience: it's not a very correct way to do this from a Unix point of view, and it can trigger crashes in third-party multi-threaded programs. For instance, if you use XGBoost, which is based on OpenMP, or spaCy or OpenCV in the main program, and you then call a function in parallel using fork-based multiprocessing, the internal thread pool of XGBoost or OpenMP will be in an inconsistent state in the worker, and basically it will crash or freeze. This is really hard to debug and very confusing for the users, so this is why I would not recommend using fork to start the workers. The alternative is to use spawn. It's safe, but the inconvenience is that it's slower to start, typically a couple hundred milliseconds if you launch many workers, and you need to re-import the libraries in the workers because those are completely new interpreters.
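A hedged sketch of how the start method can be selected explicitly; note that the mp_context argument of ProcessPoolExecutor is only available in Python 3.7 and later:

import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

# "spawn" starts fresh interpreters (safer, slower to start);
# "fork" copies the parent process (fast, but can leave OpenMP-based
# libraries in an inconsistent state in the children)
ctx = mp.get_context("spawn")

def work(n):
    return n * n

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4, mp_context=ctx) as executor:
        print(list(executor.map(work, range(8))))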
So, as a summary: if you really want to run pure Python code in parallel, you cannot do that with threads; but if you call into NumPy or pandas or something else that releases the GIL, that's not necessarily an issue, and then you benefit from the fact that you have no communication overhead, it's POSIX-safe, and there is a low startup overhead. However, if you really want to run pure Python code in parallel, then you're forced to use processes. With fork you have this POSIX safety issue, you can trigger crashes, and you have the communication overhead of two processes talking to one another. The safest alternative is to use spawn, but then you have the spawning overhead: starting the new processes takes more time. This is why we introduced loky as a library, which makes it possible to hide this issue by keeping a pool of workers running in the background.
So you can reuse them and hide this startup overhead. Basically, we want to reuse the ProcessPoolExecutor and keep it as a global singleton of your Python program. The problem is that you need to make sure that it will never deadlock, and the default implementation of ProcessPoolExecutor, especially in older versions of Python, is quite fragile: when you submit something that is not picklable, you deadlock your program. You cannot catch the exception, it's just written on stderr, but your program is basically broken; you have to Ctrl-C and restart, and the ProcessPoolExecutor instance is unusable at this point, it can be completely corrupted. So in the loky project,
we derive from this base class, but we added more tests for very weird situations, where workers kill one another, or unpicklable stuff is returned as the result of an execution, to try and trigger all these possible weird scenarios, to detect them and to fix them. We plan to contribute the fixes to upstream Python, but they won't be available before 3.7 or 3.8, so in the meantime you can use loky. Furthermore, with loky you can reuse the executor as a global singleton. We have this get_reusable_executor function: when you create the executor, you can introspect it, there is an incremental identifier on it, and you can submit code and get results as with a regular executor. If later in your code you try to fetch the executor again, it returns exactly the same instance, so it's very cheap and very quick. And then, if you submit unsafe stuff, for instance an argument that is not picklable, you get an exception that you can catch; it's not a deadlock anymore. If you later refresh the executor, it builds a new executor that is in a working state. So it's very easy to work in a safe and quick manner with this.
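A minimal sketch of this usage, assuming loky's get_reusable_executor API; the exact exception type raised for unpicklable submissions may differ:

from loky import get_reusable_executor

# the first call spawns the workers; later calls return the same instance
executor = get_reusable_executor(max_workers=4, timeout=10)

future = executor.submit(sum, range(10))
print(future.result())  # 45

# submitting something that cannot be serialized raises an exception
# that you can catch, instead of silently deadlocking the pool
import threading
try:
    executor.submit(print, threading.Lock()).result()
except Exception as exc:
    print("caught:", type(exc).__name__)

# asking for the executor again returns an instance in a working state
executor = get_reusable_executor(max_workers=4, timeout=10)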
So finally, the last section of my talk is about thread-based parallelism for NumPy-related stuff, like array processing, vectors, matrices and so on. So what is BLAS? BLAS is basically the engine behind NumPy: many array operations implemented in NumPy are actually just wrappers around linear algebra routines that are implemented behind a standard API called BLAS, and there are two very efficient implementations that are very popular. One is open source and is called OpenBLAS; it's used by default by NumPy when you download the binary wheels from PyPI, at least under Linux, and if you use the conda binary packages from conda-forge, on any platform they use OpenBLAS. If you use the Anaconda distribution, it's using another implementation of BLAS which is even more efficient, but on Intel CPUs, and it's proprietary: Intel MKL.
For instance, if you compute the dot product of two square matrices, or of two rectangular arrays a and b, you will see by default, if you run top or htop, that all the CPU cores of your machine go to 100% usage in many cases. This is because under the hood those libraries do the thread-based parallelism by themselves; it's not even NumPy that does it, it's really the BLAS implementation. And you can control how many threads are going to be used (it's a bit small, sorry) by setting an environment variable that depends on the BLAS implementation you are using, so you have to know which one is being used under the hood. If you set OPENBLAS_NUM_THREADS=2 and then run your Python code, you can restrict the NumPy operations to a specific number of threads and leave some CPUs free for other stuff. For MKL you just have a different variable called MKL_NUM_THREADS; it's exactly the same behavior.
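For instance (setting the variables from Python before NumPy is imported; exporting them in the shell works just as well):

import os

# These must be set before NumPy (and hence the BLAS runtime) is imported.
os.environ["OPENBLAS_NUM_THREADS"] = "2"  # OpenBLAS builds of NumPy
os.environ["MKL_NUM_THREADS"] = "2"       # MKL builds (e.g. Anaconda)

import numpy as np

a = np.random.rand(4096, 4096)
b = np.random.rand(4096, 4096)
c = a @ b  # the BLAS-backed matrix product now uses at most 2 threads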
So it's interesting to know about this, and it's even more interesting to know about the performance behavior of using threads, or even processes, on multi-core machines. One way to analyze this behavior is to do what we call a roofline analysis. What can limit the performance of your process is the hardware: it can be the compute capabilities of your CPU, so you have a maximum CPU speed that depends on the CPU itself, for instance on the order of 100 gigaflops on a multi-core machine, or even higher than that. A flop is a floating-point operation, like adding two elements of an array or multiplying two elements of an array, and so on. But accessing data from the RAM to do computation on it is also limited, and you typically have a bandwidth limit of, let's say, 50 gigabytes per second on a big machine. Knowing whether it is the RAM or the CPU that limits you can be tricky, and one way to find out is to compute the arithmetic intensity of your workload,
which is the ratio between the number of floating-point operations you want to compute and the number of bytes you need to transfer from the RAM to do the actual computation. If we take the example of matrix-matrix multiplication with two matrices of dimensions (n, k) and (k, m), the number of floating-point operations you need to do is a cubic quantity, the product of all three dimensions. For the data transfer, it's the size of the two input matrices plus the size of the output matrix, which is a quadratic quantity with respect to the dimensions. So depending on the dimensions you can have an arithmetic intensity that grows linearly, and there is a particular case, when n is equal to 1 for instance, where you have the ratio of two quadratic quantities, which is a constant: in that case it's about one quarter, so it's always fixed and very low.
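A hedged back-of-the-envelope version of this calculation (float64 elements, counting one multiply and one add per inner-product step; the exact constants do not change the scaling argument):

def matmul_intensity(n, k, m, bytes_per_elem=8):
    flops = 2 * n * k * m                                    # cubic in the dimensions
    bytes_moved = (n * k + k * m + n * m) * bytes_per_elem   # quadratic
    return flops / bytes_moved

print(matmul_intensity(4096, 4096, 4096))  # large square matrices: compute-bound regime
print(matmul_intensity(1, 4096, 4096))     # vector-matrix product: ~0.25 flop/byte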
So let's have a look at a plot, a benchmark, with the arithmetic intensity on the x-axis and the performance, the speed of computation in gigaflops, on the y-axis. This is a bunch of matrix-matrix multiplications for different dimensions using NumPy and MKL. The blue dots are with MKL_NUM_THREADS=1, so I limit the number of threads available, and the orange dots are with two threads. This is on this laptop, so there are only two physical cores, and the axes are on a logarithmic scale. You can see that the maximum orange performance is around 40 gigaflops, whereas the maximum blue performance is around 20 or 21 gigaflops, so it's approximately a scale-up by a factor of 2, which is what is expected. We really benefit from the cores if we have a large arithmetic intensity. But if we have a low arithmetic intensity, for instance if we do a vector-matrix multiplication, then we are on the left-hand side of that plot, and you can see that we are limited by those slopes: this is the memory bandwidth, the communication with the memory, that is limiting the computation. We cannot really benefit from the high-end CPU that we have, because just talking to the memory limits the computation. I don't know exactly the theoretical limit of that machine, but you can see that on my Mac it's around 25 gigabytes per second.
So if your computation is simple enough, you can compute the arithmetic intensity of your program, and then, based on this kind of benchmark, you can see whether you are limited by the memory or by the CPU.
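A hedged sketch of the roofline model behind this reasoning, using made-up peak numbers of the same order as those mentioned in the talk:

def attainable_gflops(intensity, peak_gflops=100.0, peak_bandwidth_gb_s=25.0):
    # Roofline model: performance is capped either by the compute peak or by
    # (arithmetic intensity x memory bandwidth), whichever is lower.
    return min(peak_gflops, intensity * peak_bandwidth_gb_s)

for intensity in (0.25, 1, 4, 16, 64):
    print(intensity, attainable_gflops(intensity), "GFLOP/s")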
I did the same using OpenBLAS, and you can see that the peak performance is actually approximately the same as MKL. However, compared to MKL, OpenBLAS is less efficient for lower arithmetic-intensity workloads: this is MKL, this is OpenBLAS. So when people say that MKL is ten times faster than OpenBLAS, it's completely wrong if you are on the right-hand side; but OpenBLAS can be significantly slower if you're working with small matrices and doing matrix-matrix operations. In general, though, it's quite competitive. I also ran the same benchmark on a larger machine: blue is one core,
orange is five physical cores, and green is ten threads on ten physical cores. You can see that the scale-up on the right-hand side is perfect: we go from 10 to 50 to 100 gigaflops per second. But on the left-hand side, the memory bandwidth is the limit. And if I compute the speed-up of 5 over 1 and 10 over 1, so this is the speed-up on the y-axis this time, you can see that for low arithmetic-intensity workloads the maximum speed-up we observe is about 4, instead of 10 or instead of 5, whereas if you go to the right-hand side it can get close to 10 or close to 5. So it means that for low arithmetic-intensity stuff, the scalability of your code is really limited by the memory bandwidth first, and you really cannot benefit from multi-core in that case. Depending on what you do, sometimes you cannot even fully use your CPU, even with a very optimized implementation, so you have to be very aware of that. Multi-core scalability can be limited by I/O with the RAM for low arithmetic-intensity workloads. So, a bit of conclusion now.
The thread-based mechanism for multi-core computation can be very efficient if the performance-critical sections of your code are in compiled extensions that release the GIL. Usually this method has the lowest overhead, and you should definitely start with it and benchmark what kind of speed-up you get. If you don't get a good speed-up, then you can try process-based mechanisms, but I would start with threads first. And if you are doing array operations, make sure that the built-in mechanism of NumPy, using BLAS, is not already the best way to do it; most often it is going to be the best way, and there is no point in trying to do parallelism on top of that. If you want to do process-based parallelism, then you should definitely check out loky, to make sure you have a safe way to do it and you won't crash XGBoost or spaCy or OpenMP code, or at least be aware that this problem exists. Finally, loky is going to be integrated into joblib, which is another, simpler API to do embarrassingly parallel computation, and at some point it is going to be part of scikit-learn for training models in parallel in a safe way. So thank you very much for your attention.
Thank you for the very great presentation. We have a couple of minutes for questions.

Yeah, thanks, this was a great presentation. I want to ask about loky's executor manager. You said that the processes are staying alive; my question is, when do you decide to actually drop the worker processes? Because at some point you have to do this, right?

Yeah, actually I made a simplification: we keep them alive for like 10 or 30 seconds by default, and then they will automatically stop by themselves if they have nothing to do. So if you alternate parallel and sequential sections very quickly, you don't pay for the spawning overhead; but if you do parallel, then sequential for a long time, and then parallel again, then you restart the workers. That's not a problem, because you don't do it very often.

And does the user have more control over, like, killing them? I see you have keyword arguments if you want to set the worker timeout.

Yes, but usually there is no point in tweaking this; keeping the default should be fine.

Thanks for a nice talk. Do you have any experience with MKL on AMD processors? And if so, can you comment on speed?

I haven't run the benchmark, but I know that OpenBLAS has many specialized kernels for AMD, and the OpenBLAS developers follow the newest generations of the chips, even though maybe the very latest are not yet implemented; usually it's quite good. MKL, I don't know. It's probably reasonable, but Intel has no incentive to make AMD chips work faster.

Any other questions? Thank you so much.