
Parallelism Shootout: threads vs asyncio vs multiple processes


Formal Metadata

Title: Parallelism Shootout: threads vs asyncio vs multiple processes
Part Number: 143
Number of Parts: 173
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Production Place: Bilbao, Euskadi, Spain

Content Metadata

Abstract
Shahriar Tajbakhsh - Parallelism Shootout: threads vs asyncio vs multiple processes You need to download data from lots and lots of URLs stored in a text file and then save them on your machine. Sure, you could write a loop and get each URL in sequence, but imagine that there are so many URLs that the sun may burn out before that loop is finished; or, you're just too impatient. For the sake of making this instructive, pretend you can only use one box. So, what do you do? Here are some typical solutions: Use a single process that creates lots of threads. Use many processes. Use a single process and a library like asyncio, gevent or eventlet to yield between coroutines when the OS blocks on IO. The talk will walk through the mechanics of each approach, and then show benchmarks of the three different approaches.
Transcript: English (auto-generated)
Okay, so parallelism shootout, threads, multiple processes, asyncio. Just to set the expectation before I start, this is not a deep dive into any of these, so definitely not some advanced talk, probably intermediate, maybe even inter-intermediate,
depending on how much you know about them. So my name is Shahriar, I'm a software engineer in London, working at Bospa, and I don't have a presentation, because it quit unexpectedly.
Today we're going to talk about parallelism, and the point of it is to take one problem, it will come on the slide eventually, and try and approach solving it using different techniques, threading, multiprocessing, or asyncio, and just get a feel for how each of them work.
That's me. We want to take this problem that we have, so we have lots of URLs in the file, and we want to download their content and store it on our machine. And the point of it is to use the threading, multiprocessing, asyncio libraries or modules separately,
and then firstly get a feel for the mechanics of how they work, and secondly be able to do a simple benchmark. Benchmarks make me nervous, especially for parallelism, so don't take it too seriously; it's just to give you an idea, or me an idea, of how they compare to each other.
So before we start, I'm just going to break down the problem into three main bits. So the first is that we're going to read the URLs from a file, the second part is to download it from the internet, and the third one is to store it on our machine.
But before we begin, just a quick reminder, who is familiar with IO-bound and CPU-bound types of computation? Excellent. Just a quick recap, CPU-bound are basically computations that are hungry for CPUs, so if you give them more CPU or faster CPU, they perform faster and they end quicker,
and IO-bound computations are ones where the time it takes for them to complete depends on how long they're waiting for IO. So you can give them a really fast CPU, but it won't make a difference, because they're blocking on IO. So to go back to our three original subproblems: reading URLs from a file is IO-bound, because it's disk access, we're reading it.
Downloading content, IO-bound, HTTP request, again we have to block and wait. And storing the content on our machine, again, we're writing to disk, so that's IO-bound too. And just as a random thing, generally a lot of things we do are IO-bound, for a weird definition of generally, but usually day-to-day tasks are IO-bound.
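The distinction can be sketched with two toy functions. This is my own illustration, not code from the talk; the sleep stands in for waiting on a network or disk.

```python
import time

def cpu_bound(n):
    # CPU-bound: a faster CPU makes this finish sooner.
    total = 0
    for i in range(n):
        total += i * i
    return total

def io_bound(delay):
    # IO-bound: the time is dominated by waiting (simulated here
    # with sleep); a faster CPU makes no difference.
    time.sleep(delay)
    return "done"
```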
Before we even parallelize though, I think it would be good to just quickly go through a sequential approach, and I think that would be a good baseline to compare how much parallelizing actually improves it, and how different methods give different improvements.
So it's a bit of a mouthful, but I've put the whole thing on there, because you can actually run this and it works. The interesting part is the highlighted bit: for the sequential approach, we go over the URLs. Those helper functions are just for convenience, so that I don't have to write a lot of things again, but they do what they say they do. So we go over the URLs, we get the content, and we store it on our machine.
But we do this sequentially, so we do one, we do the next one. And when we think about tasks, or when I think about tasks, and if I want to make them faster, I would have to think about, so how does this look on my CPU over time? So the way this looks is that it's only running on one of the cores, let's say we have two cores.
As far as I'm concerned, second core is skiving, because it's doing its own stuff, but as far as my task is concerned, it's not doing anything. So I'm doing a bit of work, getting the URL, downloading it, storing it on a machine, and doing the next one, it just continues like that. But to even be more accurate, what's actually happening is that we're doing a tiny fraction of CPU work,
then we're doing nothing, which is the dotted lines, because we're blocking for IO. CPU is not doing anything, and then we do a tiny amount more CPU work. But in reality, this is not actually to scale. So if I was to show it to scale, the bit where the CPU for this particular task is actually engaged is very, very small.
So this is a proper IO bound task. And just to show how this works over multiple URLs, so we have one URL takes a tiny amount of time, 30 URLs takes a lot more time. By the way, in the beginning I said the problem statement is that we have lots and lots of URLs.
I've just used 30 in this case, because it was much easier to run it multiple times. But imagine this over a gazillion URLs. It's not going to happen with sequential approach. But it just predictably goes up linearly.
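As a sketch of that sequential baseline: the helper names and the fake fetch below are my own stand-ins (a real version would download with something like urllib.request.urlopen), not the exact code from the slide.

```python
def fetch(url):
    # Stand-in for a blocking HTTP download.
    return f"<content of {url}>"

def download_all_sequential(urls):
    storage = {}
    for url in urls:              # one URL at a time
        storage[url] = fetch(url) # block on "IO", then store
    return storage

# Total time grows linearly: 30 URLs take roughly
# 30 times as long as one.
```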
So threading is the first method we're going to use. Threads in Python are actual real threads. Also, there's no controversy in this: I'm not going to talk about the global interpreter lock, or fix it; it's just there. That's not going to happen. But just so we know, threads are actual pthreads, or Windows threads, or whatever.
They're real threads, right? And a quick recap on how to make them. Is everyone familiar with how to use threads? Okay, fair enough. So quickly, we can make them two ways: either subclass threading.Thread and override the run method, or just have a function and use the normal Thread class, passing the function as the target, and let it do the work. To run the threads, you just call the start method, not the run method. Call start, and it goes and does its stuff. And it stops when your actual function, so in the left case, the run method,
and in the right one, the actual do_work function. The thread stops when that function has reached the end. But what if that function never reaches the end? What if we have a while True in it, because we want it to constantly work? Then we have daemonic threads. So we pass daemon=True to the constructor, and that tells the thread: you will stop whenever the main thread stops. If we don't, the interpreter will hang, because the main thread has stopped but everyone else is still running, which is confusing. The threading code, again, this is the full code; I don't usually like putting lots of code on a slide, but this is it.
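A sketch of the two creation styles and the daemon flag just described; the worker bodies here are placeholders of my own.

```python
import threading

results = []

# Way 1: subclass Thread and override run().
class Worker(threading.Thread):
    def run(self):
        results.append("subclassed")

# Way 2: pass a plain function as the target.
def do_work():
    results.append("target")

t1 = Worker()
t2 = threading.Thread(target=do_work)
# daemon=True: this thread dies when the main thread exits,
# which is what you want for a `while True` worker.
t3 = threading.Thread(target=do_work, daemon=True)

for t in (t1, t2, t3):
    t.start()   # call start(), not run(); run() would execute inline
for t in (t1, t2, t3):
    t.join()
```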
So I thought it would be cool to go through it one by one. First and foremost, we add URLs to the queue. I didn't mention the queue: we need the queue so that different threads can talk to each other. Well, not talk to each other, actually; different threads can use it to get what they want to do next. Again, Python just gives this to you.
Most of you probably know this, so let's say we don't have to worry about it. You just create it at the top: unvisited_urls. First thing I do, I go and add the URLs to the queue so that our threads can then consume from it. You get an interesting case if you do that in a separate thread, too. You don't want to do that, because your queue might never get filled,
and threads might read from it, and then they think it's empty and your program ends, but it's not actually empty. So then we have a number of workers. We go through them, and for each of them we create one thread and give it the target function, which is visit_urls.
Basically it does what the sequential version did. It literally does that, except it gets the URL from the queue and then marks the task done, which is what you do on a queue to say a task is finished. We create the worker threads and we start them. That does the actual work. That's it. To go back to how my CPU and my time is looking,
this would look something like that. This is not accurate, but if we had three threads, then you would have three threads. Once one of them is done with CPU and is waiting, we can go to the next one.
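Putting the pieces together, a minimal sketch of the queue-based threaded downloader. The fake fetch again stands in for the real blocking download, and the function names only approximate the slide's.

```python
import queue
import threading

def fetch(url):
    # Stand-in for a blocking HTTP download.
    return f"<content of {url}>"

def download_all_threaded(urls, num_workers=4):
    unvisited_urls = queue.Queue()
    storage = {}
    lock = threading.Lock()

    # Fill the queue *before* starting workers, so no worker sees
    # an empty queue and exits before the work has been added.
    for url in urls:
        unvisited_urls.put(url)

    def visit_urls():
        while True:
            try:
                url = unvisited_urls.get_nowait()
            except queue.Empty:
                return                    # no work left
            content = fetch(url)          # block on "IO"
            with lock:
                storage[url] = content
            unvisited_urls.task_done()    # mark this task done

    workers = [threading.Thread(target=visit_urls) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return storage
```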
But I'm being very vague here: someone decides which thread is going to move on. In the same amount of time, we make better use of our resources. We do a lot more work. The yellow thing there is the global interpreter lock,
which we shall not talk about any more after this. But that's just to say the lock makes sure there's only one thread being run on a core at a particular time. Again, our second core is skiving. It's doing nothing. To look at the speed and the performance of this approach,
this is how it works. The x-axis, we have a number of threads. If we have one thread, it takes ages. It should probably even take more than the sequential version because there's a bit of overhead. By one thread, I mean not the main thread. I mean one extra thread created after the main thread.
We'll see that as we create more and more threads, this goes down. However, it does flatten out after, I don't know, in this case maybe between 11 and 17 threads. You're not really getting any more advantage, and that makes sense, because by the time
the 17th thread comes up, there might not even be any things left for it to do. But this is only for 30 URLs. If we had a gazillion URLs, then that would flatten out a bit later. And so, okay, this is good. We have reduced the time. Probably, I think, sequential took about 30 or so seconds.
We've gone down to, what, at a good case, about five seconds, so that's okay. But we want to try multiprocessing now, see how that would perform. Multiprocessing, again, I assume most people are familiar. Hands up. Yay.
Yay. Um, with multiprocessing you spawn processes, actual processes, right, so they can just run on separate cores. And the cool part about it is that the API is very, very similar to threading, as in very similar. And it sidesteps the interpreter lock.
I won't mention that again; I promise this is the last time. And it's really easy to change our threading example to be a multiprocessing example. To do that, this is the exact threading code. Only the highlighted lines are changed, right? So instead of getting the queue from threading,
we get it from the multiprocessing module, and instead of a thread, we create a process. That's it. Everything else is the same. This is beautiful, right? So I just changed that in five seconds. Um, but the multiprocessing also gives something else, amongst many other things,
something that I'm going to talk about here, which is the Pool object. And the Pool object is a way to parallelize execution of a function over a number of arguments. Right, so what this allows us to do is, instead of changing our threading kind of code to use multiprocessing,
we can even change our sequential code to use multiprocessing. And again, this is the sequential code, as I showed you on the first slide. All you change is that you read the URLs in advance, and you create a pool, and that pool has a map method. What map takes is a function and a list of arguments. Well, not a list of arguments for one call to that function: every time it calls the function, it gives it one of those items in the list and says, do your thing. And you can give it the number of worker processes that you expect. So, to go back to the time and CPU kind of usage,
if we had two processes, this is hopefully how it would look, assuming that they would actually get scheduled to run on two separate cores. But the idea is that this should, multiprocessing should allow you to sidestep and be able to run it properly in parallel,
so true parallelism, hopefully. And if we had more and more processes, then this is not accurate, but this is how it would look. It's like having two of those threading things. It looks the same as the other graph.
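A sketch of the Pool approach, with the fake fetch again standing in for the real download. The thread-to-process swap mentioned earlier is mechanical (multiprocessing.Queue for queue.Queue, Process for Thread); the Pool variant below instead parallelizes the sequential loop. Note that Pool pickles the function and its arguments, so fetch must be defined at module level.

```python
from multiprocessing import Pool

def fetch(url):
    # Runs in a worker process; stand-in for a real download.
    return f"<content of {url}>"

def download_all_pool(urls, num_workers=4):
    # map() calls fetch once per URL, spreading the calls across
    # the worker processes: the sequential loop, parallelized.
    with Pool(num_workers) as pool:
        contents = pool.map(fetch, urls)
    return dict(zip(urls, contents))
```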
But again, this is not exactly accurate, because it's processes. But you get much more work done, and more so if you have lots more cores, too. And the performance is very similar to threading in terms of how the time drops as you add workers, but first and foremost,
if you have one single process, it takes longer than both sequential and threading, because the overhead of creating the process is a lot. One thread was only a bit slower than sequential; compared to this, that overhead is nothing. But again, you get a healthy drop. This was again for 30 URLs, so after a certain point,
it's diminishing returns, it's not really doing much. So that's cool, too. But asyncio, right? I think asyncio is to Python, as big data is to middle management. I don't know, I think it's kind of, woo, asyncio.
It's a new module in Python 3.4, and it gives you the infrastructure for writing single-threaded code concurrently. It is meant to be quite low-level, and the point is that you can use other stuff like Tornado and Twisted on top of it.
I don't in this presentation, but it is quite low-level, and it's fairly compatible with everything else, well, except it's Python 3.4, mainly. And I'm just gonna, is anyone familiar with asyncio? Cool. So I'm gonna go through, asyncio has a lot of concepts.
I'm going to go through two of them, just because I think they're the most important ones, and also those are the ones that I'll be using in the code later on. One of them is that we have coroutines,
and coroutines are basically functions that can pause their execution in the middle of what they're doing, return, you know, something else does its work, and then you can go back to that function and carry on. So this should immediately remind you of yield, basically. It's like a generator, right, because it keeps its state,
you do something, it yields, you do something else, but then if you go back to it, then it continues from where it was and it keeps its state. These are what coroutines are. And the way they are used is that if you have three separate functions and you want to run them in a row, you run one, then you run the next one, then you run the third one, whereas with coroutines, you can say,
okay, I'm gonna run the first one until it needs to block. When it needs to block, well, you can yield. I can do my own stuff. I can run the second function, and then it does it the same way. I think this demonstrated well, where it does it in a row of three separate functions, but if you take the case of blue, it suspends because it's blocking,
so it gives a chance for other things to run, but they also suspend, too, halfway through. And when blue carries on, it's just making progress. It's not that it's starting again. It's just now it's stopped blocking. It's ready to go again so it can make progress. And also notice that these are not run in the same order,
round one and stuff. The order changes, so someone needs to keep track of how are the schedules and generally just keep track of all these co-routines going around, and that's where the event loop comes in. The event loop is in charge of keeping track of the co-routines. That's mainly the thing it does, and deciding which one's gonna go next.
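The suspend-and-resume behaviour described above is exactly what generators already do; a tiny sketch:

```python
def counter():
    # State (n) survives across yields: each next() resumes the
    # function right after the yield it paused on.
    n = 0
    while True:
        n += 1
        yield n

c = counter()
first = next(c)    # runs up to the first yield
second = next(c)   # resumes after that yield, with n still in scope
```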
I ran through this much quicker than I did last night when I tried this out. Okay, anyway. So the code for using asyncio looks like this. yield from is new.
We don't have time to go deep into it, but yield from basically gives you a two-way channel of communication. Usually when you just do a yield, like a generator, it just returns something, whereas yield from allows you to kind of refactor
a generator out of your generator. It sounds weird; it probably doesn't make sense, so don't worry about it. Delegation. Yeah, delegation. It does that, too. So just to walk through what's happening here, first we create all our coroutines.
do_work is a coroutine, basically, you know, a function that can suspend halfway through, blah, blah, blah. We first create all of them with all our URLs. Then we need an event loop, so from asyncio we can get an event loop, and then the run_until_complete method allows you to pass it a bunch of coroutines,
or futures, or whatever; in this case, coroutines, and it will run all of them until they're complete. I put an asyncio.wait in there, because I want to actually wait for everything to be completed first. The way do_work works is that we first need to get the content of the URL.
At this point, this is fairly IO-heavy. So it yields from get_url_content, which again has a lot of blocking in it. So while we're waiting for that to happen, we can just go back and run the next task,
and that would be okay. By the way, there are a lot of different ways of writing this; I was trying to make the shortest possible one so I could fit all of it in one slide, but this is kind of how it works. And then get_url_content has yields too, so halfway through, if it's blocking, other stuff can carry on and do their work.
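The structure can be sketched as follows. The talk used Python 3.4's yield from style; this sketch uses the modern async/await syntax (mentioned later in the Q&A) plus asyncio.run and asyncio.gather, and asyncio.sleep stands in for the non-blocking download.

```python
import asyncio

async def get_url_content(url):
    # Stand-in for a non-blocking download: while this coroutine
    # awaits, the event loop runs the other coroutines.
    await asyncio.sleep(0.01)
    return f"<content of {url}>"

async def do_work(url, storage):
    content = await get_url_content(url)
    storage[url] = content

async def main(urls):
    storage = {}
    # Schedule one coroutine per URL and wait for all of them.
    await asyncio.gather(*(do_work(u, storage) for u in urls))
    return storage

urls = [f"http://x/{i}" for i in range(5)]
storage = asyncio.run(main(urls))
```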
And the performance of this looks pretty cool. So with a number of URLs, it's pretty quick, right? And I think what's really cool about it is that the kind of line that it increases as you add more URLs is less steep than it was, let's say, in sequential case.
So this is quite promising, drum roll, for a winner. I'm just gonna put all the four different approaches that I used, well, sequential, I'm counting that as well,
next to each other to see how they performed, for 30 URLs again, which is not a great benchmark, but that's the whole point of doing stuff like this. But we can see sequential is just not gonna happen. And threading, multiprocessing and asyncio are all fairly good. I tried running this on lots and lots more URLs,
and async IO did outperform properly in this case. But again, you have to take them with a pinch of salt because they are very dependent on the task that you're doing. For IO-bound tasks, you can use threading, you can use async IO, and that's fine. But if this was a CPU-bound task,
threading wouldn't stand a chance because of, you know. And asyncio wouldn't do that well either, as far as I understand it. So multiprocessing would be the answer. So this, I'm tempted to conclude here,
but I don't like making conclusions when it comes to parallelism. I think the whole point of it is every single task, every time I've come across a separate task, it's just different. You have to look at how is it, is it IO-bound, is it CPU-bound, but how IO-bound is it actually? So I can't make a conclusion and say,
well, use coroutines always, or async IO, or whatever. I think you have to be pragmatic about the task that you have at hand and just play with it a bit to see which one works well. So I'm definitely not gonna say, I wouldn't use this slide to say, ooh, async IO is much better.
No, it's not. It really depends on what you're doing and the type of computation you're doing. So this was meant to be a half-an-hour talk. I don't know why it's 20 minutes and 48 seconds, but this is it. Sorry to disappoint.
Just to waste another minute. Please, if you want to give me feedback, other than your talk was too quick, anything else, please get in touch.
We can talk about it. If you want to try other stuff with my code and some other resources, I've put together some great links and videos of stuff that will be on that URL on GitHub right after this talk. I'll do it when I get out there. It's there, I'll just have to make it public. Yeah, this is it really.
Q and A. We have a bit of time though. We have time for like 4,000 questions. So did you ever do something crazy, like combine these techniques: have multiple processes that run threads and use asyncio in the threads? So the idea for this talk,
initially when I proposed it, I was to, at the end, do something like that. But then I didn't have time. Yeah, so I didn't realize I would have like 12 minutes of doing crazy stuff. So no, I didn't. Sorry.
Hi, thanks for the talk. It was very interesting, quite concise. There's something that really puzzles me. Can you please show again the slides with the threading of the multiprocessing time? I really don't understand why when you have one thread,
it takes, we'll see the number. Sorry, one sec. Is this one okay? You want the diagram? The times, please. Let's check this out. The next one, I think. Oh, so whichever the process is, the next one is. Oh, yeah. This one, yeah. I really don't understand why we have one process.
It takes more than 30 seconds. Oh, yeah. So, what did I do wrong? No, no, no, this is one process for 30 URLs, right? I mean, so in the sequential version, if you want to download 30 URLs,
this is basically sequential. So each URL takes roughly about one second just under. Does that make sense? You scared me there for a second. I thought I got my axes wrong. Okay, thank you.
Guys, we can talk about life, too, if you've run out of questions, because we have another 15 minutes. About life, the universe and everything, 42, but besides that, what about gevent and green threads? So you can use stuff like gevent on top of asyncio, which is pretty cool. I haven't done it, so you could definitely use it.
You could have tornado, twisted, everything. I think the way asyncio was designed was for other frameworks like that to be able to build on top of it. That's why it's quite low level. But I don't have any chart performances for it.
What I can do, however, if you're interested, if anyone else is, I can do that, add it to slides, and when I put the slides online, it could be included. Is that okay? That would be great. Okay, yeah. Gevent does not run on top of asyncio. There we go. It's its own event loop. It's completely different.
I'm wrong. Okay, well, yeah, I have to look into that again. Yeah, it's not, right? But I was under the impression you could use them together.
I was just saying that you said gevent, tornado, and twisted. So gevent does not fit in there, but tornado and twisted definitely do; they can run on top of asyncio. But gevent is really its own thing, at a different level, and has its own way of getting it to work.
Correction, you can't run gevent on top of asyncio. Ah, okay, good to know. I just wanted to make sure everyone's on the same page here. So, yeah. Good, you're paying attention at least. Not that I have the miper anyway. So one thing to add, you presented yield from as the way to do delegation in asyncio.
So Python 3.5 will have a new syntax for that. It was changed because there were problems with yield from and mixing it with generators: plain generators wouldn't work but coroutines would, and so that was all a bit of a mess.
And so there's new syntax now: coroutines are defined with async def. So there's new syntax for defining coroutines, and then inside such a coroutine you would use await instead of yield from. So I had, yeah, there's actually a post,
it's in the resources, with Guido's response on why he chose yield from and how that came about. It's a good read. He just goes through why he chose yield from and not await. And then, as you mentioned, guys, we have at least 20 more minutes.
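The syntax change discussed above, side by side. The old form is shown only as a comment, since @asyncio.coroutine and yield from delegation to coroutines were removed in recent Python versions.

```python
import asyncio

# Python 3.4 style (now removed):
#
#     @asyncio.coroutine
#     def get_data():
#         yield from asyncio.sleep(0.01)
#         return 42

# Python 3.5+ style: async def marks the coroutine and await
# replaces yield from, so coroutines can no longer be confused
# with plain generators.
async def get_data():
    await asyncio.sleep(0.01)
    return 42

result = asyncio.run(get_data())
```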
Yeah, thank you. Just something really quick. What about memory overhead? Because I've heard processes, or even threads, can take much more memory than asyncio, for example. Yeah, they can. Thank you. So, but again, I don't wanna,
I really get nervous when I have to do a conclusion like that, though. Yes, they do, but I think you have to look at your tasks too, to see what it does. But yeah, I mean, they do. About the overheads, did you happen to get a chance of testing something, like, say, a four-core hyper-threading processor
for the multiprocessing? I didn't. Maybe I should. Might be interesting to see. I feel like I've disappointed a lot of people today by not doing all this. Guys, time is finite. Yeah. If I can do it, do you know what? That's great. I wanna put this code up, well, it is up,
I wanna make it public, and, like, just feel free to contribute, and next year I can do this same talk and it can take longer. Any more questions? Lunchtime.