asyncio: We Did It Wrong
Formal Metadata

Title: asyncio: We Did It Wrong
Number of Parts: 132
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/44928 (DOI)
Transcript: English(auto-generated)
00:06
Awesome. Hello. Welcome. I am Lynn Root. I work for Spotify, and basically I'm a site reliability engineer at Spotify, and what that means is I either break our entire service or get paged to fix it when other people do.
00:23
In actuality, what an SRE does at Spotify varies widely among different companies. It's a combination of back-end development, where my team and I run a few services that other engineers use daily, plus a little DevOps and sysadmin. I
00:40
am also our FOSS evangelist, and I help a lot of teams release their projects and tools under the Spotify GitHub organization. And lastly, I help lead PyLadies, which is a global mentorship group for women and friends in the Python community. And if you want stickers, I have stickers to give away; just find me afterwards.
01:01
So before I start, I want to warn you all that I will take all the time allotted for this talk, so there's going to be no time for Q&A. You might think I've purposely done this to avoid Q&A, but I will be here afterwards for the whole conference, so you can just find me to chat, and there will also be a link at the end
01:23
for the whole notebook and the example code that I use. All right, so let's get started. asyncio: the concurrent Python programmer's dream, the answer to everyone's asynchronous prayers, I guess. The asyncio module has various layers of abstraction, allowing developers as much control as they need and are comfortable with.
01:44
And we have simple hello-world examples, and they make it look so effortless, but it's easy to get lulled into this false sense of security. This isn't exactly that helpful, right? It's all fake news.
02:01
We're led to believe that we're able to do a lot with the async and await API layer. Some tutorials, while great for getting developers' toes wet, try to illustrate real-world examples, but they're just beefed-up hello-world examples. Some even misuse parts of asyncio's interface, allowing one to easily fall into the depths of callback hell.
02:23
And some get you up and running easily with asyncio, but then you might realize that it's not correct, or not exactly what you want, or it only gets you part of the way there. And while some tutorials and walkthroughs do a lot to improve upon the basic hello-world use case, sometimes they're just a basic web crawler.
02:44
And I don't know about you, but at Spotify I'm not building web crawlers. In general, asynchronous programming is just difficult, whether you use asyncio or Twisted or Tornado, or even Go or Erlang or Haskell. It's just difficult. And so my team at Spotify, which is just mostly me,
03:04
fell into this false sense of ease that the asyncio community builds. The past couple of services that we built, we felt, were good candidates for asyncio. One of them was a Chaos Monkey-like service for restarting instances at random, and another is an event-driven
03:23
hostname generation service for our DNS infrastructure. So sure, we needed to make a lot of HTTP requests that should be non-blocking. But these are services that needed to react to pub/sub events, to measure the progress of the actions initiated from those events,
03:42
handle any incomplete actions or external errors, deal with the whole pub/sub message lease management, measure service-level indicators and send metrics, and then we also needed to use some non-asyncio-friendly dependencies. So it got difficult quick.
04:01
So allow me to provide you with a real-world example that actually comes from the real world. If you get the pun, we're building a chaos monkey, and a mandrill is a monkey. We did build a service that does periodic restarts of our entire fleet of instances at Spotify,
04:21
and we're going to do that here. We're going to build a service called Mayhem Mandrill, which will listen for a pub/sub message and restart a host based off of that message. As we build the service, I'll point out traps that I may or may not have fallen into, and this will essentially become a resource that I would have liked about a year ago.
04:42
So at Spotify we do use a lot of Google products, in this case Google Pub/Sub, but there are a lot of choices out there, and we're just going to simulate a simple pub/sub-like technology with asyncio. This tutorial is quite easy and fun to read, I guess.
05:01
And this is where we're starting off: a very simple publisher, where we're creating a set number of instances and adding them to the queue, and then we're consuming that queue. It's very easy to run, especially with the latest Python 3.7 syntactic sugar. And so when we run this, we see that we were able to publish and consume messages.
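The slide code isn't captured in this transcript, so here is a minimal sketch of that starting point, assuming a plain `asyncio.Queue`; the `publish`/`consume` names and the fake host IDs are my reconstruction, not the talk's exact code:

```python
import asyncio
import random
import string

async def publish(queue, n):
    # Publish a set number of fake "restart this host" messages
    for _ in range(n):
        host_id = "".join(random.choices(string.ascii_lowercase, k=4))
        await queue.put(host_id)

async def consume(queue, n):
    # Pull the messages back off the queue
    consumed = []
    for _ in range(n):
        msg = await queue.get()
        consumed.append(msg)
    return consumed

async def main():
    queue = asyncio.Queue()
    await publish(queue, 5)
    return await consume(queue, 5)

# The Python 3.7+ "syntactic sugar": asyncio.run creates,
# runs, and closes the event loop for us
messages = asyncio.run(main())
```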
05:26
So we're going to work off this. As you might notice, little teaser: we haven't actually built a running service. We merely have a pipeline or a batch job right now.
05:41
So in order to continuously run, we have to use loop.run_forever. For this we have to schedule and create tasks out of coroutines and then start the loop, and since we created and started the loop, we should clean it up too. So when we run this updated code, we get this nice little traceback.
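A rough reconstruction of that step, using the `loop.run_forever` pattern with manual task creation and cleanup; the `call_later(0.1, loop.stop)` timer is my stand-in for the Ctrl-C interrupt so the sketch terminates on its own:

```python
import asyncio

consumed = []

async def publish(queue):
    # Publish forever; nothing ever signals "done"
    i = 0
    while True:
        await queue.put(f"msg-{i}")
        i += 1
        await asyncio.sleep(0)  # yield control back to the loop

async def consume(queue):
    while True:
        msg = await queue.get()
        consumed.append(msg)

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
queue = asyncio.Queue()

# Schedule tasks out of the coroutines, then start the loop
loop.create_task(publish(queue))
loop.create_task(consume(queue))
# Stand-in for Ctrl-C so the demo stops on its own
loop.call_later(0.1, loop.stop)
loop.run_forever()

# Since we created and started the loop, we should clean it up too
pending = asyncio.all_tasks(loop)
for task in pending:
    task.cancel()
loop.run_until_complete(asyncio.gather(*pending, return_exceptions=True))
loop.close()
```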
06:04
And then it kind of hangs, so we have to cancel it, we have to interrupt it. So yeah, that's nice and ugly, right? So we should probably try and clean that up. We should try and run a bit defensively, and we'll first address the exceptions that arise from coroutines.
06:22
So we'll just go ahead and fake an error. Oh, that did not come out well; hopefully you can still read that. We're going to fake an error, so the fourth message will be an error. If we run it as is, we do get an error line, and it says "exception was never retrieved". And so admittedly this is a part of the asyncio API
06:43
that's not very friendly. If this were synchronous code, we'd simply get the error that was raised, but here it gets swallowed up in an unretrieved task. So to deal with this, as advised by the docs, we'll need a wrapper around this coroutine to consume the exception and stop the loop.
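A sketch of that docs-advised wrapper, with a hypothetical `handle_exception` coroutine that retrieves the error and stops the loop; the `caught` list stands in for logging:

```python
import asyncio

caught = []

async def consume(queue):
    while True:
        msg = await queue.get()
        if msg == "fake-error":
            # Without a wrapper this surfaces only as
            # "Task exception was never retrieved" at exit
            raise Exception(f"could not consume {msg}")

async def handle_exception(coro, loop):
    # Top-level wrapper: retrieve the exception and stop the loop,
    # instead of letting it get swallowed in an unretrieved task
    try:
        await coro
    except Exception as e:
        caught.append(str(e))
        loop.stop()

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
queue = asyncio.Queue()
queue.put_nowait("fake-error")
loop.create_task(handle_exception(consume(queue), loop))
loop.run_forever()
loop.close()
```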
07:03
So we'll make a little top-level wrapper to handle the exceptions of the coroutines, and when we run our script, we get something a little bit cleaner. I'm going to stop right now and just quickly review. So far, setting up an asyncio service:
07:21
we want to surface the exceptions so that we can retrieve them, and clean up what we've created. I'll expand on both of these parts a bit later, but it's clean enough for now. So far we've
07:41
seen a lot of tutorials that make use of the async and await keywords, and while they're not blocking the event loop, we're still literally iterating through tasks serially, effectively not adding any concurrency. So if we take a look at our script now,
08:00
we're serially processing each item that we produce and then consume. Even if the event loop isn't blocked, and other tasks and coroutines going on wouldn't be blocked either, this might be obvious to some, but it isn't to all: we are blocking ourselves. We first produce all the messages one by one, and then we consume them one by one.
08:20
With the loops that we have within the publish and consume coroutines, we block ourselves from moving on to the next message while we await to do something. So this is technically a working example of a pub/sub-like queue with asyncio, but it's not really what we want. We're here to build an event-driven service, or maybe even a batch or
08:41
pipeline job; we're not really taking advantage of the concurrency that asyncio can provide. As an aside, I find asyncio's API actually quite user-friendly, despite what some people might think. It's very easy to get up and running with the event loop, and when first picking up concurrency, the async and await syntax makes it a very low hurdle to start using,
09:05
since it makes it very similar to writing synchronous code. But again, when picking up concurrency, this API is deceptive and misleading. Yes, we are using the event loop and primitives. Yes, it does work. Yes, it might seem faster,
09:21
but that's probably because you came from 2.7. Welcome to 2014, by the way. To illustrate that there's no difference from synchronous code, here is the same script with the asyncio primitives removed, using just synchronous code. You can see, just looking at the consumer, there's no real difference other than a couple of awaits.
09:44
And when we run it, it's pretty much the same; the only difference is actually the randomness part. So part of the problem could be that documentation and tutorial writers are presuming knowledge and the ability to extrapolate from oversimplified examples, but it's mainly because concurrency is just a difficult
10:01
paradigm to grasp in general. We write our code as we read anything: left to right, top to bottom. Most of us are just not used to the multitasking and context switching that our modern computers allow us. Hell, even if we are familiar with concurrent programming, understanding a concurrent system is just hard.
10:21
But we're not in over our heads yet, and we can still make this simulated chaos monkey service actually concurrent in a rather simple way. To reiterate our goal here: we want to build an event-driven service that consumes from pub/sub and processes the messages as they come in. We could get thousands of messages a second, and so as we get a message, we shouldn't block the handling of the next message we receive.
10:45
To help facilitate this, we will also need to build a service that actually runs forever. We're not going to have a preset number of messages; we'll need to react whenever we're told to restart an instance. And the triggering event to publish a restart request could be an on-demand request from a service owner,
11:05
or it could be a scheduled, gradually rolling restart of the fleet; you don't know. So we'll first mock the publisher to always be publishing restart message requests and therefore never indicate that it's done.
11:23
This also means that we're not providing a set number of messages to publish, so I had to rework this function here, just adding on the creation of a unique ID for each message that's produced. So when running it, it happily produces messages, but you might notice that there is that
11:41
KeyboardInterrupt exception triggered by the Ctrl-C, and we don't actually catch that, so we can quickly clean that up. This is just a band-aid, and I'll explain that further on, but now we see something much cleaner. It's probably hard to see why it's concurrent right now, so to help, we're going to add multiple producers to see the concurrency.
12:04
For that publish function, I'm going to add a publisher ID and have it in our log messages, and then create three publishers real quick. And then when we run, we can see that we have a bunch of publishers going concurrently. For the rest of the walkthrough I'm actually just going to remove all those multiple publishers;
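Roughly what that reworked publisher might look like: no preset message count, a `uuid` per message, and a publisher ID so the interleaving shows up in the logs (all names here are my reconstruction, and the cancellation in `main` is only there so the demo terminates):

```python
import asyncio
import uuid

published = []

async def publish(queue, publisher_id):
    # Publish forever: every message gets a unique ID, and the
    # publisher ID makes the concurrency visible in the logs
    while True:
        msg = {"id": str(uuid.uuid4()), "publisher": publisher_id}
        await queue.put(msg)
        published.append(msg)
        await asyncio.sleep(0.01)

async def main():
    queue = asyncio.Queue()
    # Three publishers interleaving on the one event loop
    tasks = [asyncio.create_task(publish(queue, i)) for i in range(3)]
    await asyncio.sleep(0.05)
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())
```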
12:23
I don't want to confuse anything. Now on to the consumer bit. The goal here is that we're constantly consuming messages from a queue and creating non-blocking work based off a newly consumed message, in this case to restart an instance. And the tricky part is that the consumer needs to be written in a way that the consumption of the
12:45
message from the queue is separate from the work that happens for that message. In other words, we have to simulate being event-driven by regularly pulling messages from a queue, since there's no way to trigger work based off of a new message available in that queue;
13:02
there's no way to be push-based. So we'll first mock the restart work that needs to happen whenever we consume a message, and we'll stick it in our while True loop: await the next message on the queue, pass it off to restart_host, and then we'll just add it to our loop. And
13:23
then when we run it, you see that messages are being pulled and hosts restarted. We may want to do more than one thing per message, though. For example, we might want to store the message in a database for potentially replaying later, as we initiate a restart of a given host.
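A minimal sketch of that consumer, with a mocked `restart_host` and a `while True` loop pulling from the queue; the bounded `main` and the cancellation are mine so the example terminates:

```python
import asyncio

restarted = []

async def restart_host(msg):
    # Mocked restart work
    await asyncio.sleep(0.01)
    restarted.append(msg)

async def consume(queue):
    # Await the next message, then pass it off to restart_host
    while True:
        msg = await queue.get()
        await restart_host(msg)

async def main():
    queue = asyncio.Queue()
    consumer = asyncio.create_task(consume(queue))
    for i in range(3):
        await queue.put(f"host-{i}")
    await asyncio.sleep(0.1)  # let the consumer drain the queue
    consumer.cancel()

asyncio.run(main())
```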
13:42
So within the consume function we could just add the await for both coroutines, and we'll see that it happens just fine: both are saved and restarted. But we still kind of block the consumption of the messages, and we don't necessarily need to await one coroutine after another.
14:02
These two tasks don't necessarily need to depend upon one another (completely sidestepping the potential concern of whether we should restart a host if we haven't saved it to the database; that's for another time), but we can treat them as such. So instead of awaiting them, we can just create tasks to have them scheduled on the loop, basically chucking them over to the loop for
14:22
it to execute when it can. And so now we have restart and save happening not necessarily serially, but whenever the loop can execute the coroutines. As an aside, sometimes you do want your work to happen serially: maybe you restart hosts
14:40
that have an uptime of more than seven days, or maybe you want to check the balance of an account before you debit it. Needing code to be serial, or having steps or dependencies, doesn't mean that you can't be asynchronous. The await on the last restart date will yield to the loop, but it doesn't mean that restart_host will be the next thing that the loop
15:01
executes; it just allows other things to happen outside that coroutine. And yes, I admit this was a thing that wasn't immediately apparent to me at first. So we've pulled the message from the queue and we've fanned out work based off of that message. We now need to perform any finalizing work on that message. For example, we might need to acknowledge the message
15:23
so it's not redelivered. We'll separate out the pulling of the message from creating work off of it, and then we can make use of asyncio.gather to add a callback. So when we run it, once both the save coroutine and the restart coroutine are complete, the cleanup will actually
15:47
be called, and that signifies that the message is done. However, I'm a bit allergic to callbacks, and perhaps we need the cleanup to be non-blocking, so then we can just await it.
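One way that callback-free version might look, with `asyncio.gather` fanning out the hypothetical `save` and `restart_host` coroutines, and the `cleanup` (the ack) simply awaited afterwards:

```python
import asyncio

log = []

async def restart_host(msg):
    await asyncio.sleep(0.01)  # mocked restart work
    log.append(f"restarted {msg}")

async def save(msg):
    await asyncio.sleep(0.01)  # mocked database write
    log.append(f"saved {msg}")

async def cleanup(msg):
    # e.g. acknowledge the message so it is not redelivered
    log.append(f"acked {msg}")

async def handle_message(msg):
    # Fan out the two independent pieces of work, and only ack the
    # message once both are complete, awaiting cleanup directly
    # instead of hanging it off a done-callback
    await asyncio.gather(save(msg), restart_host(msg))
    await cleanup(msg)

asyncio.run(handle_message("host-1"))
```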
16:04
Shoot, there we go. Now, much like Google Pub/Sub, let's say that the publisher will redeliver a message after 10 seconds if it has not yet been acknowledged, but we are able to extend that message deadline. In order to do that, we have to have a coroutine that in essence monitors all the other worker tasks.
16:22
So while we are continuing to do work, this coroutine will extend the message acknowledgement deadline. Then once we're done, we should stop extending the deadline and clean up the message. One approach is to make use of asyncio's Event primitive, where we can create an event, pass it to our extend
16:41
coroutine function, and then set it when we're done. And you can see that it's extending, and then it stops extending when the message is actually done. And if you really like events, you can make use of event.wait and move the cleanup outside, and so now we've got a little bit of concurrency going on.
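A sketch of that `asyncio.Event` approach: `extend` keeps renewing the deadline until the event is set, and `cleanup` uses `event.wait()` to ack only once the work is done (the timings and names are invented for the demo):

```python
import asyncio

log = []

async def extend(msg, event):
    # Keep extending the message's ack deadline until the work signals done
    while not event.is_set():
        log.append(f"extended deadline for {msg}")
        await asyncio.sleep(0.02)

async def cleanup(msg, event):
    # Wait until the work is done, then ack the message
    await event.wait()
    log.append(f"acked {msg}")

async def do_work(msg, event):
    await asyncio.sleep(0.05)  # the actual restart/save work
    event.set()                # signal: stop extending, start cleanup

async def handle_message(msg):
    event = asyncio.Event()
    await asyncio.gather(
        extend(msg, event), cleanup(msg, event), do_work(msg, event))

asyncio.run(handle_message("host-1"))
```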
17:05
To review real quick: asyncio is pretty easy to use, but that doesn't automatically mean you're using it correctly. You can't just throw async and await keywords around blocking code. It's actually a shift in your mental paradigm,
17:20
both with needing to think of what work can be farmed out for the loop to do its thing, and then what dependencies there are; your code might still need to be sequential. But having steps in your code, like first A and then B and then C, might seem like it's blocking when it's not.
17:40
Sequential code can still be asynchronous. For instance, I might have to call customer service at some point, but I'm going to be on hold for a while, so I can just put it on speakerphone and then go play with my super needy cat. So I might be single-threaded as a person, but I can definitely multitask like CPUs do.
18:01
So earlier we added try/except/finally around our main event loop code, although you probably want your service to gracefully shut down if it receives a signal of some sort: cleaning up open database connections, stopping consuming messages, and finishing responding to current requests
18:21
So if we happen to restart an instance of our own service we should clean up the mess before we exit out And so we've been catching this commonly known like keyboard interrupt exception like many other tutorials and in libraries But there are other signals that we should be aware of Typical ones are like SIG up and say quit in term
18:43
there are also SIGKILL and SIGSTOP, which we can't catch, block, or ignore. So if we run our current script as is and give it a TERM signal, we find ourselves not actually entering that finally clause, where we log and clean everything up.
19:04
So we basically have to be aware of where those exceptions happen. I also want to point out that even though we're only ever expecting KeyboardInterrupt, it could happen outside of catching the exception, potentially causing the service to end up in an incomplete or otherwise unknown state.
19:26
So instead of catching KeyboardInterrupt, let's attach a signal handler to the loop. First we'll define the shutdown behavior that we want: we want to simulate closing database connections and returning messages to pub/sub as not acknowledged,
19:40
so that they can be redelivered and not just dropped, and actually cancel tasks. We don't necessarily need to cancel pending tasks; we could just collect them and allow them to finish. It's up to what we want to do. We also might want to take this opportunity to flush any collected metrics so they're not lost. So we need to hook this up to our main event loop now.
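Roughly how hooking that shutdown into the loop might look. The `os.kill` timer is my stand-in for an external `kill -TERM`, and `shutdown` only simulates the database/pub-sub cleanup she describes; note this is Unix-only, since `loop.add_signal_handler` isn't available on Windows:

```python
import asyncio
import os
import signal

shutdown_log = []

async def shutdown(sig, loop):
    # Simulated graceful shutdown: close DB connections, nack unacked
    # pub/sub messages, flush metrics... then cancel outstanding tasks
    shutdown_log.append(f"received {sig.name}")
    tasks = [t for t in asyncio.all_tasks(loop)
             if t is not asyncio.current_task()]
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    loop.stop()

async def work_forever():
    while True:
        await asyncio.sleep(0.01)

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
for sig in (signal.SIGHUP, signal.SIGTERM, signal.SIGINT):
    # No more catching KeyboardInterrupt: the loop handles the signals
    loop.add_signal_handler(
        sig, lambda s=sig: loop.create_task(shutdown(s, loop)))
loop.create_task(work_forever())
# Stand-in for an external `kill -TERM <pid>` so the demo terminates
loop.call_later(0.05, os.kill, os.getpid(), signal.SIGTERM)
loop.run_forever()
loop.close()
```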
20:03
I also removed the KeyboardInterrupt catch, since it's now taken care of within the signal handling. And so we run this again and send it the TERM signal, and it looks like it cleaned up, but you see that we have this caught exception error twice. This is because waiting on cancelled tasks will raise asyncio.CancelledError, which is to be expected,
20:25
and we can add that to our little handle-exception wrapper as well. So if we run it, we actually see our coroutines being cancelled, and not just some random exception. So you might be wondering which signals you should care about.
20:42
Apparently there is no standard. Basically, you should be aware of how you're running your service and handle them accordingly. Also, as a heads-up, another misleading API in asyncio is shield. The docs say that it's a means to shield a future from cancellation.
21:03
I have a core dev right here. But if you have a coroutine that must not be cancelled during shutdown, shield will not help you. This is because the task that shield creates gets included in asyncio.all_tasks,
21:20
and therefore receives the cancellation signal just like the rest of the tasks. To help illustrate, I have a simple async function with a long sleep that we want to shield, and when we run it and cancel it before the 60 seconds, we see that we don't ever hit the "done" line and that it's immediately cancelled.
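A small demo of that shield gotcha, under the assumption that shutdown cancels everything in `asyncio.all_tasks()`, which includes the inner task that `shield()` created:

```python
import asyncio

log = []

async def cant_stop_me():
    # This must not be cancelled during shutdown... or so we hope
    await asyncio.sleep(60)
    log.append("done")  # never reached

async def main():
    shielded = asyncio.shield(cant_stop_me())
    await asyncio.sleep(0.01)
    # A shutdown handler typically cancels *all* tasks, and the task
    # that shield() created is in all_tasks() like any other
    for task in asyncio.all_tasks():
        if task is not asyncio.current_task():
            task.cancel()
    try:
        await shielded
    except asyncio.CancelledError:
        log.append("cancelled anyway")

asyncio.run(main())
```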
21:41
So, yeah, that's fun. TL;DR: we don't actually have nurseries in asyncio core to clean up after ourselves. It's up to us to be responsible and close up the connections and files that we open, respond to outstanding requests, and basically leave things how we found them.
22:01
So doing our cleanup in a finally clause isn't enough, since a signal could be sent outside of a try/except clause. As soon as we construct a loop, we should tell it how it should be deconstructed. That ensures that all of our bases are covered and we're not leaving any artifacts around.
22:23
And finally, we need to be aware of when our program should shut down, which is closely tied to how we run our program. If it's just a manual script, then SIGINT is fine; if it's a daemonized Docker container, then SIGTERM is probably more appropriate. You may have noticed that we're not actually catching exceptions within restart_host and save, just on the top level.
22:48
So to show you what I mean, we're going to fake an error where we can't restart a certain host. Running it, we see that a host can't be restarted, and while the service did not crash, and it did save to the database, it did not clean up or ack the message, and the
23:05
extend on the message deadline will also keep spinning, so we've effectively deadlocked on the message. A simple thing to do is to add return_exceptions=True within our asyncio.gather, so rather than completely dropping an exception, we can return it with our successful results.
23:22
However, you can't really see what actually errored out. So what we could do is add a callback, but as I said, I'm allergic, so we can just add a little helper function to process the results afterwards. And when we use something like this, we can see errors are now logged, and we can handle them appropriately.
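A sketch of that combination: `return_exceptions=True` so gather returns errors alongside successful results, plus a hypothetical `handle_results` helper in place of a callback:

```python
import asyncio

handled = []

async def restart_host(msg):
    if msg == "bad-host":
        # The faked error: one host we cannot restart
        raise RuntimeError(f"could not restart {msg}")
    return f"restarted {msg}"

def handle_results(results):
    # Small helper instead of a done-callback: log failures,
    # pass successes on
    for r in results:
        if isinstance(r, Exception):
            handled.append(f"error: {r}")
        else:
            handled.append(r)

async def main():
    results = await asyncio.gather(
        restart_host("host-1"),
        restart_host("bad-host"),
        restart_host("host-2"),
        return_exceptions=True,  # errors come back with the results
    )
    handle_results(results)

asyncio.run(main())
```

Note that gather's results come back in the order of its arguments, which is part of why it's easy to reason about.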
23:42
We can see errors are now logged and we can handle them appropriately So quick review and exceptions do not crash the system Unlike asyncio programs and they might not asyncio programs and they might go unnoticed and so we need to account for that and
24:02
I personally like using asyncio.gather because the order of the returned results is deterministic, but it's easy to get tripped up with it: by default it will swallow exceptions and happily continue working on the other tasks, and if an exception is never returned, weird behavior can happen, like spinning around an event.
24:22
All right, so I'm sure, folks, as you started using asyncio, you might have realized that asyncio, or async/await, starts infecting the rest of your code base. Everything needs to be async, and it's not necessarily a bad thing; it just forces a shift in perspective. So for our code to work with this, we need to maybe rework
24:43
our consumer. Not much is needed, actually. I'm still making use of an asynchronous consume coroutine to call a non-async consumer, using a thread pool executor to run that code. As an aside, there's actually a handy little package called asyncio-extras, which provides a decorator for asynchronous functions that removes the boilerplate for you, and you can just await the decorated function.
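That wrapper pattern might look something like this, with a stand-in blocking consumer and a named `ThreadPoolExecutor` handed to `loop.run_in_executor` (the names and the five-worker pool are assumptions from the talk's later thread-count discussion):

```python
import asyncio
import concurrent.futures
import time

consumed = []

def blocking_consume(msg):
    # A third-party, non-async consumer we cannot rewrite
    time.sleep(0.01)
    consumed.append(msg)

async def consume(executor, msg):
    # Thin async wrapper: farm the blocking call out to a thread pool
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(executor, blocking_consume, msg)

async def main():
    # Prefixing the thread names makes our threads easy to tell apart
    # from the ones a library like google-cloud-pubsub spins up
    executor = concurrent.futures.ThreadPoolExecutor(
        max_workers=5, thread_name_prefix="mayhem")
    await asyncio.gather(*(consume(executor, f"msg-{i}") for i in range(3)))
    executor.shutdown()

asyncio.run(main())
```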
25:04
Where it would remove the boilerplate for you and you can just await the decorated function But sometimes third-party code throws a wrench at you If you're lucky you're faced with a third-party library that is multi-threaded and blocking
25:21
So for example, Google Pub/Sub's Python library makes use of gRPC under the hood with threading, but it also blocks when opening up a subscription, and it requires a non-async callback for when a message is received. So in typical Google fashion, they have some uber-cool technologies and slightly difficult-to-work-with libraries.
25:43
And so the future that they return makes use of gRPC for bi-directional communication, and it removes the need for us to periodically poll for messages as well as manage message deadlines. To illustrate, we can use run_in_executor again.
26:00
I've made a little helper function to kick off the consumer and the publisher, and to prove that this is now non-blocking, I'm going to create a little dummy coroutine to run alongside run_pubsub. We'll add the two coroutine functions and update main, so it's just the run
26:23
function that we're running, and we can see that it's not blocking. But as I said, although it'll do a lot for us, there are a lot of threads in the background, like 15 threads, that the Google Pub/Sub library gives us. So I'm going to reuse that something_else coroutine to actually periodically get some stats on the threads
26:46
that are going on, and I've also prefixed our own thread pool executor so I can easily tell which ones I created versus what Google created. And when running this, you can see that Google creates a lot of threads.
27:01
We have the main thread, which is the asyncio event loop, and there are five threads from us, because we've given it five workers, and the rest of it is Google. So the current thread count is like 22. But all in all, the approach to threaded code isn't that different from non-async code,
27:23
until you realize that you have to call asynchronous code from a non-async function within that thread. Obviously we can't just ack a message once we receive it; we have to restart the required host and save the message in our database. So basically you have to call asynchronous code from a non-async function in a separate thread.
27:42
Pretty embarrassing, but bear with me; I've got like two minutes to run through this. We'll use the asyncio create_task call that we used earlier, and then we realize that, yes, of course, there's no event loop running. To get a little more color, here are some log lines showing that yes, indeed, no event loop is running in the thread.
28:05
Yeah, I can hear people saying "read the docs". But what if we gave it the loop that we're running in? It kind of works, but it's deceptive; we're just lucky here. Once we share an object between the threaded callback code and the asynchronous code, we've essentially shot ourselves in the foot.
28:25
And to show you that, I've created a global queue that the consumer will add to, and then we'll read off that queue with handle_message. And you see something funky now: nothing is ever being consumed from that global queue.
28:42
And if we add a line in that function, we can see the queue size gradually increasing. I'm sure a lot of you see what's going on here: we are not thread-safe. So let's make use of run_coroutine_threadsafe and see what happens. Yes, it finally fucking works.
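A minimal demo of the fix: a coroutine scheduled from a plain thread via `asyncio.run_coroutine_threadsafe`, which is the thread-safe way to hand work to a loop running in another thread (the thread and the sleep timing are mine, standing in for the Pub/Sub callback):

```python
import asyncio
import threading

saved = []

async def save(msg):
    # The async work we need to trigger from the threaded callback
    saved.append(msg)

def callback(msg, loop):
    # Runs in a non-async thread, like Google Pub/Sub's gRPC callback;
    # loop.create_task here would fail (no running loop in this thread),
    # and sharing objects across threads without this is not thread-safe
    asyncio.run_coroutine_threadsafe(save(msg), loop)

async def main():
    loop = asyncio.get_running_loop()
    t = threading.Thread(target=callback, args=("host-1", loop))
    t.start()
    t.join()
    await asyncio.sleep(0.05)  # give the scheduled coroutine a chance to run

asyncio.run(main())
```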
29:03
So in my opinion, it's not that difficult to work with synchronous code in asyncio. However, it is difficult to work with threads, particularly with asyncio, so if you must, use the thread-safe APIs that asyncio gives you, or you can just hide it away and try to ignore it.
29:23
So in essence, this talk is something that I would have liked to hear about a year ago. I'm speaking to past Lynn here, but hopefully there are others that benefit from this, from a use case that's not just a simple web crawler. Everything is up on that URL, so hopefully it's useful to folks. Thank you.