
asyncio: We Did It Wrong

Video in TIB AV-Portal: asyncio: We Did It Wrong

Formal Metadata

Title
asyncio: We Did It Wrong

Alternative Title
asyncio in Practice: We Did It Wrong

Title of Series

License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Release Date

Content Metadata

Subject Area
Abstract
This talk is aimed at those who have at least intermediate experience in Python and have played around with asynchronous Python using asyncio or other libraries. I want the audience to learn from my mistakes! For instance: how easy it is to get into "callback hell" (and how to avoid or get out of it), how to screw up thread safety and deadlock yourself, and how to make code async but not actually concurrent. I'll talk through some anti-patterns and best practices that I learned the hard way. This includes proper concurrency, calling coroutines from synchronous code, working with threads and thread safety, properly shutting down an async program, and hidden "gotchas".
Transcript

Hello and welcome! I'm Lynn Root. I work for Spotify, where I'm a site reliability engineer — which means I either break our entire service, or I get paged to fix it when other people do. In actuality, what an SRE does at Spotify varies widely among companies and teams. For me it's a combination of backend development — my team and I run a few services that other engineers use daily — plus a little DevOps and sysadmin. I'm also a FOSS evangelist: I help a lot of teams release their projects and tools under the Spotify GitHub organization. And lastly, I help lead PyLadies, a global mentorship group for women and friends in the Python community. If you want stickers, I have stickers to give away — just find me afterwards.

Before I start, I want to warn you that I will take all the time allotted for this talk, so there will be no time for Q&A. You might think I've done this on purpose to avoid Q&A, but I'll be here for the whole conference, so you can just find me to chat. There will also be a link at the end to the full notebook and the example code I use.

All right, let's get started. asyncio: the concurrent Python programmer's dream, the answer to everyone's asynchronous prayers. The asyncio module has various layers of abstraction, allowing developers as much control as they need and are comfortable with. Simple "hello world" examples make it look so effortless, and it's easy to get lulled into a false sense of security — but that isn't exactly helpful. It's all fake news: we're led to believe that we're able to do a lot with the async/await API layer. Some tutorials, while great for getting developers' toes wet, try to illustrate real-world examples but are really just beefed-up hello worlds. Some even misuse parts of asyncio's interface, allowing one to easily fall into the depths of callback hell. And some get you up and running easily, but then you might realize that the result isn't correct, or isn't exactly what you want, or only gets you part of the way there. Even the better walkthroughs are often just a basic web crawler — and I don't know about you, but at Spotify I'm not building web crawlers.

In general, asynchronous programming is just difficult, whether you use asyncio, Twisted, Tornado, or even Go, Erlang, or Haskell. And so my team at Spotify — which is mostly just me — fell into this false sense of ease that the asyncio community builds. The past couple of services we built felt like good candidates for asyncio: one was a Chaos Monkey-like service for restarting instances at random, and another an event-driven hostname-generation service for our DNS infrastructure. Sure, we needed to make a lot of HTTP requests that should be non-blocking — but these are services that needed to react to pub/sub events, measure the progress of the actions initiated from those events, handle incomplete actions and external errors, deal with pub/sub message lease management, measure service-level indicators, and send metrics. We also needed to use some non-asyncio-friendly dependencies. It got difficult quickly.

So allow me to provide you with a real-world example that actually comes from the real world — if you get the pun: a mandrill is a monkey, and we did build a chaos-monkey-style service that does periodic restarts of our entire fleet of instances at Spotify. We're going to build that here: a service called Mayhem Mandrill, which listens for a pub/sub message and restarts a host based off of that message. As we build this service, I'll point out traps that I may or may not have fallen into, and this will essentially become the resource I would have liked about a year ago. At Spotify we use a lot of Google products, in this case Google Pub/Sub, but there are a lot of choices out there, and we're just going to simulate a simple pub/sub-like technology with asyncio. We're starting off with a very simple publisher and consumer.
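A minimal sketch of that starting point — my own reconstruction, not the talk's exact code; the `cattle-` hostnames and message count are illustrative — might look like this, using Python 3.7's `asyncio.run`:

```python
import asyncio
import random
import string

async def publish(queue, n):
    """Publish n fake 'restart this host' messages onto the queue."""
    for _ in range(n):
        host = "".join(random.choices(string.ascii_lowercase, k=4))
        await queue.put(f"cattle-{host}")

async def consume(queue):
    """Drain the queue, pretending to restart each host."""
    consumed = []
    while not queue.empty():
        msg = await queue.get()
        consumed.append(msg)  # in the real service: restart the host
    return consumed

async def main():
    queue = asyncio.Queue()
    await publish(queue, 5)
    return await consume(queue)

hosts = asyncio.run(main())
```

Note that this publishes everything and then consumes everything — exactly the "pipeline, not a service" limitation discussed next.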
We're creating a set number of messages, adding them to the queue, and then consuming that queue. It's very easy to run, especially with the latest Python 3.7 syntactic sugar, and when we run it we see that we're able to publish and consume messages. We're going to work off of this. As a little teaser, though: we haven't actually built a running service — we've merely built a pipeline or a batch job. In order to run continuously, we have to use loop.run_forever(); for this, we have to schedule and create tasks out of coroutines, which then start on the loop. And since we created and started the loop, we should clean it up too. When we run this updated code, we get a nice little traceback — and then it kind of hangs, so we have to interrupt it. That's nice and ugly, right? So we should try to clean that up and run things a bit more defensively.

We'll first address exceptions raised from coroutines. We'll fake an error — say the fourth message fails. If we run it as is, we get an error line saying the exception "was never retrieved". Admittedly, this is a part of the asyncio API that's not very friendly: if this were synchronous code, we'd simply get the error that was raised, but here it gets swallowed up in an unretrieved task. To deal with this, as advised by the docs, we need a wrapper around the coroutine to consume the exception and stop the loop. So we'll make a little top-level wrapper to handle the exceptions of our coroutines, and when we run our script again, it's something a little bit cleaner.

A quick review so far: when setting up an asyncio service, you want to surface the exceptions so that you can retrieve them, and then clean up what you've created. I'll expand on both of these parts a bit later, but it's clean enough for now.

Now, so far we've seen a lot of tutorials that make use of the async and await keywords. While they're not blocking the event loop, they're still literally iterating through tasks serially — effectively not adding any concurrency. If we take a look at our script, we're serially processing each item that we produce and then consume. Even if the event loop isn't blocked — there could be other tasks and coroutines going on, and they of course wouldn't be blocked — this might be obvious to some, but it wasn't to me: we are blocking ourselves. We first produce all the messages, one by one; then we consume them, one by one. The loops we have within the publish and consume coroutines block us from moving on to the next message while we await to do something. So this is technically a working example of a pub/sub-like queue, but it's not really what we want. We're here to build an event-driven service — or maybe even a batch pipeline job — and we're not really taking advantage of the concurrency that asyncio can provide.

As an aside, I find asyncio's API actually quite user-friendly, despite what some people might think. It's very easy to get up and running with the event loop when first picking up concurrency, and the async/await syntax is a very low hurdle to start using, since it looks so similar to writing synchronous code. But again, it's a facade of concurrency: this API is deceptive and misleading. Yes, we are using the event loop and its primitives; yes, it does work; yes, it might seem faster — but that's probably because you came from Python 2.7 (welcome to 2014, by the way). To illustrate that there's no real difference here from synchronous code: this is the same mock script with the asyncio primitives removed, using just synchronous code. Looking at the consumer, there's no real difference other than a couple of awaits, and when we run it, it behaves pretty much the same — the only difference is the randomness. Part of the problem could be that documentation and tutorial writers presume knowledge and the ability to extrapolate from oversimplified examples, but it's mainly that concurrency is just a difficult paradigm to grasp in general. We write our code the way we read anything: left to right, top to bottom. Most of us are just not used to the multitasking and
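To make that earlier defensive pattern concrete — the top-level wrapper that retrieves coroutine exceptions and stops the loop — here is a minimal sketch. This is my reconstruction under assumed names (`RestartFailed`, the fourth-message failure, and the `caught` list are all illustrative, not the talk's exact code):

```python
import asyncio

class RestartFailed(Exception):
    pass

caught = []  # record surfaced exceptions so we can inspect them

async def publish(queue):
    """Hypothetical publisher whose fourth message fails."""
    for i in range(1, 6):
        if i == 4:
            raise RestartFailed(f"could not publish message {i}")
        await queue.put(f"msg-{i}")

async def handle_exception(coro, loop):
    """Top-level wrapper: retrieve the exception instead of letting it
    get swallowed in an unretrieved task, then stop the loop."""
    try:
        await coro
    except Exception as e:
        caught.append(str(e))
    finally:
        loop.stop()

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
queue = asyncio.Queue()
loop.create_task(handle_exception(publish(queue), loop))
loop.run_forever()  # runs until handle_exception stops it
loop.close()
```

Without the wrapper, the `RestartFailed` would only surface as a "Task exception was never retrieved" log line; with it, the error is retrieved and the loop shuts down deliberately.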
context switching that modern computers allow. And even if we are familiar with concurrent programming, understanding a concurrent system is just hard. But we're not in over our heads yet — we can still make this simulated chaos-monkey service actually concurrent, and in a rather simple way.

To reiterate our goal: we want to build an event-driven service that consumes from pub/sub and processes messages as they come in. We could get thousands of messages a second, so as we get a message, we shouldn't block the handling of it while waiting for the next message we'd receive. To help facilitate this, we'll also need to build a service that actually runs forever: we're not going to have a preset number of messages, and we'll need to react whenever we're told to restart an instance. The triggering event for publishing a restart request could be an on-demand request from a service owner, or a gradually rolling restart of the fleet — you don't know. So we'll first mock the publisher to always be publishing restart requests, and therefore never indicate that it's done. This also means we're not providing a set number of messages to publish. In this function, I'm just adding the creation of a unique ID for each message that's produced. When running it, it happily produces messages — but you might notice the KeyboardInterrupt exception triggered by the Ctrl-C, which we don't actually catch. We can quickly clean that up (this is just a band-aid, and I'll explain further on), and now we see something much cleaner. It's probably hard to see why this is concurrent right now, so to help, we're going to add multiple producers. For the publish function, I'll add a publisher ID, include it in our log messages, and create three publishers — and when we run it, we can see a bunch of publishers going on, concurrently.
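A sketch of that "runs forever" publisher with unique message IDs and multiple concurrent producers might look like the following. This is my own reconstruction; the sleep interval stands in for random message arrival, and the demo cancels the publishers after a short window just so it terminates:

```python
import asyncio
import uuid

async def publish(queue, publisher_id):
    """Publish restart requests forever, each with a unique message ID."""
    while True:
        msg = {"id": str(uuid.uuid4()), "publisher": publisher_id}
        await queue.put(msg)
        # stand-in for irregular message arrival
        await asyncio.sleep(0.01)

async def main(run_for=0.2):
    queue = asyncio.Queue()
    # three concurrent publishers make the interleaving visible
    publishers = [asyncio.create_task(publish(queue, i)) for i in range(3)]
    await asyncio.sleep(run_for)  # let the "forever" service run for a bit
    for p in publishers:
        p.cancel()
    # drain whatever was produced while the publishers ran
    messages = []
    while not queue.empty():
        messages.append(queue.get_nowait())
    return messages

messages = asyncio.run(main())
```

Because the three `publish` tasks all yield at `await`, their messages interleave on the queue — that interleaving is the concurrency being demonstrated.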
For the rest of the walkthrough I'm going to remove those multiple publishers — I don't want to confuse anything. Now, on to the consumer. The goal here is to be constantly consuming messages from the queue and to create non-blocking work based off each newly consumed message — in this case, restarting an instance. The tricky part is that the consumer needs to be written in a way that the consumption of the message from the queue is separate from the work that happens for that message. In other words, we have to simulate being event-driven by regularly pulling messages from the queue, since there's no way to trigger work based off of a new message becoming available in that queue — there's no way to be pushed to. So we'll first mock the restart work that needs to happen whenever we consume a message, stick it in our while True loop, await the next message on the queue, and pass it off to restart_host. Then we add that to our loop, and when we run it, we see that messages are being pulled and hosts restarted.

We may want to do more than one thing per message. For example, as we initiate a restart of a given host, we might want to store the message in a database for potentially replaying later. Within the consume function we could just await both coroutines, and you'll see that both happen just fine — the message is saved and the host restarted — but that still kind of blocks the consumption of messages, and we don't necessarily need to await one coroutine after another. These two tasks don't need to depend upon one another (setting aside the potential concern of whether we should restart a host if we haven't saved to the database — that's for another time), so we can treat them as independent. Instead of awaiting them, we can create tasks to have them scheduled on the loop — basically chucking them over to the loop for it to execute when it can. So now restart and save happen not necessarily instantly, but whenever the loop can execute them.
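That fan-out consumer might be sketched like this (my reconstruction; `restart_host`, `save`, and the sleeps simulating I/O follow the talk's narration, the rest is illustrative):

```python
import asyncio

restarted, saved = [], []

async def restart_host(msg):
    await asyncio.sleep(0.01)  # simulate the restart call
    restarted.append(msg)

async def save(msg):
    await asyncio.sleep(0.01)  # simulate a database write
    saved.append(msg)

async def handle_message(msg):
    # neither coroutine depends on the other finishing first
    await asyncio.gather(save(msg), restart_host(msg))

async def consume(queue):
    while True:
        msg = await queue.get()
        # hand the work to the loop so we can immediately pull the next message
        asyncio.create_task(handle_message(msg))
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    consumer = asyncio.create_task(consume(queue))
    for i in range(5):
        await queue.put(f"host-{i}")
    await queue.join()        # all messages pulled off the queue
    await asyncio.sleep(0.1)  # give the fanned-out work time to finish
    consumer.cancel()

asyncio.run(main())
```

The key move is `asyncio.create_task(handle_message(msg))`: consuming the next message is no longer blocked on the work for the previous one.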
As an aside, sometimes you do want your work to happen serially: maybe you only restart hosts whose last restart was more than seven days ago, or maybe you want to check the balance of an account before you debit it. Needing code to be serial, or having steps and dependencies, doesn't mean that it can't be asynchronous. Awaiting the last restart date will yield to the loop, but that doesn't mean restarting the host will be the next thing the loop executes — it just allows other things to happen outside that coroutine. And yes, I admit this wasn't immediately apparent to me at first.

So: we've pulled the message from the queue and fanned out work based off of it. We now need to perform any finalizing work on that message — for example, acknowledging it so it's not redelivered. We'll separate the pulling of the message from creating work off of it, and then we can make use of asyncio.gather to add a callback: once both the save coroutine and the restart coroutine are complete, the cleanup will be called, signifying that the message is done. However, I'm a bit allergic to callbacks, and perhaps we need cleanup to be non-blocking — so we can just await it instead.

Now, much like Google Pub/Sub, let's say the publisher will redeliver a message after ten seconds if it has not yet been acknowledged, but that we're able to extend that message deadline. To do that, we need a coroutine that in essence monitors all the other worker tasks: while we're continuing to do work, this coroutine will extend the message acknowledgment deadline; once we're done, it should stop extending the deadline and clean up the message. One approach is to make use of asyncio's Event primitive: we create an event, pass it to our extend coroutine, and set it when we're done. You can see it extending, and then it stops extending once the message is actually done. And if you really like events, you can make use of event.wait() and move the cleanup outside.
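The deadline-extension pattern with an `asyncio.Event` might be sketched as follows. This is a reconstruction under assumed names; the 0.02-second extension interval stands in for a real ~10-second deadline:

```python
import asyncio

extend_calls = []

async def extend(msg, event):
    """Keep extending the message's ack deadline until the work is done."""
    while not event.is_set():
        extend_calls.append(msg["id"])      # pretend to extend the deadline
        await asyncio.sleep(0.02)           # the real deadline would be ~10s

async def cleanup(msg, event):
    """Once the work signals completion, ack the message."""
    await event.wait()
    msg["acked"] = True

async def handle_message(msg):
    event = asyncio.Event()
    asyncio.create_task(extend(msg, event))
    cleaner = asyncio.create_task(cleanup(msg, event))
    await asyncio.sleep(0.1)  # simulate restart_host() and save()
    event.set()               # work done: stop extending, trigger cleanup
    await cleaner

msg = {"id": "msg-1", "acked": False}
asyncio.run(handle_message(msg))
```

While the simulated work runs, `extend` fires repeatedly; setting the event both breaks that loop and releases `cleanup`'s `event.wait()`.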
So now we've got a little bit of concurrency going on. To review real quick: asyncio is pretty easy to use, but that doesn't automatically mean you're using it correctly. You can't just throw async and await keywords around blocking code. It's a shift in your mental paradigm: you have to think about what work can be farmed out and left to do its thing, and then about what dependencies there are — where your code might still need to be sequential. Having steps in your code — first A, then B, then C — might seem like blocking when it's not: sequential code can still be asynchronous. For instance, I might have to call customer service at some point, and I'm going to be on hold for a while, so I can put it on speakerphone and go play with my super-needy cat. I might be single-threaded as a person, but I can definitely multitask, like CPUs do.

Earlier we added a try/except/finally around our main event loop code, but you probably want your service to gracefully shut down if it receives a signal of some sort: cleaning up open database connections, stopping message consumption, finishing responses to current requests while not accepting new ones. If we happen to restart an instance of our own service, we should clean up the mess before we exit. Now, we've been catching the commonly known KeyboardInterrupt exception, like many other tutorials and libraries do, but there are other signals we should be aware of. The typical ones are SIGHUP, SIGQUIT, SIGTERM, and SIGINT; there are also SIGKILL and SIGSTOP, which you can't catch, block, or ignore. If we run our current script as it is and send it a SIGTERM, we find ourselves never actually entering that finally clause where we log and clean everything up. So we've got to be aware of where those exceptions can happen. I also want to point out that even though we're only ever expecting KeyboardInterrupt, it could happen outside of where we catch the exception, potentially leaving the service in an incomplete or otherwise unknown state.

So instead of catching KeyboardInterrupt, let's attach signal handlers to the loop. First we define the shutdown behavior that we want: we'll simulate closing database connections, returning messages to pub/sub as not-acknowledged (so that they can be redelivered and not just dropped), and actually cancelling tasks. We don't necessarily need to cancel all the pending tasks — we could just collect them and allow them to finish; it's up to what we want to do. We might also want to take this opportunity to flush any collected metrics so they're not lost. Then we hook this up to our main event loop. I've also removed the KeyboardInterrupt catch, since it's now taken care of by the signal handling — and so we can run this again.
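A sketch of attaching that shutdown behavior via `loop.add_signal_handler` follows (my reconstruction; `add_signal_handler` is Unix-only, and here we send ourselves a SIGTERM shortly after startup just so the demo terminates — the real service would wait for an operator or orchestrator to send it):

```python
import asyncio
import os
import signal

received = []

async def shutdown(sig, loop):
    """Gracefully shut down: cancel outstanding tasks, then stop the loop."""
    received.append(sig.name)
    # the real service would also close DB connections, nack in-flight
    # pub/sub messages so they get redelivered, and flush metrics here
    tasks = [t for t in asyncio.all_tasks(loop) if t is not asyncio.current_task()]
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    loop.stop()

async def work():
    while True:  # stand-in for publish/consume running forever
        await asyncio.sleep(0.01)

loop = asyncio.new_event_loop()
for sig in (signal.SIGHUP, signal.SIGTERM, signal.SIGINT):
    loop.add_signal_handler(sig, lambda s=sig: loop.create_task(shutdown(s, loop)))
loop.create_task(work())
# simulate `kill -TERM <pid>` shortly after startup (Unix only)
loop.call_later(0.05, os.kill, os.getpid(), signal.SIGTERM)
try:
    loop.run_forever()
finally:
    loop.close()
```

Note the `lambda s=sig: ...` default-argument trick, which binds each signal to its own handler rather than having every handler see the last loop variable.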
Sending it a SIGTERM now, it looks like it cleaned up — but you see we got this "caught exception" error twice. This is because awaiting cancelled tasks will raise asyncio.CancelledError, which is to be expected, and we can account for that in our little handle-exception wrapper as well. If we run it again, we actually see our coroutines being cancelled, rather than just some random CancelledError exception.

You might be wondering which signals you should care about — and apparently there is no standard. Basically, you should be aware of how you're running your service and handle signals accordingly. Also, as a heads-up: another misleading API in asyncio is shield. The docs say it's a means to shield a future from cancellation — and yet, if you have a coroutine that must not be cancelled during shutdown, asyncio.shield will not help you. This is because the task that shield creates gets included in asyncio.all_tasks(), and therefore receives the cancellation signal just like the rest of the tasks. To help illustrate, I have a simple async function with a long sleep that we want to shield.
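A sketch of that gotcha (my own reconstruction): a typical shutdown handler cancels everything in `all_tasks()`, and the inner task that `shield()` created is on that list, so the "shielded" work is cancelled anyway.

```python
import asyncio

status = {"done": False}

async def cant_stop_me():
    """A coroutine we'd like to survive shutdown."""
    await asyncio.sleep(60)
    status["done"] = True

async def main():
    asyncio.shield(cant_stop_me())
    await asyncio.sleep(0.05)
    # a typical shutdown handler cancels everything in all_tasks() ...
    tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    # ... and the task shield() created was in that list,
    # so the shielded work was cancelled along with everything else
    return len(tasks)

n_cancelled = asyncio.run(main())
```

`shield` only protects the inner task from a cancellation that arrives *through the awaiting coroutine*; cancelling the inner task directly, as a shutdown sweep does, still kills it.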
When we run it and cancel before the 60 seconds are up, we see that we never hit the "done" line — it's immediately cancelled. So yeah, that's fun. Until we have nurseries in asyncio, it's up to us to be responsible and clean up after ourselves: closing the connections and files that we open, responding to outstanding requests, and basically leaving things how we found them. Doing our cleanup in a finally clause isn't enough, since a signal could be sent outside the try/except. As soon as we construct a loop, we should declare how it should be deconstructed; that ensures all of our bases are covered and we're not leaving any artifacts around. And finally, we need to be aware of when our program should shut down, which is closely tied to how we run it: if it's a manual script, then SIGINT is fine; if it's a daemonized Docker container, then SIGTERM is probably more appropriate.

You may have noticed that we're not actually catching exceptions within restart_host and save — only at the top level. To show you what I mean, we'll fake an error where a certain host can't be restarted. Running it, we see that a host can't be restarted, and while the service did not crash, it did save to the database — but it did not clean up or ack the message, and the extend coroutine for that message's deadline keeps spinning. We've effectively deadlocked on the message. A simple thing to do is to add return_exceptions=True to our asyncio.gather: rather than an exception propagating and dropping everything else, it's returned along with the successful results. However, then you can't really see what actually errored. We could add a callback — but as I said, I'm allergic — so we can instead add a little helper function that processes the gather results afterwards.
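That helper might look like this (a sketch under my own names; the un-restartable hostname is made up):

```python
import asyncio

class RestartFailed(Exception):
    pass

async def restart_host(msg):
    # hypothetical: one particular host can never be restarted
    if msg == "host-3":
        raise RestartFailed(f"cannot restart {msg}")
    await asyncio.sleep(0.01)
    return f"restarted {msg}"

async def save(msg):
    await asyncio.sleep(0.01)
    return f"saved {msg}"

def handle_results(results):
    """Split gather() results into successes and surfaced errors."""
    errors = [r for r in results if isinstance(r, Exception)]
    successes = [r for r in results if not isinstance(r, Exception)]
    return successes, errors

async def handle_message(msg):
    # return_exceptions=True keeps one failure from swallowing the rest
    results = await asyncio.gather(
        restart_host(msg), save(msg), return_exceptions=True
    )
    return handle_results(results)

successes, errors = asyncio.run(handle_message("host-3"))
```

The save still succeeds, and the restart failure comes back as a value we can log and react to, instead of silently deadlocking the message.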
With something like this in place, errors are now logged and we can handle them appropriately. So, a quick review: exceptions do not crash the system, unlike in non-asyncio programs — they might go entirely unnoticed, and we need to account for that. I personally like using asyncio.gather because the order of the returned results is deterministic, but it's easy to get tripped up with it: by default it will swallow exceptions while happily continuing to work on other tasks, and if an exception is never returned, weird behavior can happen — like that event coroutine spinning forever.

All right. I'm sure, as you started using asyncio, you might have realized that async/await starts infecting the rest of your codebase: everything needs to be async. That's not necessarily a bad thing — it just forces a shift in perspective. For our code to work with a non-async dependency, we need
to maybe rework our consumer. Not much is actually needed: I'm still making use of an async wrapper coroutine to call a non-async consume, using a thread pool executor to run that blocking code. As an aside, there's a handy little package called asyncio-extras which provides a decorator for blocking functions, removing that boilerplate so you can just await the decorated function. But sometimes third-party code throws a wrench at you. If you're lucky, you're faced with a third-party library that is multi-threaded and blocking: for example, Google Pub/Sub's Python library makes use of gRPC under the hood with threading, blocks when opening a subscription, and requires a non-async callback for when a message is received. In typical Google fashion, they have some uber-cool technologies and slightly difficult-to-work-with libraries. The streaming-pull future it returns makes use of gRPC for bidirectional communication, and it removes the need for us to periodically poll for messages or manage message deadlines. To illustrate, we can use run_in_executor again: I've made a little helper function to kick off the consumer and the publisher, and to prove that this is now non-blocking, I'll create a little dummy coroutine to run alongside. We add the two coroutine functions, update main, and you can see that the run function isn't blocking. But as I said, although the library does a lot for us, there are a lot of threads in the background — like 15 of them. So I'm going to reuse that "something else" coroutine to periodically log stats on the threads that are running, and I've prefixed the names in our own thread pool executor so I can easily tell which threads I created versus which ones Google created.
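The run_in_executor pattern — blocking third-party code in a thread pool, with a dummy coroutine alongside to prove the loop stays responsive — might be sketched like this (my reconstruction; `blocking_consume` and the timings are illustrative):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_consume(n):
    """A stand-in for a blocking, non-async third-party consumer."""
    time.sleep(0.1)  # would freeze the event loop if called directly
    return [f"msg-{i}" for i in range(n)]

heartbeats = 0

async def something_else():
    """Dummy coroutine to prove the loop is not blocked meanwhile."""
    global heartbeats
    for _ in range(5):
        heartbeats += 1
        await asyncio.sleep(0.02)

async def main():
    loop = asyncio.get_running_loop()
    # thread_name_prefix tags our threads so they stand out from the
    # ones a library like google-cloud-pubsub spins up on its own
    executor = ThreadPoolExecutor(max_workers=5, thread_name_prefix="mayhem")
    consumed, _ = await asyncio.gather(
        loop.run_in_executor(executor, blocking_consume, 3),
        something_else(),
    )
    executor.shutdown()
    return consumed

consumed = asyncio.run(main())
```

Both the blocking consume and the heartbeat coroutine make progress over the same 0.1-second window, which is exactly what calling `blocking_consume` directly would prevent.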
Running this, you can see that Google creates a lot of threads: we have the main thread, which is the asyncio event loop; five threads from us, because we've given our executor five workers; and the rest are Google's — the current thread count is around twenty-two. All in all, the approach to threaded code isn't that different from non-asyncio code — until you realize that you have to call asynchronous code from a non-async function within one of those threads. Obviously we can't just ack a message as soon as we receive it: we have to restart the required host and save the message to our database first. So basically, you have to call asynchronous code from a non-async callback running in a separate thread. Bear with me — I've got about two minutes to run through this. We might reach for asyncio.create_task, as we did earlier, but then we realize: of course, there's no event loop running in that thread. To get a little more color, here are some log lines confirming that, indeed, no event loop is running in the thread. I can hear people saying "read the docs" — but what if we gave it the loop we're running in? It kind of works, but it's deceptive: we're just getting lucky. Once we share an object between the threaded callback and the async code, we've essentially shot ourselves in the foot. To show you, I've created a global queue that the consumer callback adds to, and that we read off of with handle_message — and now something funky happens: nothing is ever consumed from that global queue. If we add a log line for the queue size in that function, we can see it gradually increasing. I'm sure a lot of you see what's going on here: we are not thread-safe. So let's make use of asyncio.run_coroutine_threadsafe and see what happens — yes, it works. In my opinion, it's not that difficult to work with synchronous code in asyncio; what is difficult is working with threads, particularly alongside asyncio.
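The safe pattern — handing a coroutine from a foreign thread's callback to the loop's thread — might be sketched as follows (my reconstruction; the `threading.Thread` objects stand in for the library's own callback threads):

```python
import asyncio
import threading

acked = []

async def handle_message(msg):
    """Async work that must happen per message before acking."""
    await asyncio.sleep(0.01)  # restart the host, save to the DB, ...
    acked.append(msg)

def threaded_callback(msg, loop):
    """Called by the (non-async) third-party library in its own thread.

    asyncio.create_task(...) would fail here: no event loop runs in this
    thread. run_coroutine_threadsafe hands the coroutine over to the
    loop's thread safely instead.
    """
    asyncio.run_coroutine_threadsafe(handle_message(msg), loop)

async def main():
    loop = asyncio.get_running_loop()
    # simulate the library invoking our callback from its own threads
    threads = [
        threading.Thread(target=threaded_callback, args=(f"msg-{i}", loop))
        for i in range(3)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    await asyncio.sleep(0.1)  # let the scheduled coroutines finish

asyncio.run(main())
```

`run_coroutine_threadsafe` uses the loop's thread-safe wakeup machinery under the hood, which is why it works where a plain `create_task`, or sharing an `asyncio.Queue` across threads, does not.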
So if you must mix threads in, use the thread-safe APIs that asyncio gives you — or just avoid threads altogether where you can.

In essence, this talk is something that I would have liked to hear about a year ago — so I'm speaking to past Lynn here — but hopefully there are others who can benefit from a use case that's not just a simple web crawler. Everything is up at that URL, so hopefully it's useful to folks. Thank you! [Applause]