High Performance Networking in Python


Formal Metadata

Title
High Performance Networking in Python
Part Number
27
Number of Parts
169
Author
Yury Selivanov
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt, copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
Yury Selivanov - High Performance Networking in Python

The talk will cover the new async/await syntax in Python, the asyncio library and the ecosystem around it, and ways to use them for creating high performance servers. It will explain how to build custom event loops for asyncio, with an example of using the libuv library with Cython to achieve a 2-3x performance boost over vanilla asyncio.

The talk will start with an overview of the async/await syntax introduced with PEP 492 in Python 3.5. We'll go through the asynchronous context manager and iteration protocols it introduces. I'll briefly explain how the feature is implemented in CPython core. Then we'll explore asyncio's design. I'll briefly cover the event loop, policies, transports, protocols and streams abstractions. I'll explain that event loops are pluggable, which really makes asyncio a universal framework. We'll cover libuv - a high performance networking library that drives NodeJS. I'll highlight where it's similar to asyncio and how it's different. In the final part of the talk I'll explain how to make an asyncio-compatible event loop on top of libuv. I'll showcase Cython, which is an amazing tool for tasks like this. Finally, I'll share some ideas on how we can further improve the performance of asyncio and networking in Python, and what challenges we will face.

Objectives:
1. Deeper understanding of async/await in Python and why it's important.
2. Deeper understanding of asyncio architecture and protocols.
3. How to improve asyncio performance by implementing custom event loops.
4. Show that it's easy to integrate existing complex & low level libraries with Cython.
5. Some perspective on how Python may evolve wrt networking.
Transcript: English (auto-generated)
I'm from Toronto, Canada. My name is Yury Selivanov. I'm a co-founder of MagicStack. Check out our website, it's magic.io. I've been an avid Python user since 2008. I think the first Python version I started to use actually was Python 2, but then in a month I switched to Python 3. I've used it since alpha 2 or something.
I never looked back. So use Python 3. I've been a CPython core developer since 2013, but I believe I actually started to do things even before that. You might know me from PEP 362, which I co-authored with Larry Hastings and Brett Cannon.
That's the Inspect Signature API. Then I created PEP 492. That's the async/await that we have in Python 3.5. And I'm also helping Guido and Victor Stinner to maintain asyncio. I also created uvloop. More on that later.
Structure of the talk. I actually wanted to tell you so much about how to write high-performance code in Python and with asyncio in particular, but unfortunately I had to cut my slides. Like, I don't know, 50% of my slides had to go.
So we'll briefly start with an overview of async await. Then we'll quickly cover asyncio and uvloop. Then we'll answer, or try to answer a question, how you should write your protocols, how you should implement them using sockets or protocols, or maybe you should use streams.
Then I'll present you with something new. It's a new high-performance driver that I open sourced like two hours ago. And then we'll recap. I have to say that there will be no funny cat slides just because performance is hard. So only seven depressed cats from now on. So let's start.
There should be just one obvious way to do it, right? So we have five different ways to do coroutines in Python. The first one is to do callbacks and deferreds. I think Twisted actually started and originated this approach, one of the first major frameworks at least
that used that and kind of validated that it is possible. So then we have Stackless Python and Greenlets. And I'm pretty sure everybody heard about eventlet and gevent. That's a good example of frameworks that use them. In short, the programs in gevent look like normal programs.
They kind of look like you are using threads, but instead it's just one program, one thread, and every point of your code can actually suspend and then resume. It's a lot of dark magic, and as Guido said, it will never be merged in CPython, so those guys are kind of on their own.
Then we have yield, and it has been possible to use generators as coroutines in Python since, I believe, Python 2.5 or something, and Twisted has a decorator called inlineCallbacks so that you can kind of implement modern-looking code
using coroutines in Twisted, and you could do this for years. Then in Python 3.3, yield from was introduced, and asyncio benefits from it. That's how most of the asyncio code is written, using yield from. And then in Python 3.5, we have async/await.
That's the new way. And why do I think that async/await is the answer? Well, first of all, it's a dedicated syntax for coroutines. It's concise and readable. It's easy to actually glance over a large chunk of code and see what's actually going on. You will never confuse coroutines and generators.
There is now a new built-in type for coroutines. It's actually the first time in Python history that we have a new dedicated built-in type just for coroutines. We also have new concepts, async for and async with, and I believe this is something rather unique to Python.
When we added async and await, a lot of people actually told us, well, you copied it from C#. Well, yes, we copied it from C#, but we also introduced new things, and I believe these, async for and async with, are kind of unique. Like, I haven't seen any other imperative language that has this construct.
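The two protocols mentioned above are small: `async with` calls `__aenter__`/`__aexit__`, and `async for` calls `__aiter__`/`__anext__`, all of which may themselves await. A minimal sketch (class and attribute names are illustrative, not from the talk; uses `asyncio.run`, Python 3.7+):

```python
import asyncio

class Resource:
    # `async with` target: __aenter__/__aexit__ may await
    async def __aenter__(self):
        await asyncio.sleep(0)      # e.g. acquire a connection
        return "resource"
    async def __aexit__(self, *exc):
        await asyncio.sleep(0)      # e.g. release it

class Countdown:
    # `async for` target: __aiter__/__anext__ may await per item
    def __init__(self, n):
        self.n = n
    def __aiter__(self):
        return self
    async def __anext__(self):
        if self.n == 0:
            raise StopAsyncIteration
        await asyncio.sleep(0)      # e.g. fetch the next row
        self.n -= 1
        return self.n

async def main():
    async with Resource() as r:
        items = [i async for i in Countdown(3)]
    return r, items

print(asyncio.run(main()))  # ('resource', [2, 1, 0])
```

The point is that both constructs let suspension points live inside setup, teardown, and iteration, which plain `with` and `for` cannot express.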
Async/await is also a generic concept. A lot of people think that async/await can only work with asyncio. That's not true, actually. Asyncio uses async/await, but you can build an entirely new framework and use them on your own. That's, for instance, what David Beazley did with his framework called Curio.
He uses async/await in a completely different way from how it is used in asyncio. And also, async/await is fast. If you write something like a Fibonacci calculator, you will see that it runs only about twice as slow. And that is fine, actually,
because even in big asyncio programs, you don't have as many async/await calls as you have normal function calls. You cannot even compare. It's like 100 times more. So use async/await as much as possible. It won't hurt your performance. You won't see any drawbacks. So, coroutines are a subtype of generators,
but not in a classical Pythonic sense. In CPython, they share the same C struct layout. They share like 99% of the implementation, but a coroutine is not an instance of a generator, actually.
And you can see this sharing of the machinery: if you, for example, disassemble a coroutine, you will see that it still uses the YIELD_FROM opcode. Then we have types.coroutine. Originally, we introduced it to make old-style yield from coroutines from asyncio compatible
with new coroutines that use async/await syntax, because you cannot just await on things. You can only await on awaitable objects. So you cannot await on the number one, and you cannot await on a generator. But if you wrap a generator with the types.coroutine decorator,
you can await on it, actually. And again, David Beazley uses this kind of creatively in Curio. If you are interested in async/await, I definitely recommend you to take a look at how asyncio is implemented and how Curio is implemented, just to compare the two different approaches. And then we have a bunch of protocols for async iterators and async context managers.
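A small sketch of that bridge: wrapping a plain generator with `types.coroutine` makes it awaitable, and driving the coroutine by hand with `.send()` (as an event loop or coroutine runner would) shows the yielded value bubbling up through `await`. Function names here are illustrative:

```python
import types

@types.coroutine
def old_style():
    # A generator-based coroutine: `yield` suspends it,
    # `return` sets the final result.
    yield "suspended"
    return 42

async def new_style():
    # Thanks to @types.coroutine, the plain generator is awaitable.
    return await new_style.__wrapped__() if False else await old_style()

coro = new_style()
# Drive the coroutine by hand, like a coroutine runner would:
print(coro.send(None))       # the yielded value bubbles up: suspended
try:
    coro.send(None)
except StopIteration as exc:
    print(exc.value)         # the return value: 42
```

This is exactly the mechanism frameworks like asyncio and Curio build on: the runner keeps calling `.send()` until `StopIteration` carries the result out.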
Let's move on. Let's talk about asyncio, libuv, Cython, and uvloop. So asyncio was developed by Guido himself, originally. I think a lot of it is inspired by Twisted. And it's actually good, because Twisted existed
for, I don't know, 20 years or something, and they validated that this concept of asynchronous programming in Python actually works. So I think we copied quite a lot from Twisted, and Twisted actually plans to use asyncio at some point when they fully migrate to Python 3. They will just use the asyncio event loop.
A lot of people call asyncio a framework. Well, it's not a framework. I would call it a toolbox, actually. It doesn't implement HTTP, for instance, or any other high-level protocols. It just provides the machinery and APIs for you to develop this kind of stuff. If you want HTTP, you probably would use aiohttp for that.
If you want a memcache driver, you go and Google it. And it's also part of the standard library, which is both good and bad. Why is it bad? Python has a slow release cadence. We see new Python major releases every year and a half,
and bug-fix releases usually are half a year apart. And I would say that for asyncio, sometimes it's not enough. Sometimes we discover bugs and we want to fix them as soon as possible, but we have to stick with the Python release cycle. But it's also good, because you kind of know
that asyncio will stay with us for a while. It will always be supported by someone, because it's a part of the standard library. And also, Python has a huge network of build-bots with different architectures and different operating systems, and it's quite important, actually, to test something as convoluted and as hard as I/O
on different platforms. So it's good. Asyncio is quite stable right now, and it will be even more stable pretty soon. So what's inside asyncio? We have a standardized and pluggable event loop.
Actually, asyncio from the beginning was envisioned in a way that you can swap the event loop implementation with something different. It defines protocols and transports. That's one way to actually marry callback-style programming and async/await:
to develop protocols using low-level primitives. It also has factories for servers and connections and streams. And this is also quite important, because if you implement a server, let's say, using blocking sockets,
you implement it once, and then when you start to implement it a second time, you will see that you have lots and lots of boilerplate code that kind of looks the same every time. So asyncio takes care of that and factors out all of this implementation into convenient helpers for creating servers and creating connections. It also defines futures and tasks. A task is something that actually runs a coroutine,
that pushes values into coroutines, that suspends them and resumes them. In a framework-independent way, it's called a coroutine runner, actually. And futures allow you to interface with callbacks.
That's how you actually introduce async/await into something that uses callbacks. It also has interfaces for creating and communicating with subprocesses asynchronously. It has queues. And by the way, Queue is a very useful class.
You should definitely use it. It's exceptionally hard to create an asynchronous queue that supports cancellations and all the stuff like that without bugs. We still fixed a lot of queue bugs in 3.5.2. So queues are useful for things
like connection pools, for instance. Definitely check it out. And we also have locks, events, semaphores, everything like that, everything that nobody knows how to use, actually. And as Łukasz Langa said in his talk at PyCon US a couple of months ago, if you love the locks, you can still have them in asyncio.
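The connection-pool use of `asyncio.Queue` can be sketched in a few lines: workers borrow a "connection" from the queue and return it when done, and `await pool.get()` transparently makes extra workers wait their turn. (The pool contents here are just strings standing in for real connections; uses `asyncio.run`, Python 3.7+.)

```python
import asyncio

async def main():
    # A tiny connection pool: the queue holds idle "connections".
    pool = asyncio.Queue()
    pool.put_nowait("conn-1")
    pool.put_nowait("conn-2")

    async def worker(n):
        conn = await pool.get()       # waits if the pool is empty
        try:
            await asyncio.sleep(0)    # pretend to run a query
            return f"task-{n} used {conn}"
        finally:
            pool.put_nowait(conn)     # always return the connection

    # Four workers share two connections without any explicit locking.
    return await asyncio.gather(*(worker(n) for n in range(4)))

for line in asyncio.run(main()):
    print(line)
```

All the tricky cancellation handling mentioned above lives inside `Queue`; the caller just awaits.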
So the event loop is the foundation. It's the engine that actually executes asyncio code. It also provides factories for tasks and futures. It's also an I/O multiplexer: the engine that actually reads the data and pushes the data to the wire.
It provides low-level APIs for scheduling callbacks, for scheduling timed events, for working with subprocesses and handling Unix signals. And the best part about it is that you can replace it. So that's what we kind of did with uvloop.
Uvloop is 99.9% compatible with asyncio. I'm not aware of any incompatibilities, but maybe there are some. As far as I know, you can drop uvloop into pretty much any program and it will just work. It's written in Cython. And by the way, Cython is just amazing.
It's unfortunate that it's not as widespread, and I think it's kind of underappreciated what you can do in Cython. Essentially, it's a superset of the Python language. You can strictly type it and it will compile to C, so you can achieve C speed with a syntax close to Python.
So definitely check out Cython and try to use it. Uvloop uses libuv. Libuv is something that keeps Node.js running, actually. Node.js uses libuv as its event loop. And it's actually a good thing, because Node.js is super widespread
and it's very, very well tested. So libuv is stable and it's fast. Uvloop also provides fast tasks and futures. So even your async/await code runs faster on uvloop, by about 30%. And that's also thanks to libuv and a few hacks.
It has super fast I/O. So how fast is uvloop? Well, compared to asyncio, it's two to four times faster on simple benchmarks like an echo server. Again, nobody probably deploys echo servers in real life, so as soon as you add more Python code,
of course it will become slower. But again, even in real applications, I've seen reports that uvloop runs code about 30% faster. And also, the latency distribution is much better with uvloop. So it's faster than asyncio.
What about other platforms and frameworks? For instance, the same echo server written in Python which uses uvloop is two times faster than Node.js. And it's kind of interesting, because Node.js itself is mostly V8, the JavaScript implementation. It uses libuv, which is written in C,
and there is a thin layer of JavaScript on top of it. So still, uvloop, which uses the same libuv, is two times faster for almost the same amount of code. It is as fast as Go, run with GOMAXPROCS set to one. That essentially means that Go cannot parallelize
the load on multiple CPUs. But still, it's quite an impressive result, because Go is a fully compiled language and it also has, I think, a bit more efficient implementation of I/O than libuv,
just because libuv is trying to be generic: it supports Windows, it supports Unix. Golang supports them too, but in a slightly different way. Anyways, and of course, it's much faster than Twisted and Tornado, just because a lot of it is in C; like, most of uvloop is in C.
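Dropping uvloop into an existing program is typically one line: install its event loop policy before any loop is created. A hedged sketch (the `ImportError` guard is mine, so the snippet still runs where uvloop isn't installed; uses `asyncio.run`, Python 3.7+):

```python
import asyncio

try:
    # uvloop is a drop-in replacement: with its policy installed,
    # every new event loop is a libuv-based one.
    import uvloop   # third-party: pip install uvloop
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
except ImportError:
    pass            # fall back to the stock asyncio event loop

async def main():
    # Application code is unchanged either way.
    return "hello from " + type(asyncio.get_running_loop()).__name__

print(asyncio.run(main()))
```

The rest of the program needs no changes, which is what "99.9% compatible" buys you.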
So initially, my idea for this talk was to end with this slide: just use uvloop. Thank you for your time, questions. But unfortunately, it's not that easy. So part three, let's talk about sockets, streams, and protocols. That's basically one obvious way to do it, episode two.
So what should you choose? Should you use the low-level socket coroutines, like sock_recv and sock_connect? Or should you use the high-level streaming API? Or maybe you should use low-level protocols and transports. Here is an echo server implemented with the loop.sock_* methods,
and if you look at it closely, you will see that if you kind of drop the async and await keywords, it looks like normal blocking code that uses the socket module. So it is kind of convenient when you have lots and lots of old-style blocking code.
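Since the slides aren't visible here, this is a reconstruction of what such a loop.sock_*-based echo server looks like, with a client wired in so it is self-contained (all names are mine; uses `asyncio.run`, Python 3.7+):

```python
import asyncio
import socket

async def handle(loop, conn):
    # Echo until the peer closes: reads like blocking socket code
    # with `await` added in front of each I/O call.
    try:
        while True:
            data = await loop.sock_recv(conn, 1024)
            if not data:
                break
            await loop.sock_sendall(conn, data)
    finally:
        conn.close()

async def main():
    loop = asyncio.get_running_loop()

    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    srv.listen()
    srv.setblocking(False)       # required for the loop.sock_* methods
    port = srv.getsockname()[1]

    async def client():
        c = socket.socket()
        c.setblocking(False)
        await loop.sock_connect(c, ("127.0.0.1", port))
        await loop.sock_sendall(c, b"ping")
        reply = await loop.sock_recv(c, 1024)
        c.close()
        return reply

    client_task = asyncio.ensure_future(client())
    conn, _ = await loop.sock_accept(srv)
    server_task = asyncio.ensure_future(handle(loop, conn))
    reply = await client_task
    server_task.cancel()
    srv.close()
    return reply

print(asyncio.run(main()))  # b'ping'
```

Note how much setup (non-blocking mode, accept loop, buffering) is on you; that's the downside discussed below.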
You can kind of easily convert it to async/await. Here is the streams version of the echo server. It's quite high level, as you see. You don't work with sockets anymore. You have a reader and a writer.
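A self-contained sketch of the streams version (handler and client names are mine; uses `asyncio.run`, Python 3.7+):

```python
import asyncio

async def handle(reader, writer):
    # High level: no sockets, just a reader and a writer.
    data = await reader.readuntil(b"\n")   # read one whole line
    writer.write(data)                     # echo it back
    await writer.drain()                   # cooperates with flow control
    writer.close()

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"echo me\n")
    reply = await reader.readuntil(b"\n")
    writer.close()
    server.close()
    await server.wait_closed()
    return reply

print(asyncio.run(main()))  # b'echo me\n'
```

`reader.readexactly(n)` and `reader.readuntil(sep)` are the "give me exactly this much" and "read until you see \n" operations mentioned later; buffering and flow control come for free.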
And here is a low-level implementation of the echo server using protocols. So essentially, a protocol is something that the event loop just pushes the data into, and the protocol has a transport to push the data back to the client. So the key method here is data_received.
That's like the main method. The event loop pushes the data to data_received, then the protocol can process the data and call transport.write to actually send the processed data, or send a response, back to the caller.
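A reconstruction of the protocol version, with a stream client added so it runs standalone (class name is mine; uses `asyncio.run`, Python 3.7+):

```python
import asyncio

class EchoProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        # The transport is how we push data back to the peer.
        self.transport = transport

    def data_received(self, data):
        # The event loop pushes incoming bytes here; we echo them out.
        self.transport.write(data)

async def main():
    loop = asyncio.get_running_loop()
    server = await loop.create_server(EchoProtocol, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # Use the high-level stream API as a test client.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"hello")
    reply = await reader.readexactly(5)
    writer.close()
    server.close()
    await server.wait_closed()
    return reply

print(asyncio.run(main()))  # b'hello'
```

All control flow is callbacks: the loop drives `data_received`, and the protocol answers through the transport.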
For echo servers, it's quite a simple implementation, but you can imagine it gets pretty hairy for more complex protocols. So, downsides. When you use the low-level loop.sock_* methods, the loop cannot buffer for you.
So you are responsible for implementing the buffering on top. And you also have no flow control, which without buffers doesn't make any sense: you don't need flow control until you start implementing buffers, and then you won't have it. And it's quite a tricky thing to implement correctly. And another reason why you shouldn't use it
is just because the event loop has no idea what you are doing right now. Let's say you are reading some data. Okay, the event loop will add your file descriptor to a selector, which can be epoll or kqueue on Unix,
and essentially wait for an event. When it receives this event, it will try to read the data and push the data back to you, but it will also remove the file descriptor from the selector. That's an extra system call. Because it doesn't know: will you continue reading the data,
or will you write data now, or will you just stop, or will you close the connection? So it cannot predict what's going on. When you use streams (and streams, by the way, are built on protocols), the event loop just knows, because you have declared an intent: just keep sending the data to my data_received
or to my stream, and when I don't need this data, I will close the connection myself. So the event loop can actually optimize for that. And flow control is kind of important. I like this picture because it illustrates that sometimes you kind of have to push back on something slow or something that you don't want to use right now.
So which API should you use? You should use the loop.sock_* methods when you are quickly prototyping something or when you are porting some existing code, but I would highly recommend you to actually stick to streams. Even for porting code,
just rewrite it with streams, because streams are much easier to use. You can just say, give me exactly this amount of data, or you can tell streams, read until you see \n or something like that. And it will do it. It also implements read and write buffers quite efficiently, and you can use async/await
to program entire protocols with streams. And use protocols and transports for performance, actually. If you want exceptional performance, you have to go low level. So for this talk, let's focus on protocols and transports. And again, it's kind of important:
for your application code, you should always use async/await. Never even touch, never think about transports and protocols. This stuff is just for drivers: drivers for PostgreSQL, for Memcache, for Redis, any kind of that code. High-level code should never think about protocols.
Always use async/await. It will be enough. So let's focus on protocols. So as we mentioned before, the loop pushes data to protocols. Protocols send data back using transports. And protocols can implement specialized read and write buffers.
They can also do flow control. They can hint the event loop through the transport's pause_reading and resume_reading methods. And you have full control over how I/O is performed. You call transport.write; you can pause or resume data consumption.
So you have tools to control the I/O. So how do you use protocols and transports? There are basically two strategies. The first one is you implement your own abstractions: your own buffering and your own stream abstractions. And a good example of that is aiohttp.
That's what they do. They have buffers and streams specifically designed to handle and parse the HTTP protocol. And then they just use async/await. It's a fine approach. It will be slower than using callbacks and then accelerating everything in C, but it's still quite good.
So the second strategy is to actually implement the whole protocol parsing in callbacks, and then create a facade that allows you to use async/await. And the main key reason why this might be a better strategy,
and why this can offer better performance, is because you can just drop Python completely. You can go low level. You can use Cython, you can use C. So part four, asyncpg. This is something that I just open sourced a couple of hours ago.
This is right now the fastest PostgreSQL driver for asyncio and for Python, actually. It's two times faster than psycopg2. It completely re-implements the protocol from the ground up. It doesn't use libpq, the de facto library for working with PostgreSQL. We just implemented it completely from scratch.
It uses the PostgreSQL binary data format. And by the way, when you are implementing a protocol and you have a choice between text and binary, always choose binary. It's easier to read binary, it's usually just less data because the encoding is more efficient, and you can process it much faster.
Because of how binary formats work, you usually have a length field that tells you how much data follows in this frame, and then you have another one. So you can read frames much faster, you can decode types much faster. So always choose binary. Also, not all Postgres types can be encoded in text
and actually decoded from text — composite types, for instance. If you have a recursive composite type, it's just not possible to decode it in psycopg2. So what we did for asyncpg: we actually forgot about DBAPI completely. There is no DBAPI for async/await,
but, for instance, what aiopg does is kind of sprinkle async and await on top of the existing DBAPI. So our idea was: let's build a driver that is tailored for Postgres and uses Postgres features. We also support basically all built-in Postgres types.
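The length-prefixed framing described earlier can be sketched in pure Python. The wire format here (a 1-byte message type plus a 4-byte big-endian payload length) is only an illustration; real Postgres messages are framed similarly, but the length field there includes itself.

```python
import struct

def iter_frames(buf: bytes):
    """Split a byte stream into (message_type, payload) frames.

    Illustrative format: 1-byte type, 4-byte big-endian payload length.
    The length field tells us exactly how far to skip to the next frame.
    """
    pos = 0
    while pos + 5 <= len(buf):
        msg_type, length = struct.unpack_from('!cI', buf, pos)
        payload = buf[pos + 5:pos + 5 + length]
        yield msg_type, payload
        pos += 5 + length

# Two frames back to back: no scanning for delimiters, just jumps.
stream = b'R' + struct.pack('!I', 2) + b'ok' + b'Z' + struct.pack('!I', 1) + b'I'
frames = list(iter_frames(stream))
```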
So Postgres loves prepared statements, because it doesn't need to parse the same query over and over again. When you prepare a statement, there is a structure on the server with a plan, with a parsed query.
And Postgres already knows how to accept your arguments and do that kind of stuff. So we use prepared statements every time. Even when you don't explicitly create them, we have an LRU cache of prepared statements, and we do that transparently for you. We also dynamically build pipelines for efficiently encoding and decoding data.
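That transparent LRU cache of prepared statements might be sketched like this. The class name, the synchronous `prepare` callable, and the cache size are all hypothetical, not asyncpg's actual internals:

```python
from collections import OrderedDict

class StatementCache:
    """Transparent LRU cache: prepare each distinct query text only once."""

    def __init__(self, prepare, max_size=100):
        self._prepare = prepare       # callable that prepares a query server-side
        self._max_size = max_size
        self._cache = OrderedDict()   # query text -> prepared statement

    def get(self, query):
        if query in self._cache:
            self._cache.move_to_end(query)    # mark as most recently used
            return self._cache[query]
        stmt = self._prepare(query)
        self._cache[query] = stmt
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)   # evict the least recently used
        return stmt
```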
So the pipeline is essentially an array of pointers to C functions that can process the stream with enormous speed. And it actually shows. This chart compares different Postgres drivers for different languages.
The fastest one is asyncpg. It manages to push almost 900,000 queries to the server. The second one is aiopg. That's another driver, built on psycopg2, which wraps libpq and is written in C. But unfortunately, psycopg2 doesn't provide
an efficient async interface, so it's slower. And also, aiopg and psycopg2 use text data encoding, so they will always be slower. Then you see two Go implementations, and then you see Node.js drivers, which are just 10 times slower.
The funny part about this one is that pg is an actual pure-JavaScript implementation of the driver, and pg-native uses libpq. So somehow, a lot of JavaScript is faster than C — I have no idea how. The funny thing about this performance is that there is another library.
It's not part of this chart because it's kind of slow. It's called py-postgresql. Nobody knows about it. We used it for several years, and then we just created asyncpg. Anyway, it's a pure-Python implementation, and it's as fast as the pure-JavaScript implementation. So everybody is saying that Python is slower
than JavaScript and you shouldn't use it, but we kind of saw that it's possible to write pure Python code that's as fast as Node.js code, so maybe Python isn't that slow. So, asyncpg architecture: the meat of it
is implemented in CoreProtocol, which is written in Cython. It uses callbacks to process the protocol. Then we have a Protocol class that just wraps CoreProtocol and inserts some future objects into it so that you can use async/await. And the rest of asyncpg is just pure Python
code that implements the high-level API. So how would you parse the Postgres protocol? The naive approach would be to just use Python bytes and memoryviews, but unfortunately, doing so will create a lot of Python objects, and you will actually see how long
you spend on memory allocation. So the solution is to use Cython, go down to C types, and just not even touch Python bytes and memoryviews. So this is a preview of the read buffer. The actual API is a bit bigger than this,
but you can see the first method is the most important: feed_data. That's what the protocol's data_received actually calls. data_received has just two lines in it: the first one pushes the data to the read buffer, and the second one calls a function that reads from the buffer. And this buffer is kind of tailored for the Postgres protocol.
It has low-level read_int32 and read_int16 calls, and the second most important call here is try_read_bytes. try_read_bytes either returns you a low-level C data type or it returns you a null pointer. And if it returns a null pointer,
then you actually call read, which returns you a Python object, which is much slower. But most of the time — 99% of the time — try_read_bytes succeeds, and we can avoid creating any Python objects. So again, the high-level logic of asyncpg is built in pure Python.
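Here is a pure-Python sketch of that buffer API. The real asyncpg buffer is written in Cython and its fast path hands back raw C pointers; here, a memoryview and a None return stand in for the pointer/null-pointer split, and the method names mirror the ones just described:

```python
class ReadBuffer:
    """Fast path: zero-copy view into one chunk. Slow path: stitch chunks."""

    def __init__(self):
        self._chunks = []
        self._pos = 0  # read offset into the first chunk

    def feed_data(self, data):
        self._chunks.append(data)

    def try_read_bytes(self, n):
        # Fast path: the first chunk already holds n contiguous bytes.
        if self._chunks and len(self._chunks[0]) - self._pos >= n:
            view = memoryview(self._chunks[0])[self._pos:self._pos + n]
            self._pos += n
            if self._pos == len(self._chunks[0]):
                self._chunks.pop(0)
                self._pos = 0
            return view
        return None  # caller falls back to the slow read() path

    def read(self, n):
        # Slow path: copy bytes together across chunk boundaries.
        out = bytearray()
        while len(out) < n and self._chunks:
            avail = len(self._chunks[0]) - self._pos
            out += self.try_read_bytes(min(n - len(out), avail))
        return bytes(out)
```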
That is how you can actually use it. You can see it's a pretty high-level API: we prepare a statement, we enter a transaction with async with, and we iterate over a scrollable cursor. Part five: let's recap. So don't be afraid of protocols. Use them to implement really, really high-performance
drivers, and use Cython for low-level code. It's really much easier to code in Cython than in C: you can quickly refactor your code completely, change everything, and it will just work. async/await should always be used in your application code. Don't think about protocols and transports.
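As an example of that high-level application code, the asyncpg usage pattern shown earlier — prepare a statement, enter a transaction with async with, iterate a scrollable cursor — might look roughly like this. The DSN, table, and query are made up, and actually running it needs the asyncpg package and a live PostgreSQL server:

```python
import asyncio

async def fetch_users(dsn):
    import asyncpg  # third-party driver; only needed when actually run

    con = await asyncpg.connect(dsn)
    try:
        # Explicitly prepared statement (asyncpg also caches these for you).
        stmt = await con.prepare('SELECT id, name FROM users WHERE id > $1')
        async with con.transaction():  # cursors must run inside a transaction
            async for record in stmt.cursor(10):  # server-side cursor
                print(record['id'], record['name'])
    finally:
        await con.close()
```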
Use only high-level code. And again, once you have fast database drivers, memcache drivers, stuff like that, and you use uvloop, you will see your application being much, much faster. loop.create_future() was introduced in Python 3.5.2, actually. That's a new feature. If you use loop.create_future(),
uvloop can inject a fast future implementation into your code, because uvloop implements its own version of the future, and it's about 30% faster than asyncio's Future. Always use binary protocols; never even try to parse text protocols — it doesn't make any sense.
If you can do binary, go binary. Always profile your code. It's actually funny, because when asyncpg first started to work, I benchmarked it against aiopg, and it was two times slower. And I didn't understand why, because it should be faster.
There is no way it can be slower. So I spent about 30 hours without sleep optimizing asyncpg, and made it two times faster — no, four times faster. So the important lesson from this is that if that first run had shown asyncpg being, say, 30% faster than aiopg, maybe I wouldn't have spent so much time
trying to optimize it. So always profile, always analyze, and then try to push it forward. And by the way, Cython code can be profiled with Valgrind's callgrind, and you can visualize the results in KCachegrind. It's a very useful tool; check it out. And Cython has a useful flag.
It's called -a (annotate). It generates an HTML representation of your source file, with each line highlighted: it's either blank or a shade of yellow, and the most yellow lines use the most Python C API, which is slow. So basically you have a quick way of analyzing your Cython code and its speed.
So definitely check out that option. Always try to do zero-copy. Try to avoid working with bytes, memoryviews, all that kind of stuff. Go low-level with Cython, and never copy Python objects. And one of the last pieces of advice, actually,
is to implement an efficient buffer for writing data. For instance, for asyncpg, what we do for writing messages: we have a write buffer that pre-allocates a portion of memory, and then we compose messages with a high-level API without reallocating that memory at all,
and when the message is ready, we just send it. So we have a high-level API for creating the message, but we don't allocate any memory while we are doing so. And when you have this control, you should definitely set the TCP_NODELAY flag. We will probably set it by default
in asyncio in Python 3.6; right now, it's not set. You should do it, because it will speed up the transport.write() method. Basically, with this flag set on the socket, the socket doesn't wait until it receives a TCP ACK; it just sends the data as soon as you write it.
But if you don't have control over how frequently you are calling transport.write(), you can use TCP_CORK instead. What you do is cork the channel, then do multiple writes to it, then uncork it, and it sends all of your data in as few TCP packets as possible.
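Setting those socket options looks like this. TCP_NODELAY is portable; TCP_CORK is Linux-only, hence the hasattr guard:

```python
import socket

def tune_socket(sock, cork=False):
    # Disable Nagle's algorithm: send each write immediately instead of
    # waiting for an ACK before batching small packets.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    if cork and hasattr(socket, 'TCP_CORK'):
        # Linux-only: hold small writes back until the socket is uncorked.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)

def uncork(sock):
    # Flush everything corked so far in as few TCP packets as possible.
    if hasattr(socket, 'TCP_CORK'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)
```

The corked pattern is: `tune_socket(sock, cork=True)`, several small `send()` calls, then `uncork(sock)`.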
And the last slide is timeouts. Always implement timeouts as part of your API. Don't ask your users to use asyncio.wait_for(), because wait_for is slow: it wraps the coroutine into a task, and that comes with a huge penalty.
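A driver can instead arm a timer with loop.call_later() when an operation starts and cancel it as soon as the result arrives, so no extra task is created. This sketch is illustrative, not asyncpg's actual internals:

```python
import asyncio

class TimedOperation:
    """Future-based operation with a built-in timeout (no task wrapping)."""

    def __init__(self, loop, timeout):
        self._fut = loop.create_future()
        self._timer = loop.call_later(timeout, self._on_timeout)

    def _on_timeout(self):
        if not self._fut.done():
            self._fut.set_exception(asyncio.TimeoutError())

    def set_result(self, value):
        self._timer.cancel()  # result arrived in time: disarm the timer
        if not self._fut.done():
            self._fut.set_result(value)

    def __await__(self):
        return self._fut.__await__()

async def demo():
    loop = asyncio.get_running_loop()  # Python 3.7+
    op = TimedOperation(loop, 0.01)    # nobody delivers a result -> times out
    try:
        await op
    except asyncio.TimeoutError:
        return 'timed out'

result = asyncio.run(demo())
```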
Your code will become 30% slower if you use wait_for. So design timeouts as part of your API; at the lower level, implement them with the loop.call_later() method, and it will just work. That's it. Thank you.
Yeah, so I think we have time for maybe one or two questions. Hi, thank you for the presentation.
I want to ask you about using asyncio and uvloop, your event loop, not for high performance, but for high concurrency. Would you use it for high concurrency?
Of course. I have a scenario with hundreds of thousands of concurrent connections. Yes, uvloop is even better for that, because it uses less memory than asyncio. uvloop is much better for a highly concurrent application
that handles hundreds of thousands of connections, simply because it uses less memory. And again, it's faster. We tested uvloop with 100,000 connections, and it handles that; it's pretty okay. Thank you.
Unfortunately, this is all the time we have. Thank you.