
Opening PyPy's magic black box


Formal Metadata

Title
Opening PyPy's magic black box
Subtitle
A deep dive into the JIT
Title of Series
Number of Parts
118
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
PyPy is a fast and compliant implementation of Python. In other words, it's an interpreter for the Python language that can act as a full replacement for the reference interpreter, CPython. It's optimised to enable efficient just-in-time (JIT) compilation of Python code to machine code, and has releases matching versions 2.7 and 3.6. It now also supports the main pillars of the scientific ecosystem (numpy, Cython, scipy, pandas, ...) thanks to its emulation layer for the C API of CPython. The PyPy JIT is often just described as "magically running your code faster", but it is actually what is known as a "meta-tracing JIT". A tracing JIT optimises loops by recording and optimising a single, hopefully representative, execution of the loop. While crude, that approach is known to be effective for just-in-time compilers. Additionally, PyPy's JIT is "meta" in the sense that it traces the execution of the interpreter while it runs some user code, instead of tracing the user code directly. This again simplifies the compiler. We will explore how all this works together and how it is implemented (spoiler: it's Python all the way down!). This talk assumes no prior knowledge of compiler theory nor of PyPy internals, and should be of interest to anybody who wishes that their pure-Python code would run faster. The audience will gain a firmer understanding of how PyPy operates and optimises code, and how to get the most out of the PyPy JIT.
Keywords
Transcript: English (auto-generated)
Thank you. So for those of you who were at a previous talk, you saw how to build a JIT just-in-time compiler from the ground up. In this talk, I guess we're going the other direction.
We will be starting from a finished product, a complicated piece of software, PyPy, and we will try to figure out how the JIT inside it works.
So about myself, I'm Ronan Lamy. I've been a PyPy core developer for about seven years now, and before this talk, I actually didn't know that much about the JIT.
For me, it was a magic black box, and by preparing this talk, I started to open it, and I want to show you how you can open it too. So let's talk about PyPy first, and the way I like to introduce it is from someone whose
name you might know. If you want your code magically to run faster, you should probably just use PyPy, and that's
the goal of PyPy. Run Python code faster. So nowadays, PyPy is a fast and fully compliant implementation of Python. We have, and we've had for several years, complete support for Python 2.7, and we
will continue to support 2.7 for an indefinite period. We've also released a version that supports Python 3.5.
We have beta support for 3.6. It's not fully complete yet. There's still a few things in the standard library to implement, but we should have
a full release of 3.6 in the not too distant future. I don't want to give a date, but this year, definitely. Well, of course, the main draw to PyPy is that it can be much faster than CPython when
running your Python code. I don't like to give exact numbers because it depends a lot on the kind of code you're running.
Some code can't really be improved above what CPython does, but for other sorts of code, you can have huge speed ups, and it can be a game changer in terms of order
of magnitude, you get from Python speed to more or less C speed. And of course, that is all due to the JIT, which we'll talk about in a few minutes.
But I just want to remind you that pure Python code is not the only thing we care about when we use Python. So PyPy also has a good story for C code and all sorts of programming languages that speak the same interface as C. So the best way to talk to a compiled language from PyPy is to use CFFI.
It's convenient in general for the whole Python world, also on CPython, but on PyPy, CFFI is particularly well optimized and works well with the JIT. But of course, in the Python world, we can't get by without supporting all the
C extensions, so things like NumPy, scikit-learn, Cython. So PyPy has an emulation layer to support all these extensions.
It's annoying for us, because we could run code faster if it was written in Python instead of being written in C. But anyway, we have the compatibility. There's a site where you can get binary wheels; they are not yet on PyPI, so that's why you have to use this repo for now. And with that, I think PyPy offers very good compatibility for everything you'd like
to do in Python. But that's not the main point of my talk, I'd like to show you the internals of PyPy and specifically of the JIT.
So let's talk about the internals of PyPy first, and I'll start by comparing it with CPython, which most of you are probably more familiar with.
So in CPython, well, it's written in C, as the name indicates, and once you compile the sources, you get an executable that has two main components. First you have the compiler that takes your Python code and transforms it into bytecode, and the interpreter proper that actually runs the bytecode. In PyPy, well, it's quite similar. The first difference is that the language is different: PyPy is implemented in something called RPython, which is a subset of Python 2, and the RPython toolchain is something that was built by the PyPy project in order to transform the source code into the PyPy executable. So thanks to this toolchain, you get the PyPy executable, which has the same bytecode compiler as CPython, and a bytecode interpreter which is quite similar to CPython's, but it also has the JIT, which is the part on the right on this diagram.
So what the JIT does is that at runtime, it uses runtime information to optimize the
code that the interpreter is currently running, and it produces machine code which is a lot more efficient than interpreting byte codes one by one, and it can switch back and forth
between the interpreted and the assembler running mode. So, let's talk a bit about the toolchain.
So the start is RPython code. The code is imported at the Python level by the toolchain, and then the toolchain analyzes this Python 2 code, does type inference and a variety of things to reach a stage where the whole code for the interpreter is represented as control flow graphs. So it's a graph of all the operations that happen inside the interpreter.
And then from that representation, the toolchain adds the garbage collector, adds the JIT, and then transforms it into C code which is compiled, and at the end you get the
PyPy executable. So, I tend to explain this quite often. And then the next part of the conversation goes a bit like this.
What about the JIT? Well, it's just complicated, it's magic. But someone who's actually a PyPy core developer recently said,
if you spend enough time with it, any magic is just careful and clever putting of bits together. So, let's just spend some time with the JIT, and it won't seem as scary anymore.
So, to do that, we have to run some code, and I have a somewhat stupid example, but it has to fit on a slide, so it can't be too complicated.
And the idea is that you have a library for working with physical quantities, and a physical quantity has a value and a unit. And then when you do operations with these quantities, you have to look at the unit,
and then do the actual operation. So, for simplicity, we only implement addition here, and we just check if it's the same unit, because you don't want to add meters and seconds,
and we don't want to bother about weird things like feet and yards. And if we have the same unit, we just return a new object that represents the addition of the two. And since we were interested in performance,
and what the JIT can do with such code, we have a simple and rather stupid benchmark, where we just add these quantity objects repeatedly for 500 million times.
So, that's quite a lot of operations, but let's see what PyPy can do with it,
and what the JIT can do with it. So, here I'm in a PyPy3 virtualenv, so I just run time python on the file, and it takes less than a second.
Well, that's a very, very crude benchmark, so let's run it twice to see if it's stable. Well, it apparently is. Well, let's compare with CPython to get a feel.
CPython shouldn't be too long now.
Alright, well, 12 seconds, okay. So, well, 12 times faster, that looks decent. But actually I cheated. In reality, the code has a slight addition to what I just showed you.
So, on CPython, you can see down at the bottom, yes, I'm running 100 times fewer iterations on CPython,
because I didn't want to wait for the whole talk for it to return. So, on that specific example, PyPy is actually more than a thousand times faster than CPython. And, well, it looks quite magical,
but I'll remind you that this is still just, when you're running on PyPy, it is just like regular Python, you can interrupt it in the middle and you'll get a trace back.
You can, well, you can even run it under PDB. Oops, that's not what I wanted to do. I want to run, I want to continue.
Yes, and here I've started running the program under PDB and I've interrupted it, so now I can inspect. So, inspect the value.
So, we are at the 418,053,951st iteration of the loop and we can still do whatever we want with the PDB.
Despite having all the power of Python, this runs in just one second for 500 million iterations. So, well, let's try to figure out how this happens.
And the first thing, the first level we need to look at is the same as you would do in CPython. We should first look at the bytecode
because that's what the interpreter actually sees. It doesn't see the whole source code, it sees the bytecode. So, let's import our code.
And to inspect bytecode, there's a very simple tool which is this. Look at the compute function. We can see bytecode.
I won't be talking about bytecode too much. I hope most of you have at least seen something like that before. But in this bytecode, you can see the interesting part
which I'll try to highlight. So, this which I've highlighted is the code for the loop and it's actually pretty simple and you've seen in the source code it's just a plus equal.
So, in bytecode, this turns into mostly the in-place add bytecode with a few operations before and after to put things on the stack.
The bytecode language is a stack-based language. So, you put things on the stack and then every operation pops things from the stack and puts the result back on the stack. So, let's now have a look at this in-place add bytecode.
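The "very simple tool" is, in all likelihood, the standard library's dis module; a small sketch of using it on a function like the one in the talk (the exact opcode names vary between Python versions):

```python
# Disassembling a function with the standard library's dis module.
import dis

def compute(n):
    total = 0
    for _ in range(n):
        total += 1   # the += compiles to an in-place add opcode
    return total

dis.dis(compute)
# Among the LOAD/STORE stack-manipulation instructions you will find the
# in-place add: INPLACE_ADD on older CPythons, BINARY_OP on 3.11+.
```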
And this is the slightly simplified source code
for the in-place add bytecode inside PyPy. The real thing does exactly the same thing; it just has more meta-programming that obscures things. And since the implementation of PyPy is in Python, and it looks like Python 2, we can easily read it. And what it does is: first it takes values from the stack, as I explained, and then it does the calculation
and at the end it pushes the result. So, what is the actual calculation that happens inside the interpreter when you do a plus equal? Well, first you look up dunder iadd on the object, on the left object. And then there's something complicated which doesn't happen in our case, because we didn't implement dunder iadd. So, then, since we don't have dunder iadd, we fall back to doing the same thing as simple addition. So, then, addition proper is more complicated, because you have to check the types of both arguments.
Then you look up dunder add on the left argument, and then in certain cases you will look up dunder radd on the right argument. But as it happens, in our case, our two objects are of the same type,
they are of this quantity type. So, we end up here and then we just call this dunder add method
which we've looked up on the type. So, now if we want to go deeper we have to actually talk about the JIT
because this is pretty much the same logic as in CPython.
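The dispatch the speaker walks through can be sketched roughly like this in plain Python. It is a simplification of what both PyPy and CPython actually do; the real logic has more corner cases (subclasses, operand ordering), and the function name is illustrative:

```python
# A simplified sketch of the interpreter's logic for a += b.
# Not PyPy's actual source; the real dispatch handles more corner cases.

def inplace_add(w_left, w_right):
    # First, look up dunder iadd on the left object's type.
    iadd = getattr(type(w_left), "__iadd__", None)
    if iadd is not None:
        return iadd(w_left, w_right)
    # No __iadd__: fall back to plain addition.
    add = getattr(type(w_left), "__add__", None)
    if add is not None:
        result = add(w_left, w_right)
        if result is not NotImplemented:
            return result
    # In certain cases, try the reflected method on the right argument.
    radd = getattr(type(w_right), "__radd__", None)
    if radd is not None:
        result = radd(w_right, w_left)
        if result is not NotImplemented:
            return result
    raise TypeError("unsupported operand types for +=")
```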
So, the magic happens in the JIT and so let's talk about first about how the PyPy JIT was designed. The JIT in PyPy is what's called a tracing JIT.
The idea behind a tracing JIT is that first, well, I guess it's the idea behind all JITs,
you consider that most of the time in your code is spent in very few lines. So, if you compile just in time,
you can compile only a small part of your code and get large performance benefits. And the other important principle is that when you have a conditional in your code, in most cases you pretty much only ever take one of the branches.
Like as I showed you in the in-place add bytecode, you need to look up the dunder iadd method,
but we didn't implement it in our user code, so this check will always fail and we will always fall back to doing just an addition, and not the special in-place addition logic.
So, the idea is that we should, first we compile only the hot loops, the parts of the code where we see that the program that is currently running is spending time.
And then we should also optimize for the fast path, so try to, we can assume that in most cases we only have to consider one branch of an alternative.
And of course the one thing that gives an advantage to just-in-time compilers over ahead-of-time compilers is that with a just-in-time compiler,
you know what you are running on, you are at runtime, you have already seen the user code. So, it's easier to optimize for what the user wants to run.
And therefore the JIT works, well it traces the code. So, to compile some optimized code, the first thing is to trace one iteration of a loop
and to record all the operations that were made in that iteration of the loop.
And that way you can include runtime information in that trace. And once you have the trace then you can optimize it.
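Tracing itself can be illustrated with a toy: wrap runtime values in objects that record every operation performed on them, producing a linear list of operations plus the guards under which they were recorded. This is purely a teaching sketch, not how PyPy's tracer is built:

```python
# Toy tracer: running code on wrapped values records a linear trace.

TRACE = []

class Traced:
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Record a guard: the trace is only valid for these operand types.
        TRACE.append(("guard_type", type(self.value).__name__,
                      type(other.value).__name__))
        result = Traced(self.value + other.value)
        TRACE.append(("int_add", self.value, other.value, result.value))
        return result

# "Trace" one execution of a + b with concrete runtime values.
a, b = Traced(2), Traced(3)
c = a + b
```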
But you also need to add what is called guards to check that the conditions under which the trace was recorded are still valid. And finally, an important idea in PyPy is that because Python is so complicated,
because one bytecode like in-place-add can do so many things, it would be very hard to trace at the Python level.
So, to simplify the implementation in PyPy, we want to trace what the interpreter does. We want to have a record of that implementation I showed you and try to run that.
And as a side effect, doing that allows the JIT to stay in sync with the interpreter.
So, the way to record what the interpreter does is called JIT code. So, to create those JIT code, the first thing is that the implementation of PyPy
contains hints that tell the toolchain where things can be jitted, where you have an opportunity for optimizing loops. So, that's called JIT driver and the main one, of course, is on the main bytecode dispatch loop
so that you know that when you jump back to the beginning of a loop, you know that you can start JITing under certain conditions.
And then the toolchain just follows all the code that is reachable from these JIT drivers. There are hints, like decorators, like dont_look_inside, that control which code the tracer follows.
And then you have things like Elidable, which allow certain optimizations
by telling the JIT that some function is a pure function in the sense of functional programming. It's referentially transparent. You know what that means. And therefore you don't need to run it again.
And you can also declare some attributes of certain objects as quasi-immutable so that it means that the JIT will assume they are always constant,
but they can change, but the JIT assumes they don't change. So, with all that, the toolchain just converts the internal representation of the interpreter
to this format that is optimised for size.
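These hints live in RPython (the elidable decorator comes from rpython.rlib.jit, and quasi-immutability is declared on the class), so they don't run under plain Python; but the effect of elidable can be imitated with memoization: a pure function's result can be cached, or, in the JIT's case, constant-folded out of the trace entirely. A stand-in sketch, with illustrative names:

```python
# Plain-Python stand-in for the "elidable" hint: treat the function as pure
# and cache its results. The real hint (rpython.rlib.jit.elidable) lets the
# JIT constant-fold such calls out of the trace instead.
import functools

CALLS = {"count": 0}   # counts body executions, only to demonstrate caching

def elidable(func):
    return functools.lru_cache(maxsize=None)(func)

@elidable
def unit_name(symbol):
    # Stands in for an expensive lookup whose result never changes.
    CALLS["count"] += 1
    return {"m": "metre", "s": "second"}[symbol]

unit_name("m")
unit_name("m")   # second call never reaches the body
```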
Then, when tracing, what the JIT does is that it uses the JIT codes and interprets them, actually.
So, during tracing, the PyPy interpreter is running on top of another weird interpreter that runs these JIT codes and records all its operations.
So, while it does that, it records guards. The guards are usually simple checks; they are the conditions under which the recorded trace stays valid. When a guard fails, there's a slight complication,
because the code needs to exit the optimised path and fall back to the interpreter.
And there's also a different sort of guard, where you assume that something is true and, if it turns out it isn't, then the whole trace is thrown away. So, you don't need to check that guard when you execute the trace.
So, that's very efficient. And I'd like to show you what those traces look like. So, for that, I'll use VMprof.
It's a statistical profiler, but most importantly, today, it can show the traces. And to use it, you have this VMprof that you can install
and it records profiling information and JIT information, which you can then visualise on a server.
Here, I'm running locally. There's also an option to run it in the cloud. And if I open this... Yes.
So, here's what the JIT records when running our code. And you can see here, in the in-place add, there's a whole lot of operations.
But you can see that there is this quasi-immutable operation right here, which comes from a lookup on the type object.
So, I guess that one is the lookup of the dunder iadd method on the object. And it's recorded as quasi-immutable.
So, that means the JIT won't have to... Well, in the end, not have to worry about it. So, let's go back to the presentation. So, once we have traced the raw operations,
there are important optimisations that reduce the number of operations, because that was a huge lot of operations. So, you have classic compiler optimisations.
There is tracking of the possible values of ints, in order to remove, for instance, index checks on array access.
It removes the guards that are useless or are implied by other guards. And the most important optimisation is virtualisation. So, when objects don't escape the loop,
or when they don't usually escape the loop, then they don't need to be created at all. And they will only be created on demand.
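For the quantity benchmark, virtualisation is roughly what you would get by rewriting the loop by hand so the intermediate objects never exist; the Quantity class here is an illustrative stand-in, and the rewrite is done manually only to show what the optimiser achieves automatically:

```python
# What "virtualisation" buys, done by hand: the intermediate Quantity
# objects never escape the loop, so the optimised version keeps only the
# raw float and allocates a single object once the value escapes.

class Quantity:
    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

def compute_naive(n):
    total = Quantity(0.0, "m")
    step = Quantity(1.0, "m")
    for _ in range(n):
        # one heap allocation per iteration
        total = Quantity(total.value + step.value, "m")
    return total

def compute_virtualized(n):
    total = 0.0                    # object exploded into its raw field
    for _ in range(n):
        total += 1.0               # no allocation in the hot loop
    return Quantity(total, "m")    # materialised only when it escapes
```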
And that way, you remove allocations, which are very expensive operations. And another important optimisation is unrolling.
Instead of optimising one loop, you first run one iteration of the loop. And then, very often, in loops, you have things that are always the same.
You have, basically, loop invariants. So, by running this first iteration, you compute all these things that stay constant. And then, you can have a second iteration that doesn't need to repeat these loop-invariant operations.
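Hand-peeling a loop shows the idea: the first iteration establishes the loop-invariant facts (here, a unit check), and the repeated body no longer redoes them. In the real JIT the peeled iteration's facts become guards rather than being silently assumed; this sketch, with illustrative names, only shows the shape of the transformation:

```python
# Loop peeling by hand: hoist a loop-invariant check out of the hot loop.

def sum_checked(values, unit):
    # Naive version: the invariant check runs on every iteration.
    total = 0.0
    for v, u in values:
        if u != unit:                 # repeated, loop-invariant work
            raise ValueError("unit mismatch")
        total += v
    return total

def sum_peeled(values, unit):
    if not values:
        return 0.0
    v0, u0 = values[0]
    if u0 != unit:                    # checked once, in the peeled iteration
        raise ValueError("unit mismatch")
    total = v0
    for v, _u in values[1:]:
        total += v                    # hot loop: the check is gone
        # (the JIT would insert a guard here instead of assuming it)
    return total
```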
And that second iteration is what is actually repeated all the time. And after that, the trace and this sequence of operations is passed on to different backends.
So, every architecture needs a different backend. And, well, that's where you actually emit the assembly. And it's relatively simple compared to the rest of the JIT.
And in the end, you just have, for each operation, a simple assembly to emit.
Let's have a look at the final result. So, first, after the optimisation, here we are...
Sorry, wrong one. After the optimisation, without unrolling in this in-place add, you see that there are still a lot of different operations.
But after all the optimisations, after unrolling, this in-place add has removed all the checks on the different types.
It has removed the function call. And, well, the only thing that is left is, I think, incrementing the loop counter. And you can see the final assembly that is generated.
It fits right here. It's only about 10 assembly instructions. So, that's why it is very fast.
So, let's conclude. First thing is that on this benchmark, the JIT is quite unreasonably effective.
It somewhat fortunately manages to remove all the operations, which makes a big difference over what CPython has to do. But more generally, the way the JIT works is that
the toolchain contains a generic framework that is pretty much Python agnostic. And in order to get the massive performance benefits you've seen,
the interpreter needs to exploit the features of the toolchain. And together, they give you abstractions for free. They remove most of the overhead that comes from using a dynamic language like Python.
So, if you want to know more, you can contact us on IRC. We like IRC. Just an announcement.
We'll have a help desk tomorrow morning and we'll be at a sprint, so come talk to us. And I hope we have time for a couple of questions. Thank you. Thank you. So, we have time for one, maybe two questions.
Anybody? Please come to the microphones. Hey, thank you very much for the talk. So, I'm going to ask some kind of a mean question. So, in practice, sometimes I've seen PyPy behave slower than CPython.
And so, we're often shown this kind of canonical example of a for loop, where PyPy does so much better than Python. And that's great. But I think it would be useful also if you could point to some, I don't know, pathological coding patterns in which PyPy would perform slower because of some overhead that are present that are not present in CPython.
Well, it's always hard to really understand the bad cases. But basically, the bad cases tend to be when the JIT is unable to remove the overhead.
So, here, for instance, if you have dictionary access, that is somewhat slow.
And here, the performance was very good, because all the dictionary access could be removed, because there's this quasi-immutable mechanism for type objects.
I didn't talk about it, but there's an optimization for instance attributes as well. But sometimes you disable that, and then the performance suffers.
That's very interesting, because actually my test case implies a lot of direct dictionary access. So maybe it's very difficult. Well, every case is a bit different, so I can't really give a general answer. Okay, so if you have any more questions, please find Ronan and the PyPy guys at their help desk or at the sprints.
And thank you again.