
How to write a JIT compiler in 30 minutes


Formal Metadata

Title
How to write a JIT compiler in 30 minutes
Number of Parts
118
Author
Antonio Cuni
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
Real-world JIT compilers like PyPy and Numba are complex and advanced. However, the basic ideas behind JIT compilers are easy to understand, as this talk aims to show. This is a live-coding exercise: we will start from a blank page and write a working (albeit simple and limited) JIT compiler from scratch.
Transcript: English (auto-generated)
Hello everybody. I will introduce myself quickly because we don't have much time to write a JIT compiler, so let's start right away. I am Antonio Cuni, and I have been a PyPy core developer since 2006.
I also do some other open source stuff. So the scope of this talk is to show that JIT compilers are not as hard as you might think at first glance, and so we are going to write a very, very simple JIT compiler for 64-bit x86.
We are going to compile a subset of Python and we are going to make some assumptions to keep it simple: all the variables will be of type float and we won't compile all the instructions, just a very simple subset, but it's enough to show how things work.
We are using a library to encode the assembly instructions because, well, for the Intel architecture that's a hard task by itself. And the disclaimer is that before writing this talk I had never written any assembler code before,
so this really shows that it is kind of easy enough if you know more or less what to expect. And since I didn't know much about assembler, what I did was to compile C programs with GCC, look at the generated assembler and basically copy and paste.
So, for example, you start with this very simple function to add two double numbers and return the result, so what I did was something like this, and then with objdump,
okay, so these are basically the bytes that you need to put in memory in order to execute this logic. And then the next thing you want to try is, well, how to execute something like this from Python.
So basically these are the bytes that I showed you before. These are really a copy and paste of the objdump output, and they correspond to these assembler instructions, so what you need to do is to allocate some memory.
You need to allocate it with mmap because you need to tell the kernel that you want to execute this piece of memory. You put the code inside a buffer and then you use, for example, CFFI to tell Python that this particular piece of memory
contains a function which takes two doubles and returns a double, for example, so let's see if it works. So you see that here we have our buf, which is a buffer, so with CFFI we can get a pointer to this memory,
we can cast it to a function pointer, and we get a CFFI function pointer, and then we can call it, and it works.
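A minimal sketch of this step, assuming Linux on x86-64: the five bytes below encode addsd xmm0, xmm1 followed by ret, which matches the System V convention of passing the first two double arguments in xmm0/xmm1 and returning the result in xmm0 (hardened kernels that enforce W^X may refuse the writable-and-executable mapping).

    import mmap
    import cffi

    ffi = cffi.FFI()

    # machine code for:  addsd xmm0, xmm1 ; ret
    code = b"\xf2\x0f\x58\xc1\xc3"

    # ask the kernel for an anonymous page that is writable *and* executable
    buf = mmap.mmap(-1, mmap.PAGESIZE,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
    buf[:len(code)] = code

    # tell CFFI that this memory contains a function taking two doubles
    add = ffi.cast("double(*)(double, double)", ffi.from_buffer(buf))
    print(add(17.0, 25.0))   # 42.0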
Okay, this is the second time I do this talk, and the first time I got a segfault at this point, so I'm improving. So back to the code, okay, so basically this is the basic idea, so we need to allocate the memory
and fill it with the right instructions and execute it, and that's how you write a JIT compiler. But then, of course, it's much more complicated than this, so let's start from tests. So I want to test that this is going to work, so I'm going to write a CompiledFunction class which wraps all of this.
So let's see: py.test test_jit... what is it? Okay, I don't have this class yet, let's write it.
So, okay, I'm going to cheat a bit because I didn't have enough time to, well, type everything live, so what we want to do is something like this, I think, so we allocate the buffer, we copy the code inside,
we cast it to the right pointer type, and when you call the compiled function,
well, you just call the CFFI function, and let's see if it works, and it doesn't.
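Wrapped in a class, the same steps could look roughly like this; the class name and the (nargs, code) constructor follow the talk, not any published API.

    import mmap
    import cffi

    class CompiledFunction:
        """Wrap a buffer of x86-64 machine code as a callable returning a double."""

        def __init__(self, nargs, code):
            self.ffi = cffi.FFI()
            size = max(len(code), mmap.PAGESIZE)
            self.buf = mmap.mmap(-1, size,
                                 flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                                 prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
            self.buf[:len(code)] = code                       # copy the machine code in
            sig = "double(*)(%s)" % ", ".join(["double"] * nargs)
            self.fptr = self.ffi.cast(sig, self.ffi.from_buffer(self.buf))

        def __call__(self, *args):
            return self.fptr(*args)

    # same bytes as before: addsd xmm0, xmm1 ; ret
    fn = CompiledFunction(2, b"\xf2\x0f\x58\xc1\xc3")
    print(fn(17.0, 25.0))   # 42.0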
Okay, better, so we are running some code and it's working. So the next thing we need to do for writing a JIT is to handle registers: modern CPUs have many registers, and when you run your compiled code,
you need to decide which values go in which registers, so we are going to write a very simple register allocator in which you assign a variable to a register, and it's so simple that we basically just assign them once.
Real modern JIT compilers have to decide when to use a register, when to reuse it for something different,
when to spill the value of a register to some temporary location in memory, et cetera, but since this is very simple code, well, we will just use one register for each variable, and if you use too many variables, then we just give up.
So for example, we maintain a dictionary which maps a variable to a register,
we use this defaultdict trick so that we don't have to maintain it by ourselves,
and if we no longer have registers available, then we just raise NotImplementedError, because it's simple. In real code, you would need to handle this case and save the value of a register to some location in memory,
and then all the complicated logic. And then to get the register of a variable, you just look it up in this dictionary.
So when the variable is not in the dictionary, it goes through the default logic, allocates one, et cetera. So let's see if our tests work: you see, we are checking that the variable a gets the first register, and the second, and the third, et cetera. So, okay, live coding.
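The whole register allocator fits in a few lines. Here is a sketch using the defaultdict trick just described; registers are plain strings here, whereas an assembler library would use register objects.

    from collections import defaultdict

    class RegisterAllocator:
        """Assign each variable its own xmm register, once and forever."""

        def __init__(self):
            self._next = 0
            # the first time an unknown name is looked up, _allocate() is called
            self.vars = defaultdict(self._allocate)

        def _allocate(self):
            if self._next >= 16:        # x86-64 only has xmm0..xmm15
                raise NotImplementedError("too many variables: no registers left")
            reg = "xmm%d" % self._next
            self._next += 1
            return reg

        def get(self, varname):
            return self.vars[varname]

    allocator = RegisterAllocator()
    assert allocator.get("a") == "xmm0"
    assert allocator.get("b") == "xmm1"
    assert allocator.get("a") == "xmm0"   # already allocated, same register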
Okay, so we are slowly getting to the point where we can actually compile things, because now we have a CompiledFunction which takes care of handling the low-level stuff,
and a register allocator which we will use later, so let's try to write a real AST compiler. I'm not going to parse the Python code myself because it's complex and there is no time, so I'm using the ast module: with ast, you take Python source code
and you get a tree which represents your instructions; I will show you some pictures. Okay, and then what I want to do is to have a class to which I can pass some Python code,
compile it and get a function that I can then call. And I want to check that the empty function just returns zero. In my toy language, you have to return a number, so if you don't return anything, it returns zero. Of course, if we run the test, we get an error because this class doesn't exist.
So let's cheat some more. So I'm using the visitor pattern, I will show you in a while. So basically the idea is that this class takes the source code, it's parsed with the AST module of Python,
and when I compile it, I do some magic to generate the assembler code as bytes, put them in the buffer, and then I can return the CompiledFunction I just wrote. And the visitor pattern, okay, I will show you in a while, but basically it means that from this tree
that I will show you very soon, I will call a method of this class for every node of the tree. So if I run the test, I see that I have this NotImplementedError for the Module method, so let's write it.
Okay, so now I can use self.show to show the AST tree, and it should appear there, yes?
So this is the tree representation of our source code. So you see that you have a module which contains a function def, which is named foo, it doesn't have any argument, and it contains a statement, and the statement is pass. Okay, so the visitor pattern I am using here, basically it says that, okay, we have a module,
and then the module node has an attribute which is body which contains all the statements.
So what I want to do is for child in node.body, self.visit child. And this is going to recursively visit all the nodes, and the idea is that when you visit a node,
you emit the assembly code for the statement you are visiting now. So now I expect that if I run it, I will get an error because the FunctionDef method doesn't exist, and indeed.
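The skeleton of the class might look like this. It is only a sketch: CompiledFunction is the wrapper from before, while self.nargs and asm.assemble() stand in for helpers that only appear in the later steps.

    import ast

    class ASTCompiler:
        """Compile a tiny subset of Python by walking its AST with the visitor pattern."""

        def __init__(self, src):
            self.src = src
            self.tree = ast.parse(src)

        def show(self, node):
            # handy during development: dump the (sub)tree we are looking at
            print(ast.dump(node, indent=4))      # indent= requires Python >= 3.9

        def compile(self):
            self.visit(self.tree)                # emits machine code as a side effect
            return CompiledFunction(self.nargs, self.asm.assemble())

        def visit(self, node):
            # dispatch on the node class name: Module -> visit_Module, etc.
            meth = getattr(self, "visit_" + type(node).__name__, None)
            if meth is None:
                raise NotImplementedError(type(node).__name__)
            return meth(node)

        def visit_Module(self, node):
            # a Module is just a list of statements: visit each child
            for child in node.body:
                self.visit(child)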
So let's implement FunctionDef: visit_FunctionDef(self, node). The arg names of the function are something like arg.arg for arg in node.args.args, something like this.
So I create a new function, I will show you the implementation of this in a while,
and then the important thing is again to recurse into the children of this node and visit them. And since we need to return something, because in assembler we can't return None,
we return zero by default, and we return it using the asm object which I will create shortly. The usual assembler way of producing zero is to XOR a value with itself, so we XOR the xmm0 register with itself, and we return it.
And then of course, now we need the new_function helper, and what it does is to create the asm thing,
which is the function assembler, which is something that I didn't show, but basically it makes it possible to just write nice mnemonics as methods instead of having to put the bytes in the buffer by hand.
It is the PeachPy library I was telling you about. Then we need to create our register allocator and allocate a register for every argument,
and the calling convention says that the first argument is in xmm register number zero,
then the second in xmm1, et cetera, so since our register allocator returns them in the right order, if we allocate the arguments first, we already ensure that the values of the arguments are in the correct registers. And since I'm going to use two temporary registers during the compilation,
well, I just allocate them. This is like having two temporary variables. Hopefully the source code will not use these names.
If they do, well, you have a problem. So, of course, in a real-world example, you would need to handle this case. So let's see if we are improving. And yes, now we just need to implement the pass statement, which is the leaf of our body, and let's implement it by, well, we don't need to emit any code for the pass statement, of course.
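Sketched as code, these three visitor methods could read as follows, assuming a PeachPy-like FunctionAssembler helper whose methods are named after the x86 mnemonics (the helper itself is not shown).

    def visit_FunctionDef(self, node):
        argnames = [arg.arg for arg in node.args.args]
        self._new_function(argnames)
        for child in node.body:
            self.visit(child)
        # we must return *something*: default to 0.0 by xor-ing xmm0 with itself
        self.asm.XORPS("xmm0", "xmm0")
        self.asm.RET()

    def _new_function(self, argnames):
        self.asm = FunctionAssembler()           # hypothetical mnemonic-emitting helper
        self.nargs = len(argnames)
        self.allocator = RegisterAllocator()
        # System V calling convention: float arguments arrive in xmm0, xmm1, ...
        # allocating them first leaves each argument in the register it arrived in
        for name in argnames:
            self.allocator.get(name)
        # reserve two scratch registers for evaluating expressions
        self.tmp0 = self.allocator.get("__scratch0__")
        self.tmp1 = self.allocator.get("__scratch1__")

    def visit_Pass(self, node):
        pass                                     # nothing to emit for `pass`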
And success. So we manage by writing these methods and following the visitor pattern to pass this very first test, which took a while of putting things all together,
but now we have a framework in which we can easily implement all the other features of our JIT. So the second test we want is to return a simple constant. So let's see what happens.
Okay, I need to implement the return statement. So, visit_Return(self, node): I want to show you the AST again. So this is the AST I have now, and you see I have the return statement, which contains the Num expression,
which takes the constant 100. So what I need to do, well, is to write methods for all these node subclasses.
So the idea for evaluating an expression is to use the stack. Whenever you have a tree of expressions with binary expressions, operands, et cetera, what you want to do is to evaluate the two children, put them on the stack, then pop the two values from the stack and compute the result.
This way you can easily handle deeply recursive data structures like binary expressions and expressions in general. So what I want to do for the return statement is to visit node.value.
This is going to put the value 100 on top of the stack, so then I want to pop the value off the stack into the register xmm0, which is the register used to return a value from a function.
And then I want to emit the return instruction in assembler. And this, of course, is going to fail now because I need to implement the Num thing, and then when I visit a Num node, what I want to do is to put a number on the stack.
So I need to use the movsd instruction: I want to load the value into the temporary register 0, and I want it to be a constant value, and the value is node.n.
So in my example node.n is 100. And then once I have the value in the register, I can push it on the stack. So you see what happens? So by using visit, I'm walking down the tree,
and then when I emit the code for the leaves, I put values on the stack (the stack exists at runtime), and then, visiting the nodes further up, you pop values off the stack, you compute, you put the result on the stack again, et cetera. And by doing this, at the end, you have the result you are supposed to have.
So let's see if it works. Okay, pushsd, I think it is, and it works.
And so we basically need to continue; I have more tests, so I want to handle arguments. In this case, we pass two arguments, which are going to be in registers xmm0 and xmm1, as I said,
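In code, the two methods might look like this; pushsd and popsd are the small helpers mentioned later in the Q&A that move a double between an xmm register and the machine stack, asm.const() stands in for however the assembler references a float literal, and ast.Num is ast.Constant on modern Pythons.

    def visit_Return(self, node):
        self.visit(node.value)       # evaluates the expression, leaves it on the stack
        self.popsd("xmm0")           # the calling convention returns a double in xmm0
        self.asm.RET()

    def visit_Num(self, node):       # ast.Constant on recent Pythons
        # load the constant into a scratch register, then push it on the stack
        self.asm.MOVSD(self.tmp0, self.asm.const(node.n))
        self.pushsd(self.tmp0)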
and I want to return one of the two. So if we execute, we don't have the name, so let's see what's happening.
So this is our tree now; you see the function def contains the arguments, and these arguments are going to be put in registers by the register allocator,
and then I want to return an expression which is a Name, so I need to write the logic to load the name from its register and to put it on the stack.
Which register? Well, of course, we have a register allocator, so the register allocator takes care of giving me the correct register from the name of the variable, which is node.id, and then I can pushsd, and it should be enough to pass the test,
because now we are doing the same as before, but instead of having a Num, we have a Name, which puts the right value on the stack, and it works. I'm impressed that it works, I'm not making any dumb mistakes.
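The corresponding visitor method is a two-liner, using the same helpers as before.

    def visit_Name(self, node):
        # the allocator tells us which xmm register holds this variable
        reg = self.allocator.get(node.id)
        self.pushsd(reg)             # copy it to the top of the stack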
So let's try something more complex. Now things are getting interesting. We want to handle operations now, so how does it look? We need the BinOp node. So, visit_BinOp(self, node): self.show(node).
Let's see how it looks. This is the tree. So you see that now you start to see the recursive structure of the tree: you have the function def, which has a return statement; the return statement has a BinOp expression,
which is Add, and the two operands of the expression are two names, a and b. So the idea is that to implement a binary operation, we want to visit the left child to evaluate it and put it on top of the stack,
then visit the right child to put it on top of the stack, then we pop the two values from the stack, we add them together, and we push the result again on the stack. So in this way, if you have a complex expression with many binary operations and parentheses and et cetera, well, by doing this push and pop, you get the right precedence of operations.
Oh, there's a segmentation fault. I managed to get the segfault. So for example, in this case, the op name, the Add that you see in the picture, is basically the name of the class
of the op attribute of the node. So what I want to do is, well, for now, I'm just implementing Add for the addition,
so what we want to do is to visit the left child, to visit the right child, and then I want to pop the right value
into the temporary register one, pop the second value, and add them together; the addsd assembly instruction is basically a plus-equal,
so it's going to sum temp zero and temp one and put the result in temp zero. So now I want to put the result of this on the stack.
So if everything is okay, it should work: okay, addsd, and it works.
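For the addition-only case, the method might look like this.

    def visit_BinOp(self, node):
        self.visit(node.left)                  # push the left operand
        self.visit(node.right)                 # push the right operand
        self.popsd(self.tmp1)                  # right operand -> tmp1
        self.popsd(self.tmp0)                  # left operand  -> tmp0
        self.asm.ADDSD(self.tmp0, self.tmp1)   # tmp0 += tmp1
        self.pushsd(self.tmp0)                 # push the result back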
So, next test. Okay, I want to check that I can use more instructions and not only the add, so the assertion is failing because, well, the operation is no longer the add, so what we want to do is something like this.
Okay, let's see if I can copy and paste to speed up things, something like this.
So for each instruction, I get a new operation, so what I can do is something like ops[op_name] and then call it and see how it's working.
So the next test is assign. I want to be able to assign values to variables, so I need to implement this Assign node,
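Generalizing, the class name of the AST operator can be used to look up the matching SSE instruction, for example:

    _BINOPS = {
        "Add":  "ADDSD",
        "Sub":  "SUBSD",
        "Mult": "MULSD",
        "Div":  "DIVSD",
    }

    def visit_BinOp(self, node):
        self.visit(node.left)
        self.visit(node.right)
        self.popsd(self.tmp1)                    # right operand
        self.popsd(self.tmp0)                    # left operand
        opname = type(node.op).__name__          # "Add", "Mult", ...
        instr = getattr(self.asm, self._BINOPS[opname])
        instr(self.tmp0, self.tmp1)              # tmp0 = tmp0 <op> tmp1
        self.pushsd(self.tmp0)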
and again, I think I'm running out of time, so I'm going to cheat a lot, and instead of typing, I will just copy and paste, sorry.
But by doing it this way, we are going to have more time for questions. What's happening here? But as you can understand now, the logic is similar. So the node.target contains the name of the variable for the assignment, so I get the register corresponding
to the variable, I visit the value, I put it on the stack, I pop it, and I move the value from the stack into the register which represents the variable.
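A sketch of the assignment visitor; note that ast.Assign actually stores a list of targets, so the code indexes node.targets[0].

    def visit_Assign(self, node):
        assert len(node.targets) == 1, "only single-target assignment is supported"
        reg = self.allocator.get(node.targets[0].id)   # register holding the variable
        self.visit(node.value)                         # push the value of the expression
        self.popsd(reg)                                # pop it into the variable's register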
And then we have something which is a bit more complex, which is if, so now we have to do labels. So again, I'm going to cheat a bit. So basically the idea of this is that I want to,
okay, I'll show you the node, it's easier, okay? So you see now that the tree is getting complex, but basically the if has a test,
which is the expression in the if, and then it has a body attribute which contains what you want to do in case the condition is true. It also has an else clause, but we are not going to implement it. So what we want to do is to check
that we don't have any else, because we are not going to implement it, then to declare a label for the "then" branch,
and to declare a label for the end: these are points to which we can jump. So this is, more or less, the code we are going to generate, and now I'm going to generate it for real.
So the test operation is inside the test attribute, so we are going to read it and put it on the stack. We are going to implement only the less-than comparison for now. So if it's a less-than, I have the value on the stack,
so I can jump if it is below, and I jump to the "then" label; and if I didn't jump to the "then" label, I want to jump to the end label, so it's asm.JMP(end_label).
And then I can declare where the label is, so at this point in the assembler, this is where my "then" label is, so it means that if the condition is true, I'm jumping to this point, and I'm not executing the jump to the end;
this is how it works at the low level. And then I can visit the children, and by doing this, it means that if the condition is true, we jump from here to here,
and then we are emitting the code for the return, for example, and if the condition was false, we jump to the end label, which is here, and so we don't execute anything, basically;
so this is the logic. Again, I am running out of time, so I will copy and paste the compare, which is the same logic again and again: I visit the left node of the comparison, I visit the right node,
and I use this instruction to say, okay, I want to compare them, and depending on what the comparison was (in this case it was Lt), I know that here I have to use jump-below; if it was LtE, which is less-or-equal, I have to use jump-below-or-equal,
or something like this. So, let's see if it works. Assertion error. I don't think it's None, but probably it's an empty list,
now that I think of it. Yes, not. So, self.asm. We are pair programming here, you should have told me that I was wrong. I mean, this is not pair programming, it's like hundreds of people programming. Okay, it's working.
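One possible shape for these two methods, assuming the assembler helper offers new_label() and place() for labels plus the usual conditional-jump mnemonics.

    def visit_If(self, node):
        assert not node.orelse, "else: branches are not implemented"
        then_label = self.asm.new_label("then")
        end_label = self.asm.new_label("end")
        self.visit(node.test)                 # emits the compare, records which jump to use
        jump = getattr(self.asm, self.jump_if_true)   # e.g. self.asm.JB for '<'
        jump(then_label)                      # condition true:  skip over the JMP
        self.asm.JMP(end_label)               # condition false: skip the body
        self.asm.place(then_label)
        for child in node.body:
            self.visit(child)
        self.asm.place(end_label)

    def visit_Compare(self, node):
        assert len(node.ops) == 1, "chained comparisons are not supported"
        self.visit(node.left)
        self.visit(node.comparators[0])
        self.popsd(self.tmp1)                          # right operand
        self.popsd(self.tmp0)                          # left operand
        self.asm.UCOMISD(self.tmp0, self.tmp1)         # sets the CPU flags
        # map the Python comparison to the matching conditional jump
        self.jump_if_true = {"Lt": "JB", "LtE": "JBE"}[type(node.ops[0]).__name__]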
So basically then, you have the same thing for a while loop, which I'm not going to write live, because it doesn't make sense: it's really the very same logic, I can just copy and paste and implement it this way.
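The while loop reuses exactly the same machinery, just with a backwards jump at the end of the body.

    def visit_While(self, node):
        assert not node.orelse, "while/else is not supported"
        start_label = self.asm.new_label("while_start")
        body_label = self.asm.new_label("while_body")
        end_label = self.asm.new_label("while_end")
        self.asm.place(start_label)
        self.visit(node.test)                             # compare, record the jump kind
        getattr(self.asm, self.jump_if_true)(body_label)  # condition true:  run the body
        self.asm.JMP(end_label)                           # condition false: leave the loop
        self.asm.place(body_label)
        for child in node.body:
            self.visit(child)
        self.asm.JMP(start_label)                         # go back and test again
        self.asm.place(end_label)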
Okay, so now we have a JIT compiler which can do all these things: we can do whiles, we can do ifs, we can do assignments (this is the logic for the while, you see), we can do comparisons, et cetera. So we have a complete-enough system. I also want to add a decorator,
so I can write a nice jit.compile around a Python function. So, how do I implement it? Well, I implement the compile function here: it takes a function, I get the source code from inspect.getsource(fn),
then I instantiate an ASTCompiler, and I compile it to a compiled function. So, let's see if it works. It does.
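The decorator itself is tiny; a sketch (inspect.getsource also returns the decorator line, which is harmless here because the FunctionDef visitor simply ignores decorator_list).

    import inspect

    class jit:
        @staticmethod
        def compile(fn):
            src = inspect.getsource(fn)        # source of the decorated function
            return ASTCompiler(src).compile()  # returns a CompiledFunction

    @jit.compile
    def add(a, b):
        return a + b

    print(add(17.0, 25.0))   # 42.0, but now running machine code, not Python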
So basically now my JIT compiler is complete, and I can try it with something real. I have this benchmark: a function which computes, in a very inefficient way, the digits of pi, and I'm going to run it on top of CPython,
like normal, and then I am going to JIT-compile it using my compiler here, and then let's see if it's faster or not. Disclaimer: if it doesn't segfault.
Okay, so you see, it's more than ten times faster, basically because what happens is that we are compiling this, turning it into assembly, executing the assembly, and of course that's faster than CPython. Just for the sake of it, if we run it on top of PyPy,
it's even much faster, because PyPy is a real JIT compiler, while this is a toy JIT compiler which is not optimized for speed. And by the way, in a few minutes there will be another talk about the PyPy JIT compiler by Ronan, so if you are interested in seeing how PyPy manages to be ten times faster than this simple thing, well, you should listen to him.
So basically, that's it. Before ending, I want to challenge you, because we have this game: write your own JIT. This is the GitHub repo containing this code, so you are welcome to fork it,
to add new features, new statements, new operations, whatever you want, and send me a pull request. I won't accept the pull requests, because I need the toy language for the talks, but I will maintain a list of interesting pull requests to show people how to extend it. So I think I am done,
and if you have questions, thank you Antonio for the amazing talk, so any questions, please come to the microphones,
There is a question there. I don't think I understand the return statement; I'm not sure, does it actually use the return register, or does it just take the last thing that was on the stack and return it? So, the return statement is here,
and basically the calling convention of x86 for functions returning a double is that you need to put the return value in the xmm0 register and then execute the RET assembly instruction, which is this.
So what I'm doing is to visit the node containing the expression, so when I do return 100, 100 is the node.value in this example; I visit it, I compute the value, I put it on top of the stack, of the x86 stack, and then I put this value in the xmm0 register,
and then execute the RET. Where do you actually put it in the register? It's here, the popsd. So popsd is, well, a helper that I wrote to pop the value from the stack into a register. Okay, so
I have another question here. First, thank you very much for the impressive live coding session. There is no handling of overflow: if, instead of returning 100, you return like ten, I don't know, billions, it's going to segfault or something.
Yes, I mean, we are using floating point, so even in Python there is no handling of overflow. If you're using double variables, I mean float variables, they just overflow or go to inf or whatever. Okay, so the Num class actually converts everything to floating point? Yes, for simplicity we are using only floating-point variables here.
So, yes, I mean. Okay, any other questions? Well, we have time. Okay, fine. Since you're a PyPy developer,
my understanding was that you were working on this kind of code, yet you said that you never wrote assembly code before. So, what are you doing? I'm pretending to work on PyPy and then
the others do the real work. No, actually, PyPy is a complex project and it has various layers so what we do for JIT compiling is that you start from the Python code, you do many layers of things and then at some point you end up with an intermediate code which is
similar to assembler but generic. And then we have many backends, so from this intermediate code you can compile for x86, you can compile for ARM, you can compile for PowerPC, and this is the job of the backend. So when you write a new JIT backend you turn this intermediate representation
into real assembler, but I never worked on any backend for real CPUs. I worked on the backend for the .NET virtual machine years ago. And so most of my work in PyPy is above this level, so I do things which produce the intermediate representation, but then it's the job of the backend and I'm not an expert
on them, if you want. That's why I never had to write any real assembler, but of course the intermediate language is close enough that you still need to think about registers and variables and stacks and so on.
Okay, any more questions? Okay so then I guess thank you again. You're welcome.