How to get a JIT Compiler for Free
Formal Metadata
Title: How to get a JIT Compiler for Free
Author: Marr, Stefan
Number of Parts: 199
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/32540 (DOI)
Transcript: English (auto-generated)
00:00
Next talk, Stefan Marr, how to get a JIT compiler for free. Yeah, so I'm going to talk a bit about virtual machines, or actually just simple interpreters instead of just-in-time compilers. Because one of the big problems we have is that way too few people actually
00:20
work on the virtual machines they use. In open source projects in general you typically have way too few contributors, but for virtual machines you maybe have only one person; well, I think there are at least two people in the room who have been touching virtual machines. But there are way too few people
00:40
actually looking into how to get all that kind of stuff fast and how to make it work on all kinds of platforms. So I've been looking a bit into that and what you can do about it. And I've been doing that using a very simple Smalltalk, SOM, because it's easy to implement, you can implement it in all kinds of different variations, and it's actually a Smalltalk that has been used for teaching.
01:04
So down below you see a couple of the universities that have used it, and the main design constraints here were to have something really simple for teaching, something that's really showing the concepts in the ideal textbook way, which typically means it's not very fast. And by now there are interpreters implementing it in all kinds of different programming languages. But, as I said, one of the main drawbacks is that the whole thing is pretty damn slow. But it's easy to teach. As you can see here, on two
01:44
of pretty common object-oriented benchmarks originating from the Smalltalk world, Richards and DeltaBlue. They have been used for Smalltalks, JVMs, JavaScript and Python, and you see Python being something like 70x slower than Java. Java is of course well tuned, with hundreds of man-years spent on making Java fast, and our simple interpreter of course can't keep up with that. So what we want is of course something different. We want a simple interpreter that perhaps everybody in the community
02:24
can change in order to make certain things better. It has to be a simple implementation, it has to be conceptually clear, to really lower the burden. For instance, in Smalltalk you have your primitives; if you want people to be able to add new ones, or adapt or basically replace a mechanism by something like NativeBoost, then you need people that actually can change the virtual machine implementation. So we want all of that. But of course we also want that kind of performance, so at least that region, the same order of magnitude, would be nice. The green bar, now unfortunately cut off, is actually the Cog virtual machine
03:03
here, so on DeltaBlue barely 20% slower, and on Richards it's six, seven times slower. But at least not the 500x. So that's where we want to go, ideally with a kind of very simple interpreter. So how do we
03:22
get there? At the moment, just-in-time compilers, as I said, it's really just a few people doing that, so often it's just arcane magic. There's one person in the team or in the community doing that. We see that in the Cog virtual machine, that's basically Eliot Miranda. The Lua people have the guy behind LuaJIT, and I think there are other communities like Racket where it's very similar. And then the other option is basically to have a big, big team of engineers, the Java way, the IBM way, the Google way, the same way they built ART for instance,
04:03
just have the engineering resources to do that. None of those solutions is really sufficient, so let's just look a bit into what science came up with recently. Down below you see the references. The slides will be online later, so you can look it up. There are two interesting options. You might have heard of PyPy. They use infrastructure they nowadays call RPython, with a meta-tracing just-in-time compiler, which is a very interesting approach. And then very recently Oracle Labs open sourced a project called Truffle, and they have self-optimizing interpreters based on partial evaluation, with a whole just-in-time compiler toolchain to do those kinds of things. And I'm going to briefly look into both.
05:02
So what's SOM? Just to give you a brief impression: it's a simple Smalltalk that has been developed, I think around 2000, at the university in Aarhus for teaching. It has a text-based syntax to make it really simple to build the very first interpreter, and as you can see here, Object doesn't have a superclass, it has a primitive to get the class of an object, a pointer-equality primitive, and other stuff. Integer for instance has addition, subtraction and so on. And then of course we have things like Booleans in the system, and that's very much the Smalltalk tradition:
05:41
you really have polymorphic methods, using the standard ifTrue:ifFalse: to implement control flow, and if you look at the True class you see ifTrue: is actually implemented as you would implement it, very simply and straightforwardly, in an ideal Smalltalk implementation: by just evaluating the block, and in the false case just returning nil. So, a very straightforward, simple Smalltalk, typically based on an interpreter. And to explain how that works, let's look at the factorial. I don't have a shopping cart; my running example is the factorial,
06:22
to keep it even a little bit simpler, just to explain. I think everybody can read that here, I suppose. Can anybody not read Smalltalk? Okay, good. The first thing you notice is those two things here; that's actually one method invocation, or in Smalltalk a message send, and the receiver, so the object on which you invoke that, is here the evaluation result of that parenthesis. So we have a simple comparison whether n is zero or not, and if that's true then we evaluate that block, which is a closure, and return just 1 as the result. Otherwise we evaluate the other block and just recursively evaluate the factorial. So, nothing too fancy.
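For anyone who prefers code to Smalltalk syntax, the factorial just described can be transliterated roughly into Python, with the two blocks written as lambdas (a sketch; the names are mine, not SOM's):

```python
# Hypothetical transliteration of the Smalltalk factorial:
# (n = 0) ifTrue: [ 1 ] ifFalse: [ n * (n - 1) factorial ]
# The two blocks become closures; ifTrue:ifFalse: becomes a conditional.
def factorial(n):
    true_block = lambda: 1                       # [ 1 ]
    false_block = lambda: n * factorial(n - 1)   # [ n * (n - 1) factorial ]
    return true_block() if n == 0 else false_block()
```

In real Smalltalk the conditional is itself a polymorphic message send on the Boolean, as shown on the earlier slide; Python's `if` is used here only as a shortcut.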
07:21
Maybe if you can't read Smalltalk, you're better with reading something like bytecode? Oh yes, thank you. I don't know how I can shrink or move my slides; somehow I have to try
07:42
to change the resolution once more. Is it going to be a big problem with the recording? Well, let's do it. Can I do the animations here? No, probably not.
08:16
That kills it? I will reset it after that slide I think.
08:28
So the idea is, if you want to implement an interpreter, the classic thing to do is to build a set of bytecodes and then have a stack machine to interpret those bytecodes. So if you actually compile the factorial to a very, very simple bytecode set (in this simple Smalltalk I think we need something like 16 operations), some of them are pushing arguments of the function, pushing constants, doing the actual send, and then special things like pushing a block and
09:00
returning from a function. So if you think about how that could be executed for your example code: the first thing we do, if we invoke the factorial with a 4, is push the 4 on the stack. And then the second operation pushes the 0. And thereby we can now actually do the message send here for the comparison, and that, if we don't look into the details of how the comparison works, yields the result false on our stack. So, very simple stack operations: we use the stack to basically first build up the arguments for the message send and then put the result on the stack.
09:40
So afterwards, you can imagine, it's very much the same: to prepare that send here, we push the two blocks onto the stack and use them in the send, and eventually either the 1, which was the result of that block, or the actual factorial will end up on the top of the stack here, and that's what we then return. Okay, let's try to make the recorder happy again.
10:48
Better? So I basically showed you now how to interpret such a bytecode sequence based on the factorial program.
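Such a bytecode interpreter can be sketched in a few lines of Python. The opcode set and names below are illustrative inventions, not the actual SOM bytecodes, and a plain arithmetic opcode stands in for a real message send:

```python
# Minimal stack-machine sketch: a frame holds the method's arguments
# and an operand stack; interpret() loops over the bytecodes and
# dispatches on each opcode.
PUSH_ARG, PUSH_CONST, ADD, RETURN_TOP = range(4)

class Frame:
    def __init__(self, arguments):
        self.arguments = arguments
        self.stack = []

def interpret(bytecodes, frame):
    pc = 0
    while True:
        opcode, operand = bytecodes[pc]
        pc += 1
        if opcode == PUSH_ARG:        # push argument i onto the stack
            frame.stack.append(frame.arguments[operand])
        elif opcode == PUSH_CONST:    # push a literal constant
            frame.stack.append(operand)
        elif opcode == ADD:           # stand-in for a real message send
            right = frame.stack.pop()
            left = frame.stack.pop()
            frame.stack.append(left + right)
        elif opcode == RETURN_TOP:    # leave the method, return top of stack
            return frame.stack.pop()

# e.g. a tiny "method" computing arg0 + 1:
program = [(PUSH_ARG, 0), (PUSH_CONST, 1), (ADD, None), (RETURN_TOP, None)]
```

Invoking it as `interpret(program, Frame([4]))` walks exactly the push/push/send/return dance described above.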
11:01
The implementation of that is also more or less straightforward. We have it here in a kind of Python syntax, because that's actually more or less how the RPython interpreter is implemented, and you see here the interpret method: you get the method you want to interpret, you have a frame, that's basically the stack I showed you
11:23
on the side. And what we then do is basically loop over all the bytecodes, as long as we have bytecodes, or actually until there is a return bytecode. We fetch the bytecode and then, and that's a very standard thing, we switch over the bytecode and jump to the right thing to do. That's very classic, it's not very fast, and that's actually more or less how you end up 500 times slower than Java. So now here comes the meta-tracing of RPython
12:03
into play. That's what the PyPy people also use to implement their really fast Python implementation. The only difference that comes in here is that we tell the RPython infrastructure that that specific bytecode loop here can be the start of potentially hot loops, or that it's an interesting point to start to generate native code. So if RPython notices at such
12:44
a point, such a merge point for the just-in-time compiler, that a certain loop has been executed again and again, then it can assume: okay, now we can actually spend a bit of time to compile that into really fast native code, and do all our optimizations to generate that code. That's not always useful, so that's why we need to do it explicitly. But that's more or less all we actually need to do. So what does the tracing then actually mean?
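In RPython that hint is given through a JitDriver: the greens identify a position in the user program (the potential loop header), the reds name the remaining live state. The sketch below only mimics the shape of that annotation; the real rpython.rlib.jit module does its work during RPython translation, so the JitDriver here is a stand-in stub, and execute_bytecode is a toy:

```python
# Sketch of how the RPython meta-tracing hint attaches to the loop.
class JitDriver:                      # stand-in for rpython.rlib.jit.JitDriver
    def __init__(self, greens, reds):
        self.greens, self.reds = greens, reds
    def jit_merge_point(self, **live_variables):
        pass                          # the real hint marks a potential trace start

jitdriver = JitDriver(greens=['pc', 'method'], reds=['frame'])

def interpret(method, frame):
    pc = 0
    result = None
    while result is None:
        # Tell the tracing JIT: if this (pc, method) pair gets hot,
        # start recording a trace here and compile it to native code.
        jitdriver.jit_merge_point(pc=pc, method=method, frame=frame)
        pc, result = execute_bytecode(method, pc, frame)
    return result

def execute_bytecode(method, pc, frame):
    # Toy dispatch: method is a list of ('add', n) steps ending in ('ret',).
    op = method[pc]
    if op[0] == 'add':
        frame[0] += op[1]
        return pc + 1, None
    return pc, frame[0]               # 'ret': stop the loop, return the value
```

Under a real RPython translation the rest of the interpreter is unchanged; the hint alone is what makes the meta-tracing JIT applicable.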
13:22
Well, let's look at our factorial example again. What happens now is we start executing that method and we take basically the first bytecode. So what does that mean? If we are now in tracing mode, we actually ignore things like jumps and branches and so on, and instead really just record the important actions and certain assumptions. And I listed also all the assumptions; so really just recording the reading and writing of fields, because that's what's really interesting. So what happens is we actually need to go to the method object, get the bytecodes out,
14:03
yeah, we remember that in a variable, and then we go to the index we had and read out the bytecode and the arguments. And then, as I said, we are not really recording any kind of jumps, which would now happen here for the switch; we really just record the actions. But we know now that the first bytecode in the method was the push-argument bytecode. So what happens now, for the push-argument bytecode, is exactly that segment, and as you also might notice, we actually didn't note down any kind of method invocation, because that's all not
14:42
really relevant. We really just want to know what the concrete actions performed were. And then here again we see this frame get-argument: we go to the frame, go to the arguments area, get that argument's value out of there, and basically note all those operations down. Then we want to perform the push itself, so we need to go to the frame, we get the stack pointer out, we increase the stack pointer, we write the stack pointer back, and then we get the stack array of the frame, set the specific position in the stack to the value, and now we have actually pushed the argument. And if you now continue (first you of course need to increase the bytecode index, which we don't see on this slide), then we execute the next bytecode. And then it all happens all over again, at least a very similar sequence of operations.
15:42
And there you might already see that we actually do quite a lot of stuff again and again, and that's where the tracing, and the meta-tracing in particular, has a lot of benefits. If you for instance look a bit closer at the trace, you see here at the beginning, just accessing that bytecode array, we do at least twice, and at least once for every bytecode; now I have only two bytecodes executed, but methods, or typical execution traces, will be much, much longer. And then the other thing here is the access to the stack array, which should be happening somewhere down here as well.
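Those repeated reads are exactly what a simple trace optimizer can remove. A toy sketch of that kind of redundant-read elimination over a recorded trace (the operation format is invented for illustration; a real optimizer must also handle writes, guards, and aliasing):

```python
# Toy redundant-read elimination over a recorded trace. Each entry is
# ('get', target_temp, obj, field). After the first read of obj.field,
# later reads are rewritten to cheap copies of the earlier temporary.
def eliminate_redundant_reads(trace):
    known = {}        # (obj, field) -> temporary that already holds the value
    optimized = []
    for op in trace:
        kind, target, obj, field = op
        if kind == 'get':
            if (obj, field) in known:
                # same object, same field: reuse the earlier temporary
                optimized.append(('copy', target, known[(obj, field)]))
                continue
            known[(obj, field)] = target
        optimized.append(op)
    return optimized

trace = [
    ('get', 't1', 'method', 'bytecodes'),
    ('get', 't2', 'frame',  'stack_ptr'),
    ('get', 't3', 'method', 'bytecodes'),   # redundant: same read as t1
]
```

Running `eliminate_redundant_reads(trace)` turns the third read into a copy of `t1`, which a later pass can drop entirely by renaming.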
16:21
And we can really just remove those operations, because we put that here in a temporary variable already and it's really the same value, and a simple optimizer can see that. And then we see here the same kind of pattern again, increasing the stack pointer; we can also ideally optimize that. And the people around the RPython project did the necessary research, and they actually showed that about 90% of the operations in such a trace can on average be eliminated. I think it was between 40 and 97%, depending on the kind of benchmark they ran. And then you can ideally really reach peak performance for a specific problem by using that kind of tracing compilation. And I think the success of the PyPy project
17:20
shows that very much. So that's only one approach to implement interpreters. And I had to explain to you a stack machine and how to do all that, and that was already a bit hairy and not very straightforward, and it was also not very obvious how to actually compile that code to a bytecode set, because it's a linear representation of something which, if you would use a simple parser, probably would look more like a tree. So if you parse that function, then we have the ifTrue:ifFalse: as the main message, and the receiver on the left, and then the arguments,
18:01
the first block and the second block, in the tree structure. And actually implementing an interpreter for that is pretty straightforward as well, maybe even simpler. Imagine we have here such literals: they're really given in the code as constants, and we can just implement on those AST nodes an execute method that really just returns the value. And it's very similar for reading a variable. If you compile that code, at least in Smalltalk it's very simple to know at which position a variable should be, so we can just remember a certain index, or a name, for the variable and then just really read out the variable. Then we interpret that AST by calling the execute method. So that's really very straightforward, I would say even simpler than that stack machine.
19:02
And then we can even look at something more complex here: a message send. What does a message send do? We have a receiver and maybe an argument. So if you have a binary message send, what do we do? Well, we simply execute the receiver expression, because the receiver expression is going to know what it actually has to do to return us a value, and the same with the argument expression, we also just evaluate that. And here I didn't have to encode a stack explicitly; I can actually just reuse the stack of my implementation language and just have in my code here receiver and argument variables and so on, and I don't have to think about stack balance and what I have to push first, what I have to do. So implementing that kind of interpreter seems to be much more natural. And then you can go to the receiver, get its class, do a lookup with the selector, and then you get the method that needs to be invoked, and we just return the result here.
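The AST nodes just described can be sketched in Python like this; the class and method representations are simplified inventions, not SOM's actual object model:

```python
# AST-interpreter sketch: every node knows how to execute itself.
class Literal:
    def __init__(self, value):
        self.value = value
    def execute(self, frame):
        return self.value                     # constants evaluate to themselves

class VariableRead:
    def __init__(self, index):
        self.index = index                    # position fixed at compile time
    def execute(self, frame):
        return frame[self.index]

class BinarySend:
    def __init__(self, selector, receiver_expr, argument_expr):
        self.selector = selector
        self.receiver_expr = receiver_expr
        self.argument_expr = argument_expr
    def execute(self, frame):
        receiver = self.receiver_expr.execute(frame)   # implicit host stack,
        argument = self.argument_expr.execute(frame)   # no explicit operand stack
        method = lookup(get_class(receiver), self.selector)
        return method(receiver, argument)

def lookup(cls, selector):
    while cls is not None:                    # walk the inheritance chain
        if selector in cls.methods:
            return cls.methods[selector]
        cls = cls.superclass
    raise LookupError(selector)

def get_class(obj):                           # toy class lookup
    return Integer if isinstance(obj, int) else None

class Integer:                                # toy stand-in for SOM's Integer
    superclass = None
    methods = {'+': lambda rcvr, arg: rcvr + arg}
```

Evaluating `BinarySend('+', VariableRead(0), Literal(1)).execute([41])` recursively executes receiver and argument and then does the lookup, exactly the sequence described above.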
20:01
So at least from my perspective that seems to be a much more natural and simple way to implement an interpreter. And that's where Truffle comes in. The Oracle people actually came up with an idea to optimize those kinds of AST-based interpreters by, at run time, specializing those ASTs based on what actual values were observed. Taking that message send, one of the most important things here to specialize on is actually the lookup: the receiver and the method we eventually want to execute. Especially in Smalltalk, where you have to dynamically walk the inheritance chain to find the method that corresponds to the selector, that's pretty expensive. I'm only sketching how that works here, especially with the time constraints. The idea is really that you have that AST
21:03
tree here, and that's then specialized to an execution tree that has very specific nodes in there that really just perform the minimal operations, or go to a fallback node. One thing you can see here, indicated with colors: that variable read, for instance, became read-an-integer, to avoid the overhead of boxing on a JVM for instance. And that generic binary send converted into a cached send. There are a couple of extra nodes; so the class check is one, and then here's actually an inlinable send, because often you get much better performance if you inline a method, because then you can specialize it in the calling context. And sometimes you really do have polymorphic message sends, where one single cached node might not be enough, so you actually have a chain, and so you can have something like a polymorphic inline cache, which is usually tricky to implement in a virtual machine but very important, especially for Smalltalks. And here it's really a tree structure, or well, at least I think it's still a tree, of nodes, and you have a very high-level representation of all of that. And that builds very nice interpreters, and then you actually get pretty fast. And how fast? Is it just a toy, or is it really what we want?
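The node-rewriting idea, including the polymorphic inline cache chain, might be sketched like this in Python (all names are invented; Truffle itself does this in Java with compiler support):

```python
# Sketch of a self-specializing send with a polymorphic inline cache.
# On first execution the UninitializedSend does the slow lookup and
# chains a CachedSend entry in front of itself; later sends hit the
# class guard and skip the lookup entirely.
class CachedSend:
    def __init__(self, expected_class, method, next_node):
        self.expected_class = expected_class
        self.method = method
        self.next_node = next_node
    def send(self, receiver):
        if type(receiver) is self.expected_class:   # the guard
            return self.method(receiver)
        return self.next_node.send(receiver)        # try the next cache entry

class UninitializedSend:
    def __init__(self, selector, owner):
        self.selector = selector
        self.owner = owner        # the send site that holds the chain head
    def send(self, receiver):
        method = lookup(type(receiver), self.selector)   # slow path
        # specialize: chain a new cached entry in front of the current head
        self.owner.head = CachedSend(type(receiver), method, self.owner.head)
        return method(receiver)

class SendSite:
    def __init__(self, selector):
        self.head = UninitializedSend(selector, self)
    def send(self, receiver):
        return self.head.send(receiver)

def lookup(cls, selector):        # toy lookup via Python's own attribute search
    return getattr(cls, selector)
```

A second receiver class simply adds a second guard to the chain, which is the essence of a polymorphic inline cache.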
22:41
These are the results. What I have to say: at the moment, for the RPython version, so SOM running on top of RPython's meta-tracing, I had the support of Carl Friedrich Bolz, one of the PyPy guys, who actually made it that fast, and as you can see here we are barely 30% slower than Java. If you remember, the Cog virtual machine was 20% slower than Java, so we are still a little bit behind. On Richards we are at a factor of 10, which is more or less our goal, so just barely slower than Cog as well. The Truffle version here unfortunately still stumbles into some bugs here and there; just recently the Truffle guys actually, thankfully, fixed a couple of things for me. Non-local returns, one of the most important features in Smalltalk for the control structures, weren't supported all that well, and of course there were a lot of re-compilation problems. So I haven't been able to tune that as much yet, and it's also much more of a black-box toolchain to the implementer here, but I think we can get where we want.
24:02
So now, as a conclusion, the question was: can we have maintainable designs of implementations, and clarity, to basically get more people actually maintaining interpreters and virtual machines? And I think yes, that's totally possible. The nice techniques we have, either the meta-tracing or the self-optimizing interpreters, are a good foundation for that stuff. When it comes to performance, well, I would say at least it's plausible; it still has to be seen exactly, but I think it's also possible. Both approaches seem able to reach
24:42
that kind of performance. And I hope that's leading to me investing a bit of time to push those kinds of ideas into maybe the Cog virtual machine, together with the Pharo people. But that's for the future, so there is nothing happening yet. So, are there any questions?
25:21
The subject of my thesis is to make your own programming language, maybe an interpreter or a simple compiler or anything, but it has to be really designed into an ERP system. Are there any tips you can give? It probably depends on the ERP system, but if you really just want to implement a very simple interpreter, for instance to make a specific language... It has to be a scripting language, to actually catch errors from printers, from mails; so if a printer gets an error, then the client has to be able to write a script to actually say: yeah, those persons need to be mailed, and that mail needs to be sent.
26:23
Well, it depends a bit on what kind of infrastructure you're talking about. The Truffle stuff with the AST interpreters, which I would think is a very nice and simple idea that can also be used in different contexts, runs on Java, so you could just go to their open source page and look around a bit. There is not a lot of documentation, which is a bit of a problem, but there is a simple language implementation which shows a couple of concepts, so you can try to look it up; it's under the Graal umbrella of the OpenJDK. And otherwise you can also easily implement a simple AST interpreter on top of something like Pharo or Newspeak; that's very straightforward, so just grab your textbook, devise a grammar, implement an AST interpreter.
27:22
I don't know what more to advise. The techniques we have here are really about performance, and I don't know whether that's a critical point in your thesis. If so, then... Due to time constraints I would like you to take this offline and talk later. Thank you very much.