
Deoptimizing Ruby


Formal Metadata

Title
Deoptimizing Ruby
Number of Parts
65
Author
Seaton, Chris
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Ruby is notoriously difficult to optimize, but there is a solution: deoptimization. This means jumping from compiled code back to an interpreter, and it allows Ruby implementations to take shortcuts, make guesses and pretend Ruby is simpler than it is, but at the same time still be ready to handle the full language if it’s needed. You’ve seen talks about what makes Ruby slow: monkey patching, bindings, set_trace_func, object space & so on. We’ll show how with a Ruby implementation using deoptimization, such as JRuby+Truffle, you can use these features without any runtime overhead at all.
Transcript: English (auto-generated)
Hi, thanks so much for coming. My name is Chris Seaton and I'm a PhD student at the University of Manchester in the UK.
I work part-time for Oracle Labs. I'm going to talk today about deoptimizing Ruby, and about how deoptimization is the antidote to JIT compilation. Oracle wants you to know that this is just research we're doing here. This isn't a product announcement, so you shouldn't buy any Oracle stock or any product based on what we're saying here today.
It is just research. I've written a blog post which covers all the background to this, so if you want more technical depth, there's a blog post which explains everything. I'm going to be making some performance claims about what our system can do, and the blog post provides the really detailed scientific explanation of how we ran those benchmarks
and how to reproduce them for yourselves to verify our claims. So, I'm working on a new system called JRuby+Truffle. It's a new open source implementation of Ruby by Oracle Labs, as a research project. We JIT using next-generation JVM technology and partial evaluation.
It's now part of JRuby. It started off as an independent implementation, but the JRuby community have been very welcoming to us and we've now merged our implementation into JRuby. The next-generation JVM technology we talk about is a new JIT compiler available in the JVM, called Graal. Effectively, the HotSpot JIT compiler has been rewritten in Java.
That means we can use it as a library and control it directly from our Ruby implementation, rather than having to emit bytecode and hope the JVM does the right thing with that bytecode. You may have heard of different sorts of JIT compilers: method JITs, such as those in Rubinius and JRuby, and tracing JITs, which is what things like PyPy and Topaz, the implementation of Ruby similar to PyPy, have used in the past.
We're using a different technique called partial evaluation. Tom Stuart gave a talk on partial evaluation last year, so if you're interested in the technique we're using to compile Ruby, you can go and watch that talk.
But I'm not going to talk about partial evaluation or the JVM or how our techniques work today. What I'm going to talk about instead is: why is Ruby hard to optimize, and what is the one thing we need to do to make it easier? So why is Ruby hard to optimize? You may have heard lots of people talk about the different features of Ruby that make it a tricky language to work with. They're good features, because they make it easy to program, and that's what we care about more.
But for the few people who implement Ruby, why are they tricky features? They're probably things you've seen talked about many times before. When we started implementing Truffle, we looked at a blog post by Charles Nutter on what you need to implement in Ruby before you can start making performance claims. He listed a lot of the things here, and he also linked to a lot of other people's blog posts about what's hard about Ruby.
So, Fixnum to Bignum promotion. When you have an integer and it gets very large, Ruby will automatically start using Bignum for you. But that means you have to do a check every time you do some arithmetic, to see whether the number overflowed. Those sorts of checks quickly add up to be really, really expensive, and they start to swamp your actual computation.
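For example (a minimal illustration; the exact Fixnum range depends on the platform, and Ruby 2.4 later unified both classes into Integer, but the promotion still happens internally):

    a = 2 ** 61   # fits in a tagged machine word on 64-bit MRI
    b = a + a     # overflows the Fixnum range, so Ruby silently promotes it

    puts a.class  # => Fixnum (Integer on Ruby 2.4+)
    puts b.class  # => Bignum (Integer on Ruby 2.4+)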
And this is a theme throughout all these features: having to check something. Monkey patching methods. If you can monkey patch a method at any time, that means you always need to check that the method hasn't been redefined. So imagine that for each bit of work you're doing, you also have to do some check that something hasn't been redefined, or that you haven't overflowed.
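For instance (a hypothetical class, just to show the patch can land at any time):

    class Calculator
      def add(a, b)
        a + b
      end
    end

    calc = Calculator.new
    calc.add(1, 2)    # => 3

    # Any code, at any point in the program, can redefine the method
    class Calculator
      def add(a, b)
        'patched!'
      end
    end

    calc.add(1, 2)    # => "patched!"  (every call site has to notice)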
And quickly, the checks become more work than the real work you're doing on the processor. Binding. So, Kernel#binding and Proc#binding. These are frequently listed as a couple of the hardest things to implement in Ruby. Kernel#binding allows you to get an object that represents all your local variables, wrapped up
so you can manipulate them like a Ruby object. And Proc#binding allows you to do the same for a proc or a block or a lambda you've already got. They make Ruby very tricky to optimize, because they mean an implementation of Ruby can't store its local variables on the stack like C would: you're always able to get them and modify them as a Ruby object at any time.
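As a quick illustration (Binding#local_variable_get and #local_variable_set are Ruby 2.1+; eval on the binding works on older versions):

    def example
      a = 14
      b = binding                   # wraps the local variables in an object
      a = 15                        # the binding sees later updates...
      b.local_variable_set(:a, 16)  # ...and can write them back
      a                             # => 16
    end

    puts example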
ObjectSpace is listed as one of the things which makes Ruby tricky to optimize, because it allows you to get at any live object at any time. That means you can't easily remove objects and pretend they don't exist, because you always have to be able to produce a list of them for real. In JRuby, for example, ObjectSpace is disabled by default because of the cost of implementing it.
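For example (MRI supports this out of the box; JRuby needs ObjectSpace explicitly enabled):

    s = 'find me'

    # Enumerate every live String in the heap. The implementation has to be
    # able to produce each object for real, so none can be optimized away.
    puts ObjectSpace.each_object(String).count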
set_trace_func, again, is disabled by default or simply not supported in some implementations of Ruby. It allows you to install a function to be called on every line and every method call in Ruby. And again, that means you have to have a check every time, because you need to check whether there is a trace function installed. So again, these checks become more work than the work you're actually doing.
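A small example of what has to work at any moment:

    # From now on, this proc is called back for every 'line', 'call',
    # 'return' and similar event in the program.
    set_trace_func proc { |event, file, line, id, binding, klass|
      puts "#{event} #{file}:#{line} #{klass}##{id}"
    }

    def add(a, b)
      a + b
    end

    add(1, 2)            # emits call/line/return events for this invocation
    set_trace_func(nil)  # uninstall the hook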
And Thread#raise allows you to send an exception from one thread to another. Again, the receiving thread needs to keep checking: have I got something I should be raising in this thread?
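For instance:

    t = Thread.new do
      loop { }                     # a tight loop with no explicit checks
    end

    sleep 0.1
    t.raise(RuntimeError, 'stop')  # delivered asynchronously into the loop
    begin
      t.join
    rescue RuntimeError
      puts 'thread stopped'
    end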
So those checks are the problem we need to get rid of, and deoptimization elegantly solves all of them. When we talk about JIT compilers, we talk about optimizing code: we go from a slow interpreter and we want to jump into fast, JIT-compiled code. That's how JRuby works, that's how Rubinius works, that's how Topaz works. All these implementations go from a slow interpreter to fast, JIT-compiled code. Deoptimization is going the other way: from your fast, JIT-compiled code into your slow, interpreted code. And the idea is, if we can deoptimize at any point, we can make the compiled code forget about the checks.
We can make the compiled code not allocate objects, as long as we can rely on the interpreter to have the objects there for real when they're needed. So you're going the other way to JIT compilation, which is why we talk about deoptimization as the antidote to JIT compilation. Before we talk about this at a more technical level, let me provide a bit of an illustration.
I'll use Alice in Wonderland as an analogy here. In Alice in Wonderland, Alice goes down a rabbit hole chasing after the White Rabbit. The White Rabbit disappears into a room. When she enters the room, all she can see are tiny little doors. She wants to follow the rabbit to see where it's going, and when she bends down she can open these doors and see a beautiful garden on the other side.
However, she's far too large to fit through. Alice is a tall person, it's a tiny little rabbit, and the door is designed for the rabbit. But thankfully, on a table she finds a bottle of medicine labeled "drink me". She very courageously drinks the bottle without knowing what it will do.
And she shuts up like a telescope and suddenly shrinks down small. Now she can go through the door into the beautiful garden. But there's a problem: the door is locked, and she's left the key on the table. So now she's not able to get through, and because she's shrunk herself, she can't reach the key anymore. Thankfully, she also finds a tiny little bit of cake,
labeled "eat me". She eats the cake and she grows again, and now she's able to get the key. Her problems continue from there, but that's where we'll stop the analogy. So, relating that to deoptimization: Ruby is Alice. She'd like to go through the little door into the garden, the utopia of high performance.
But she's simply far too big. She does too much work. She's got all these checks. She always needs to have these objects available, to really compare them and use them as real Ruby objects. The just-in-time compiler is the bottle labeled "drink me": if she can drink that, she can shrink down as small as possible and then, hopefully, get through the door.
But when you do that in a conventional implementation of Ruby, you leave something behind. You leave behind set_trace_func. You leave behind ObjectSpace. And in reality, most implementations can't get that small anyway, because they've got no good way of shrinking away those checks. Deoptimization is the solution, in that it reverses the effects of the JIT.
Deoptimization is like the cake which restores you to your original size. And the idea is, if you can drink the bottle of medicine at any point, and you can eat the cake whenever you like, then we can make our compiled code much simpler, because we can always go back to our full implementation of Ruby whenever we need to.
So what does deoptimization do for Ruby? Let's talk about Fixnum to Bignum promotion. As we said, the problem is checks. We want to get rid of as many checks as possible. We want to make our code smaller; we want to shrink the code. Drinking this medicine of JIT compilation is about shrinking it.
Let's consider an example to start with: a + b + c. We'll assume we know these are Fixnums to start with, perhaps because we worked it out through some sort of type specialization, or perhaps there's some sort of future implementation of Ruby where they're annotated as Fixnums. This is the code we need, in pseudocode, to add these numbers together. First of all, we add together a and b as Fixnums.
But they may have overflowed, so we need to check: did that overflow? If it did, we need to redo the calculation as a Bignum, and then continue with the next calculation, the + c, as a Bignum. If it didn't overflow, we continue as a Fixnum, but we need to check for overflow on the + c as well.
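In Ruby-flavoured pseudocode (fixnum_add, bignum_add and overflowed? are hypothetical primitives, standing in for what the JIT would emit):

    # a + b + c with every overflow case handled inline
    t = fixnum_add(a, b)
    if overflowed?(t)
      t = bignum_add(a, b)           # redo as Bignum
      result = bignum_add(t, c)      # and stay on the Bignum path
    else
      result = fixnum_add(t, c)
      if overflowed?(result)
        result = bignum_add(t, c)    # same dance for the second addition
      end
    end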
Now, that's a huge amount of code, and it's got branches in it. Processors don't like branches: a branch will often destroy any kind of pipelining you've got in a processor, even if you can predict it fairly well. And remember, in a language like Go or a language like C, the code you get when you add two numbers together is pretty much one machine instruction: add.
Here we've got potentially hundreds of machine instructions, and this is why Fixnum to Bignum promotion makes arithmetic really slow. So the solution, using dynamic deoptimization, is to say: we'll add the numbers together, we'll check if they overflowed, and if they did, we'll just forget it.
Everything stops. We're not going to bother trying to handle that case in compiled code. What we do instead is deoptimize. If we get an overflow, we jump back into the interpreter, which implements all of the logic, but in our JIT-compiled code we implement only this: do the a + b as a Fixnum; if it overflowed, deoptimize. Add c onto it; if it overflowed, deoptimize.
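The compiled fast path shrinks to this (same hypothetical primitives; deoptimize! stands for the jump back into the interpreter):

    # a + b + c in compiled code, with deoptimization
    t = fixnum_add(a, b)
    deoptimize! if overflowed?(t)       # rare: bail out to the interpreter

    result = fixnum_add(t, c)
    deoptimize! if overflowed?(result)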
This means you don't have any branches, or you've got one very simple branch in each case with no code in it, and you've shrunk your code. Literally, you've reduced the amount of code you have, and less code is pretty much always faster code. You've got a finite amount of space for instructions in a processor, and it's a valuable resource; this lets us use less of it.
So that's fantastic if your numbers don't overflow. If they do, isn't it really expensive? Deoptimization sounds like a really complicated thing, and it is, and it certainly isn't free. It takes quite a long time to deoptimize, on the order of many, many nanoseconds. So what happens if you frequently overflow,
if you've got some code path where the numbers always overflow? Well, after we've deoptimized back to the interpreter, the next time we compile, we can compile slightly different code. So if a + b frequently overflows, or has ever overflowed, we can recompile with the branch that handles that case put back in. The other addition, which didn't overflow, we keep as a deoptimize.
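So the recompiled code might look like this (same hypothetical pseudocode):

    # After deoptimizing once on the first addition, the recompiled code
    # includes the Bignum branch only where it has actually been needed
    t = fixnum_add(a, b)
    if overflowed?(t)
      t = bignum_add(a, b)               # this overflow has happened before
      result = bignum_add(t, c)
    else
      result = fixnum_add(t, c)
      deoptimize! if overflowed?(result) # this one never has
    end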
So we can gradually make your code more and more open to the dynamic aspects of Ruby as it runs, but we only add them as we need them. We call this specialization: your program specializes to the code it needs, and each time we can deoptimize, go back to the interpreter,
and when we compile again, we include the little bit of extra information we need. We're getting bigger very gradually, rather than paying for it all in the first place. Monkey patching methods. Whenever we have something like my_object.my_method(x, y), we need to have a check. Well, the simplest implementations of Ruby, which have no cache or anything like that,
don't need a check, because they simply do the lookup every time. So the simplest case is: look up my_method in my_object, and then call it with x, y. But any non-trivial implementation of Ruby these days will do: has the class changed? If it has, look up the method again; if it hasn't, use a cached version of the method, and call it with x, y.
What that means is that each time, we need to check: has the class been changed at all? Has anyone monkey patched the methods on it? And again, this becomes a check, so the tiny bit of code we're really interested in, the call itself, has become swamped by all this bookkeeping. We can improve on that: like we did with the overflow checks,
we can say: if the class has changed, we stop everything and we deoptimize. We jump back into the interpreter and handle it there. If it hasn't changed, we continue and use the cached method. Now, we can actually do one better here than with the overflow checks. We can remove this check entirely.
What we can do is say: use the cached method and call it with x, y, without any checks whatsoever. So we've reduced all the dynamic programming, metaprogramming, monkey patching features of Ruby here to simply: use the cached method and call it. So how do we support deoptimization here?
One of the things we can do with deoptimization, going back to the analogy with the cake, is force someone to eat cake from afar. If Alice has shrunk herself, and you realize she's shrunk herself too far and you want to stop her, you can interrupt whatever she's doing, force-feed her cake, and she'll grow again. So what we do here is, if somebody
monkey patches a method, they force everyone else to eat cake and deoptimize. So if your compiled code is still running, the methods you're using cannot have been redefined, because if they had been, someone would have forced you to eat cake, and you'd have gone back to the interpreter code which does the full checks.
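In the same style of pseudocode (cached_method, cache_version and deoptimize_all_callers! are hypothetical names):

    # Conventional inline cache: a guard on every single call
    if my_object.class.cache_version == expected_version
      cached_method.call(my_object, x, y)
    else
      deoptimize!
    end

    # With remote deoptimization, the compiled call site is just:
    cached_method.call(my_object, x, y)

    # ...because whoever monkey patches pays the cost instead:
    def monkey_patch(klass, name, body)
      deoptimize_all_callers!(klass, name)  # force everyone to eat cake
      klass.send(:define_method, name, &body)
    end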
But the compiled code you use on the fast path can be nice and small, nice and fast, like that. Binding. The problem with binding, we said, is that it allows you to get access to your local variables at any point. So consider some simple Ruby code with a couple of local variables:
a = 14, and b is the array [8.2, 3.4]. The way almost all Ruby implementations represent this is with a stack, the program stack, with your local variables a and b on it. a points to a Fixnum. We may be able to do a little optimization with something like tagged integers to simplify that, but conceptually it's pointing at an object, 14.
b points to an array, which points to other numbers as well. Now, this is really inefficient, because to allocate these objects we have to call malloc, and calling malloc is thousands of times slower than just using registers. It's really, really slow. So what we'd like to do, in a really performant implementation of Ruby, is put those values directly on the stack.
But the problem is, if they're on the stack, how can you get a binding of them? If we have everything allocated on the stack and we call Kernel#binding, or we've got a proc that's been stored somewhere else and we call Proc#binding on it, how do we get access to those values? Well, with deoptimization we can recreate the interpreter state,
so we can recreate all those objects. What we do is take the call stack, with the local variable values on it, pull the values off the stack, and put them back into real objects. Now, that sounds like magic, being able to access your own stack and take values off it which have been optimized away, but remember: we're writing the compiler here,
so the only one who puts anything on the stack is us. We know where everything on the stack is, so we can pull it off whenever we want. This is part of deoptimization: in the compiled case we use the call stack, and in the interpreter case we use the full objects. When we deoptimize, as well as jumping into the interpreter, we take all the values off the stack and put them back into their objects for real.
This means we can support Kernel#binding effortlessly: we can use C-style local variables on the stack for all our Ruby local variables, but at any point we can take them off the stack and use them as objects again. And anyone else who was using those local variables on the stack will be deoptimized as well, and will start using the heap objects instead.
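A sketch of what this enables (this runs on any Ruby; the point is what a deoptimizing JIT can do underneath it):

    def hot_method
      total = 0
      1_000_000.times { |i| total += i }  # 'total' can live in a register...
      binding                             # ...until someone asks for a binding
    end

    b = hot_method
    # Requesting the binding forces the locals back into real objects:
    puts b.local_variable_get(:total)     # => 499999500000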
There are a few other techniques using deoptimization which relate to the other examples. Something like ObjectSpace: the problem, normally, is that you can't store your Ruby local variables on the stack, because you need to be able to get access to them as objects.
Like with Proc#binding, we solve it in exactly the same way. We deoptimize everyone: we tell everyone, whatever local variables you've stored on the stack, put them back into real objects. Then you can enumerate all those objects, and there's nothing stopping you using the stack for local variables, because you can always put everything back into objects later on.
set_trace_func we can implement very much like monkey patching. On every single line there's a call to the trace function, but we make it a method which does nothing, and we inline it. Then, instead of checking all the time whether someone has installed a new trace function, we deoptimize everyone when a set_trace_func is installed. So to install a set_trace_func,
you tell everyone to stop; you make them all eat cake; they all become large again, and they all do the explicit checks for a trace function. This means features like ObjectSpace and set_trace_func don't have any overhead. You've been told throughout your Ruby careers that monkey patching has an overhead, checking for overflow has an overhead, set_trace_func has an overhead, ObjectSpace has an overhead,
and they simply don't. A check for overflow has one tiny overhead, which is the jump on overflow, but apart from that, these features have literally no overhead whatsoever. In our implementation, JRuby+Truffle, we have them enabled all the time, and if you switch them off, it makes no difference to the machine code generated at all.
So how do we support deoptimization? Well, there are only three things we need to be able to do: recreate the interpreter stack frame; jump from the JIT-compiled code back into the interpreter; and be able to force other threads to do this. Now, the first two are pretty simple.
Recreating the interpreter stack frame: as we said, each thread knows what's on its stack, because it put it there. The compiler knows which variables it put into which stack locations, so we can simply tell it: go through your listing of what you put where, and put it back. Jumping from the JIT-compiled code into the interpreter, again, is pretty simple; it's not much more than a goto. It goes from the compiled code to the interpreter code;
the stack's already there, and you can keep executing. But I'll explain how we force other threads to do this in a bit more detail, because it's pretty interesting, and it's pretty unique in modern VM implementations. Say you've got this code, a tight inner loop:
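    loop do
      a = 14
      b = 2
      a + b
    end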
We'd like to compile that so it has no checks on the add method; we'd like it to keep running without any of those checks. What some implementations of Ruby do at the moment is check a flag every time around the loop: should I deoptimize at this point? So if you want to redefine add, or raise an exception in this thread, they have to keep checking this flag again and again. And checking that flag turns the loop from something which just uses local variables, and probably just registers, into something which accesses shared memory. That has implications for concurrency and for your caches; suddenly it's a shared-memory operation where it was really simple before. So we could implement it like that: deoptimize if someone set this flag.
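A sketch of that polling approach ($deoptimization_requested and deoptimize! are hypothetical names):

    loop do
      deoptimize! if $deoptimization_requested  # shared-memory read, every pass
      a = 14
      b = 2
      a + b
    end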
What we actually do in the JVM is use something called a safepoint, and this is a fantastic trick. On each loop iteration we read a location in memory. That's all we do: we simply read it. When we want a thread to deoptimize,
what we do is change the permissions on that page of memory, to cause a segfault. So effectively, we cause your thread to crash, and from that point we can reconstruct the state and keep going. So when you redefine a method, or when you use Thread#raise, what we do is cause your thread to crash, with something very much like a null pointer exception.
And that's really simple to do: it's one instruction, a read. These safepoints, as we call them, are already in your code, because they're necessary for GC. You may have heard about garbage collectors doing stop-the-world collections; well, to stop all the threads, you need to tell them to stop somehow. So we have these safepoints, which allow us to do that
by changing these page permissions. So we've already got these safepoints in, and this is why this can be zero overhead: because we already have the safepoints to support GC, we can reuse them for method redefinition, set_trace_func, ObjectSpace, things like that. So, all modern implementations of Ruby,
apart from MRI, use deoptimization to some extent. But JRuby+Truffle uses deoptimization more pervasively and more aggressively than any other implementation of Ruby, and actually more aggressively than a lot of implementations of other languages, such as V8
and SpiderMonkey and systems like that. So, we're presenting an implementation of Ruby here; how do you know I don't just have a toy implementation which doesn't really implement Ruby? That's a really legitimate concern. I know in the past we've had some questions raised about other implementations of Ruby and how much of Ruby they actually implement when they start talking about performance. So I'm going to take a couple of minutes to convince you
that we have a real implementation of Ruby here. First of all, we pass 86% of the RubySpec language specs, provided by the RubySpec project; we're very grateful to them for providing that code. So we support the vast majority of the general language features, and we're closing the gap to 100% pretty fast.
We don't support the same level of specs on the core library, but that's just a case of implementing more of the libraries; it's not a case of new language features. If there's a language feature which someone thinks will impact our performance and we haven't implemented it yet, we'd love to know about it, and we'll stop everything and implement that language feature right away.
We've used things such as Charles Nutter's list of difficult things to implement before you start talking about performance, and we've gone through them systematically. If anyone has anything else they think we should implement before we talk about performance, we will do it. We implement all these difficult parts of Ruby: method invalidation, dynamic programming such as send and method_missing,
binding, threads, initial support for C extensions, frame-local variables, ObjectSpace, regular expressions, encodings, eval, concurrency, debugging, closures, promotion, all those things. And some of those we actually implement better than the other implementations of Ruby do. As I said, our ObjectSpace is always on, our set_trace_func is always on,
and we have a debugger which is always on. You can always stop our programs, introspect them using a shell, and then continue running at the same performance. Our support for threads is a little bit limited, in that it only provides concurrency at the moment: we have a global interpreter lock. We recently took on a PhD student to work full-time on that, so we hope to have some progress soon.
Our support for C extensions is also pretty minimal at the moment. We're starting to look at making it more featureful, but we run them much, much faster than any other implementation of Ruby. We actually interpret the C code: when you have a C extension, we interpret it using the same system we use to interpret Ruby.
The big question you're going to ask: no, we don't run Rails yet, and we're probably not going to run Rails for another year or so. So no one here is going to be able to use JRuby+Truffle for anything real at the moment, and we're not proposing anyone should try. It's a research project; that's all it is. The rest of the JRuby community are working on other implementation techniques for Ruby
which are relevant to what you need today. This is a research project, and we don't run Rails; however, we are working towards it. Do I think we'll actually provide any performance increase for Rails? Well, let's think about what Rails does and why it's slow. It does lots of dynamic programming: it uses method_missing, it uses send, respond_to? and things like that,
and it also uses lots of little objects all the time. Well, JRuby+Truffle is actually awesome at exactly that sort of problem. If you have lots of little objects, we can allocate them on the stack instead of on the heap, because we can always go back to the heap if we need them. JRuby+Truffle can inline through method_missing, through send, through respond_to?, because it can always deoptimize and go back to the full interpreter if it needs to.
So actually, I think there's a lot of promise for Rails. Rails isn't something that might trip us up; I think Rails is something where we'll be surprisingly effective. So let's talk about performance. As I said, the blog post backs this stuff up; you can recreate our experiments if you'd like to, and please do. First of all, we can compare against some classic benchmarks.
These are the benchmarks that language implementers and computer scientists use, so they're very well understood, and the reason we use them is that everyone in our community understands them. But they're not representative of real Ruby programs; we'll start with them anyway. On something that's highly numerical and highly computation-intensive, such as fannkuch-redux or mandelbrot
(mandelbrot is a fractal generator), we can be up to 35 times faster than MRI, while JRuby and Rubinius are only about twice as fast on that sort of highly computation-intensive code. I also show on this graph Topaz, which is the PyPy-based implementation of Ruby that was started a couple of years ago. It's gone quiet now; I don't think there's been any recent development on it,
but we can compare against Topaz as well. Having Topaz is really useful, because it shows that our numbers aren't coming out of nowhere: Topaz achieves something vaguely similar on some benchmarks. On others we don't do any better than anyone else, such as binary-trees and pidigits. Those are very much memory-bound benchmarks,
and we can't magically allocate memory faster than anyone else, so we're limited on those. Those are synthetic benchmarks; as I said, they're not really representative of real Ruby code. So let's talk about some real code. We've been using two Ruby gems over the last few months: chunky_png and psd.rb. chunky_png is an implementation of the PNG file format in pure Ruby;
psd.rb is an implementation of the Photoshop file format in pure Ruby. This is production code, written to make money for a business; it's real Ruby code that people are using in production today. We're very grateful to the authors of these gems for writing this code in the first place.
On these real-world gems we took their kernels, and on average we are 10 times faster than MRI, while JRuby and Rubinius are no faster than MRI on them. That's because this is highly computation-intensive work: it's about decoding images and applying filters to images.
If you call malloc in one of these benchmarks, to allocate something like a temporary array, you've already lost. You need deoptimization to be able to remove those allocations; you need deoptimization to have highly efficient implementations of the basic operators like add and multiply. If you don't have those, you're never going to be any faster than MRI.
And again, we're also a lot faster here than Topaz, because our ability to deoptimize is well beyond what Topaz can do at the moment. I'll pick out a few benchmarks to show where this 10x figure comes from. On some of them we can be up to two orders of magnitude faster. chunky_png's compose operation
takes one image and composes it on top of another. If there's no filter, it's effectively doing a memory copy, and if you've got deoptimization, you can get rid of all the temporary intermediate arrays and hashes and method calls and the monkey patching checks that go into it, and you end up with pretty much the same thing you'd get if you wrote memcpy in C.
clamp is a really good example. It clamps a value between two extremes, so sort of a min/max. In this library they implemented it by creating an array of the values, sorting it, and taking the middle value. With deoptimization we can allocate that temporary array on the stack, and then trace the values straight through, as registers, all the way to where they're used.
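Roughly the pattern in the gem (paraphrased, not the exact source):

    def clamp(value, min, max)
      [min, value, max].sort[1]  # allocates and sorts a three-element array
    end

With deoptimization, that array never has to exist on the heap: the three values can flow through as registers, because the real array can always be rematerialized if anyone asks for it.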
This is the sort of thing Rails does: it allocates little tiny arrays and little tiny strings and keeps using them. If we can optimize those allocations away, we can do much better than other implementations can. So, I think that deoptimization is the single most important
optimization there is for Ruby. If you apply deoptimization, you can get rid of everything that makes Ruby slow: you can get rid of the overflow checks, you can get rid of the monkey patching checks, you can get rid of the temporary data structures which clog everything up. Any implementation which wants to be a high-performance version of Ruby, and Max is talking about making a JIT for MRI,
if you want it to be high performance, then you need to look at using deoptimization; deoptimization is the key optimization to make. It's not simple to implement, though. We're only able to use it because we've got the massive power of the JVM behind us; implementing it in a simpler implementation is, I understand, much less tractable. JRuby+Truffle builds on
a lot of other projects, so we're very grateful to JRuby for taking us into their repository. We use their parser, their regular expression implementation, their string implementation, everything like that. Of course, we depend on Ruby itself for the definition of the language, and we're starting to use the core library implementation from Rubinius. So JRuby+Truffle is taking what's best from the other implementations and bringing it together with a really high-performance
JIT core. The other thing Truffle depends on is the team who make the Graal compiler at Oracle Labs and at JKU in Linz, so there are a lot more people who work on this than just me. That's my Twitter handle; as I say, if you want more information about this, I've written a really detailed blog post. So if I've skipped over anything today,
or given you a simplified summary of what we're doing, all the hard details are there, and you can rerun the benchmarks and look at the performance for yourself. And as I said, it is just a research project; you won't be able to buy it anytime soon. Thanks very much. Does anyone have any good questions?