Stack allocation in the Hotspot C2 JIT compiler
Formal metadata
Title: Stack allocation in the Hotspot C2 JIT compiler
Number of parts: 490
License: CC Attribution 2.0 Belgium: You may use, modify, and copy the work or its content, distribute it, and make it publicly accessible in unmodified or modified form for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
Identifiers: 10.5446/46906 (DOI)
Transcript: English (auto-generated)
00:05
Hi everyone, this talk is gonna be a compiler talk about the C2 compiler in OpenJDK, specifically a new optimization that we worked on at Microsoft, surprisingly, so-called stack allocation.
00:21
In this particular talk, I'll start with a quick introduction to what currently exists in C2, extend with the work we did in our engineering group, then show some results on popular benchmarks, and hopefully finish off with the stuff we still need to do.
00:40
So it's still very much a work in progress; we haven't submitted a patch to the OpenJDK mailing list, the compiler list, yet. And just a warning: it may be boring for a bunch of people, there's gonna be a bunch of compiler jargon here, I can't work around that, unfortunately. So, stack allocation. The main motivation was to alleviate
01:01
some of the GC pressure. It was originally brought to us — we started working on this — by Kirk Pepperdine, some of you may know him, he's big on GC tuning in the Java world. As we were looking at some Spark workloads to improve them, he would often say, these allocations give the GC seizures
01:22
and so on, so it's really bad, you guys should try to fix this. So how we're actually gonna try doing this is by saying we're gonna eliminate the allocation — but not quite: we're gonna allocate on the stack frame rather than on the heap. It's a known compiler optimization, it's just somewhat surprisingly not actually been done in C2 yet.
01:46
So when can we do this? It's when the object does not escape the current method context. Typically objects escape through returns, calls to methods, or by being stored into fields and passed around, so there are places where we can do it,
02:02
places where we can't. So here's an example. We have a very simple Java program here with three allocations. We box two integers, and then we finally return another one. Integer is an immutable class, so the two additions create a new object, and that's popped and returned. So the first two objects do not escape the method,
02:25
but the last one does. And the way we tell which ones escape and which ones don't is by this compiler backend pass called escape analysis. So a bit more about escape analysis. Now, escape analysis was introduced by this paper.
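The boxing example just described might be reconstructed roughly like this (a hypothetical sketch; the class and method names are invented for illustration):

```java
// Three allocations via autoboxing: the two boxed parameters never
// escape sum(), but the boxed result escapes through the return.
class BoxSum {
    static Integer sum(int x, int y) {
        Integer a = x;   // boxed: does not escape the method
        Integer b = y;   // boxed: does not escape the method
        return a + b;    // unboxed, re-boxed: this Integer escapes via return
    }
}
```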
02:43
It was introduced by IBM TJ Watson Research a while back, and essentially the paper described two different kinds of analyses, a flow-sensitive one and a flow-insensitive one. The one that's implemented in C2 is the flow-insensitive version, and it's the right choice, actually.
03:02
The paper describes both of them, but it actually shows that they do similarly well, and the flow-insensitive one is much easier to implement and maintain, and less memory-intensive. Currently it's used in C2 for the following two purposes. We use it for monitor elimination, which means if objects are proven
03:21
to not be escaping, they can only be seen by one thread. So we can eliminate the monitorenter and monitorexit operations on these objects — no synchronization on them. Say you're using a StringBuffer instead of a StringBuilder: well, this will help. The second one, which was interesting for us, is scalar replacement.
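The StringBuffer case just mentioned can be sketched like this (hypothetical example; the method name is invented). The buffer never escapes, so escape analysis lets C2 elide its monitorenter/monitorexit, making the synchronized StringBuffer effectively as cheap as an unsynchronized StringBuilder here:

```java
class LockElision {
    static String greet(String name) {
        // Synchronized class, but the instance never escapes greet(),
        // so C2 can prove single-thread visibility and drop the locks.
        StringBuffer sb = new StringBuffer();
        sb.append("Hello, ").append(name);
        return sb.toString();   // only the resulting String escapes
    }
}
```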
03:41
So this is the form of optimization that we kind of extended. Scalar replacement goes by the same sort of concept: you take the original object and break it apart into its individual parts, so you actually do away with the actual allocation. So the breaking up of the object
04:00
turns it into normal autos, or local variables, that sit on the stack. Therefore, no heap allocation there. So here is another example. Slightly different than before — maybe you'll notice the subtle difference, people that know this stuff — but we are doing the three allocations as before. Now let's move on to the next step.
04:22
And scalar replacement will come in and actually turn it into this. This is how the final program would actually behave. The integer field within the Integer object, the primitive data type, is extracted and actually stored as a local variable on the stack. Now, when can we do this? Mainly when we actually don't need
04:41
the original form of the object anymore — when we can actually prove that the object as a whole Integer is not needed. Now, let's see some of the limitations of scalar replacement. So when we looked at the code, there's a number of reasons why scalar replacement can fail.
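Conceptually, the transformation just described might look like this hand-written before/after sketch (not real C2 output; names are invented):

```java
class ScalarReplaced {
    // Before: the Integer object carries the primitive field.
    static int before(int n) {
        Integer box = n;   // allocation — but only the int field is ever used
        return box;        // auto-unboxed
    }
    // After: the field is promoted to a local; no object remains at all.
    static int after(int n) {
        int value = n;     // the Integer's int field, now a stack local
        return value;
    }
}
```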
05:00
But when we did the analysis and ran a bunch of benchmarks and workloads to see what was the main cause of scalar replacement failing in C2, it's the introduction of control flow. So, namely — this being a compiler talk — a phi, like the one over here. We have two definitions on two sides of the control flow. One is the object being instantiated,
05:23
this new class object here, my class. And on the other side, we're pulling it out of an array. Now, coming down to the last return, object.x — which one do we have in our hands? Therefore, we need the original shape of the object; we need to do a field load of that object. Because one side could maybe be scalar replaced,
05:41
but the other side will need a full bloody object coming from that array. So, how common is this issue? I'm back to my original example. The side that we have on the left here, we can scalar replace that. But that's not what typically happens with autoboxing.
06:00
The side on the right, C2 is unable to handle. Mainly because Integer.valueOf, which is what actually happens when you autobox a primitive data type, internally has the exact same pattern I showed on the previous slide. It has a compare, actually, with two bounds, minus 128 to 127, which is also configurable.
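A simplified sketch of that caching pattern (the real implementation lives in java.lang.Integer; the class and field names here are illustrative, and the -128..127 bounds are the defaults the talk mentions):

```java
class SmallCache {
    static final int LOW = -128, HIGH = 127;
    static final Integer[] CACHE = new Integer[HIGH - LOW + 1];
    static {
        for (int i = 0; i < CACHE.length; i++) CACHE[i] = new Integer(LOW + i);
    }
    static Integer valueOf(int i) {
        if (i >= LOW && i <= HIGH)
            return CACHE[i - LOW]; // pre-cached object from a static array
        return new Integer(i);     // outside the range: fresh heap allocation
    }
}
```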
06:23
If the values fall in this range, you're getting a pre-cached Integer object from a static array, which is a poor man's version of allocation elimination. And it does actually work for that range of values. But every time you go above that range or beneath that range,
06:41
you get an actual heap allocated object. So, can we make this work? There are compiler optimizations that could potentially make this little example here work. However, they have some drawbacks. One typical way we could do this is by cloning. Optimizations in the backend
07:01
could potentially say, well, we have a condition, we go either way, so why don't we just specialize the method body for this side and for that side? But say you have another branch — then it gets unwieldy, it becomes exponential, code grows insanely, so it's not actually very useful. Otherwise, you could maybe do it by code motion,
07:21
but then you're stuck with side effects. Say the object we're pulling out of the array was null — what if you had to throw the NullPointerException? It would be reported on the wrong line. So these kinds of optimizations do have limitations in how often we can actually apply them.
07:42
Now, this is what we actually came up with. We said, well, we don't actually have to scalar replace it. What if we actually allocate it, but allocate it on the stack? So the object shape is preserved, as it typically was, and it just lives on the stack just like a primitive,
08:01
with a little bit of extra stuff: it has a flags field, it has a class pointer, everything you normally would expect from an object. Now, is this useful? Well, yeah, let's consider this example. You have this loop over here. I mean, I cleverly returned the primitive data type, so my object doesn't escape, but in this case here, this loop will keep generating new objects every time around.
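The loop being described might look roughly like this (hypothetical reconstruction; names and constants are invented). Each trip boxes a value outside the valueOf cache range, so without stack allocation every iteration allocates on the heap:

```java
class HotLoop {
    static int sum(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            Integer box = 1000 + i; // outside the cache range: heap allocation
            total += box;           // Integer is immutable — fresh object per trip
        }
        return total;               // only a primitive escapes; the boxes never do
    }
}
```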
08:21
Integer is immutable, therefore a new object for every addition you do. So stack allocation — will it work on the previous example? Yeah, because on both sides, now we actually have a plain old Java object. So the object.x field load,
08:40
the getfield that happens there, has no problem existing. On one side, it will read from the stack. On the other side, it will read from the object it got from the static array. All right, so how do we implement this in C2? We come to the second part of the presentation. So Charlie Gracie and myself, inspired by the words of Kirk,
09:02
started looking into implementing this stack allocation. We had to modify escape analysis in C2 to recognize cases where we can safely stack allocate the object. Not all non-escaping objects can be stack allocated — I'll show some of the limitations later on, and there's plenty of them.
09:21
We implemented the stack allocation path in macro expansion, so we had to actually write a separate path: we removed everything else but the path where we stack allocate the object. So how do we stack allocate? And this was one of the big revelations. We actually used the BoxLock node, which was used for monitors,
09:41
mainly because we needed a way to communicate a stack oop — which was not done in any other way — from the IR back to the code generator, to say, hey, this should be a pointer reference on the stack somewhere. So right now, our stack allocated objects end up where the lock slots would be,
10:00
which is right after all the spills and locals, before the preserved registers on the frame. There was other stuff we had to worry about. As soon as we did that, we immediately got an assert in the garbage collector, saying, what the hell is this? You're giving me an oop reference on the stack? That's not right.
10:21
So we had to extend GC root scanning to support these objects, because what it looks like is there'll be a local on the stack that points to another stack location, which the GC doesn't expect, right? A more subtle issue I'll describe later is detecting live ranges of objects in loops.
10:41
And the other two items, removing the right barriers. We obviously can't do them, because you do a card mark on a stack location. It's not good. And then the other two were already being done. Similar code we found for scale replacement. So we were able to leverage that. We had to kind of similarly implement
11:02
heapification of objects on deoptimization. So at any safepoint that the allocation can reach, we had to inject this scalar-replaced-allocation node, I believe it was called, where we describe which fields need to be copied over to a heap object. So here's the GC root scanning. Typically, what you normally see is now,
11:22
below the locals, you have the pure stack allocated objects. The first field over here will be the flags field, then we have the class pointer, and some reference. So the GC needed to be taught that, well, as you walk the stack, you have a reference coming over here; you need to find all the oop fields and actually mark them,
11:41
so you don't actually lose any objects. Now this overlapping live ranges issue was kind of a subtle gotcha. And to be honest, I knew this at one point, but I forgot about it — I used to work at IBM on the OpenJ9 compiler, and we used to handle this, but I sort of had to relearn it from scratch.
12:01
It's an interesting case where, if these two objects, as we have them here, let's say, are stack allocated, as soon as we get to the definition — definition V2, where value equals result — what ends up happening is that these two addresses, which are actually addresses on the stack, become the same.
12:24
So that definition, coming back on the second iteration of the loop, will be the address that result was before. But result is stack allocated, so it's always the same address. Now all of a sudden, what ends up happening is, well, after you first enter this if, you never enter it again.
12:42
So typically, when this was a heap allocation, you would get a new address every time. You allocate from the thread local allocation buffer, or you allocate from the heap somewhere, but it's a new address. So your address comparison will work, and the location where the object is stored is different. But once it's on the stack, it's always the same — which is what we want, we actually wanna reuse it
13:03
for the purpose of fewer cache and page misses, and also to remove the allocation — but we end up with this problem. So we had to have code to detect this, and actually reject one of them as a candidate for stack allocation. One of them gets heap allocated, the other one stack allocated, fine.
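The hazard just described can be sketched as follows (hypothetical example; names are invented). With heap allocation the identity comparison sees a fresh address every iteration, but if both locals were folded onto the same stack slot, the branch would fire only on the first trip:

```java
class LiveRanges {
    static int countChanges(int n) {
        Object result = null;
        int changes = 0;
        for (int i = 0; i < n; i++) {
            Object value = new Object(); // heap: distinct address each iteration
            if (value != result) {       // identity comparison on addresses
                result = value;
                changes++;               // with a shared stack slot, this would
            }                            // happen only once instead of n times
        }
        return changes;
    }
}
```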
13:23
So on to the current limitations — we have a few right now. We don't stack allocate objects with monitors; it's kind of a side effect of using the BoxLock node. We just haven't finished the work, it's not hard to do. But some of the monitor elimination code
13:40
eventually compacts our box lock slots for stack allocated objects, so we mess up. The second one is the main issue that we have with performance right now: we do not allow stack allocated objects to point to each other. Obviously a heap parent would mean escaping,
14:00
so that's actually handled by escape analysis, but stack allocated pointing to stack allocated is not allowed at the moment. There are ways to resolve that, but right now we don't do it. We don't have compressed oops support yet. And this is mainly because you can have a merge point, one side a heap object,
14:21
the other side a stack allocated object, and it goes through an encode, gets stored compressed on the stack. Well, compressing a stack address doesn't work, because you cannot guarantee that the address range will be within that 32-bit space. We don't stack allocate arrays at the moment either — we just ran out of time.
14:41
There's no particular reason why we didn't do it. Primitive arrays would be simple; reference arrays need special consideration with array copies, so we just didn't have time to finish before this presentation. And finally — thank you, Ron Pressler — we actually may need to do something special for Project Loom here,
15:02
perhaps prevent stack allocation of objects that live across method calls, because in the mode where they do the fast relocation of the stack, it's just a simple memcpy. So if you have a reference on the stack pointing to a stack object, well, nobody's there to patch it, to update that.
15:22
So we'll get there, actually. Now some good news. These are the performance improvements that we actually got with this prototype. Being a compiler guy, for me this is amazing, because I usually would work for three, four months for a 2% or 3% improvement,
15:41
and having a range of applications actually get significant speedups is quite good to see. Stack allocated objects pointing to stack allocated objects would be another massive improvement if we get it right, because there are certain patterns in Scala that are very common that do have an object graph
16:00
pointing to each other, which we currently reject. And so, finally, the last bit of the presentation. So when and where can we see this patch? Well, nowhere right now, so Charlie and I are in the process of migrating our patches from JDK 11.
16:20
There's no particular reason why we picked JDK 11, just that we were looking at Spark and sort of continued down that path from the build we were using. We're migrating to tip, cleaning up the code, and as soon as that's done, we'll actually post this to the compiler-dev mailing list and ask for review. That's the plan. So our next steps, from our perspective,
16:43
are that we have to stabilize the prototype and clean it up. We still have a few crashes; we haven't looked at all of the methods or benchmarks we couldn't run, because there are issues. Then start working on removing the limitations, one by one — stack allocated pointing to stack allocated would probably be my first pick.
17:03
Right now, we only support G1 and Parallel GC, with an extension to mark and walk stack allocated objects, so we need to extend it and see how it actually works in other GC modes, like Shenandoah or ZGC. And finally, look for more opportunities
17:22
in other real-world applications. I wanna see if we can actually improve the various REST frameworks that are out there that people build with, and Elasticsearch, products like that. Which leads me to the end of the presentation here,
17:42
which is, if you like what you saw here, please stay in touch. We'd like to actually work with everyone here. Both Charlie and I are really new to the code base and need a lot of help to actually make this a reality. And helping us review the patch would be awesome, if anybody is willing to do that.
18:01
That's it. Thank you. Any questions? Do we have time? Yes, we do. Five minutes. Are you all completely stunned by that?
18:21
Not everybody, evidently. Yeah. So the question is, can you say something about your write barrier implementation? You said that it's always a performance win, but if the stack allocation fails,
18:41
then presumably you've got a more expensive write barrier now. That's right. And that's exactly where I was going — I actually have two appendix slides I'm gonna talk about: the reference-to-reference issue, which leads me to the stack write barrier. So we currently remove the write barriers
19:00
on stack allocated objects, because we are sure that when we make it a candidate for stack allocation, it will never become a part of something else. So if it has a field, and you write into that field, you don't need a write barrier, because nobody's ever gonna see that object — it lives on the stack. Now, the reason why we can't do stack allocation
19:22
to stack allocation is exactly this case. Take this example here, where we have two objects pointing to each other. Now we get to the bottom part, we load the original test object from the wrapper, we do t.x, everything's good, we remove the write barrier. Now what if there was another path coming in, like this?
19:43
And this actually gives us a heap oop. Now coming down here, it's either a heap or a stack object at this t1.x, so we don't know. In that case, there are two ways: we can actually detect this case with analysis and reject it, which would reject certain candidates, or we extend the write barrier
20:01
to actually look at the stack range and say, yeah, this falls within the stack, you're good, keep going, don't worry about that. So it would increase the cost of the write barrier if we did that approach. Very interesting results, thanks a lot.
20:21
Quick question, have you considered allocating such objects on the heap instead of on the stack, but just, like... No — do you mean reserve a special heap region for this kind of logic? Like with TLABs. You would have to allocate TLABs, but just for a duration,
20:42
or for a single call, or just chunk them. And by that, you can significantly simplify the requirements on the runtime. You don't need to treat special objects on the stack, since everything stays on the heap. Well, I'd have to consider how that would work.
21:02
I can't think of one off the top of my head, but yeah, we'll think about it, maybe there's a way we can actually do that. We didn't consider it, no. I think there was a question here. Just looking forward to the patch when it comes out,
21:20
because in the memory API that I showed before, we have some benchmarks that are very problematic and stress the flow-sensitive case that you were mentioning before. Yeah, especially with the immutability trend — and I love immutable objects myself — it actually makes copies every time, which we're hoping this would take care of.
21:44
Charlie. So I just want to be clear that there's still a limitation that there's no partial escape analysis here. There's no lazily standing stuff back up, okay. Yeah, absolutely. So we can look into that next.
22:01
I mean, right now we're not doing it. So anytime we see something that escapes, for us it escapes. It doesn't matter if it's a cold call or something that's never reached. Okay. Okay, well, I think we're done. Thank you. Thanks. Thanks a lot. All right.