Just-in-time compiling Java in 2020
Formal Metadata
Title: Just-in-time compiling Java in 2020
Title of Series: FOSDEM 2020 (talk 91 of 490)
Author: Martin Doerr (SAP)
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46965 (DOI)
Language: English
Transcript: English(auto-generated)
00:05
So hello, everyone, and welcome to my talk about the just-in-time compilers in the OpenJDK. I'm Martin Doerr, and I work for SAP. We are a small team working on the JVM at SAP in Walldorf,
00:21
and this time we have three talks in a row, so we will also see two of my colleagues later on, but let's get started with the agenda. So I will talk about the just-in-time compilers, which translate Java bytecode into machine code in the OpenJDK. I will also talk about how different compilers
00:42
work together, and also I'd like to address resource usage as well. So first of all, how many compilers do we have in the OpenJDK? We've already heard about two, but there's one more, so we have three.
01:02
So the one we haven't heard today yet is the client compiler, which is also called C1. It compiles pretty quickly, but with a lower optimization level, and then we have the server compiler, also called C2. We already had a talk about that one. It's kind of the opposite.
01:22
It compiles slowly, but therefore with a high optimization level. For example, it has a lot of loop optimizations, and we've also heard about the escape analysis, and there are still people working on improvements for that. Both compilers are available on a lot of platforms,
01:43
including PowerPC and S390, which are supported by our team, and these two compilers are used by default. And we've also heard about Graal, which is rather new in the OpenJDK. It is still experimental in the OpenJDK. That means it is not used by default.
02:02
You need to switch it on if you want to use it, and it is developed on GitHub. So updates get merged into the OpenJDK. What's special about Graal is that it is written in Java. That is a big difference from the other two compilers,
02:22
and that also does a lot of optimization. It has a more sophisticated escape analysis, for example. Andrew has already shown a few things about Graal, so thanks for that. And it is optimized for dynamic languages. By the way, Graal compiler is also called GraalVM compiler,
02:44
and I'd like to show a few things Andrew has already mentioned. I got one slide from Oracle. So thanks to Oracle for providing it. There are three different use cases of GraalVM, and the Graal compiler is always in the center of it,
03:02
together with the JVMCI, the Java Virtual Machine Compiler Interface. And the use case on the very left is the one which is available in the OpenJDK. So you have a Java application, or Java methods which get compiled by Graal, and they run on the HotSpot VM,
03:22
the Java Virtual Machine of the OpenJDK. So that path on the very left is supported by the OpenJDK. And in addition to that, on the right-hand side, you can see the native image technology, Andrew also already mentioned.
03:41
And everything gets compiled, pre-compiled, and there's something in between, where only the Graal compiler is pre-compiled into a shared library. So that's the basic difference between this approach and this one: the Graal compiler itself is pre-compiled into a shared library.
04:05
So back to the different compilers. I'd like to compare performance a little bit. By the way, this is an old benchmark with an old JDK and an old garbage collector, but don't worry about the exact numbers.
04:20
I think it's good to get a first impression about the performance of the different compilers. So at the bottom, you can see an interpreter for reference. So that means we are not using any just-in-time compiler, and you can get that by specifying the runtime option
04:40
-Xint. That stands for interpreter mode. So you will only use the interpreter, no JIT compilers at all. And as you can see, the performance is pretty poor. Already much faster is C1, the client compiler.
05:00
You can select that, for example, by using this flag, -XX:TieredStopAtLevel=3. That might sound a little bit complicated, and I have to note that level three means that C1 still performs profiling.
05:20
So you won't get the best performance out of C1 by that. If you want to tune C1, you would select -XX:TieredStopAtLevel=1, and then you would get C1 without profiling. But in this case, I want the better profiling information. That's why I left it on.
05:40
If you want to use the C2 compiler only, you can switch off tiered compilation with -XX:-TieredCompilation. And then you get the blue line, which is already much faster. And the default configuration uses tiered compilation, and you get the fastest startup and the best peak performance.
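As a summary, the compiler configurations compared here can be selected with the following command lines (`benchmark.jar` is a placeholder, not the benchmark used in the talk):

```shell
# Interpreter only, no JIT compilers at all:
java -Xint -jar benchmark.jar

# C1 only, with full profiling (stop at tier 3):
java -XX:TieredStopAtLevel=3 -jar benchmark.jar

# C1 only, without profiling (stop at tier 1), for best C1 performance:
java -XX:TieredStopAtLevel=1 -jar benchmark.jar

# C2 only (tiered compilation switched off):
java -XX:-TieredCompilation -jar benchmark.jar

# Default configuration: tiered compilation (interpreter -> C1 -> C2):
java -jar benchmark.jar
```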
06:00
The best peak performance also because the profiling information is better. But I'll explain the tiered compilation stuff later on in more detail. So you should be able to understand it better at the end. But for those who hate this old stuff, I have also a slide with the latest JDK.
06:21
So the same old benchmark with the latest JDK 15. And you can already see that the peak performance is better, with C2 especially. And you can also see the green line, which is new, that is Graal. In order to use Graal, you need to use this switch,
06:41
-XX:+UseJVMCICompiler, which is an experimental option, so you need to unlock experimental options in addition. And Graal is the default JVMCI compiler, so you will get Graal with this flag. So peak performance of Graal is good.
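The full command line for enabling Graal in the OpenJDK looks roughly like this (again, `benchmark.jar` is a placeholder):

```shell
# Graal is experimental in the OpenJDK, so experimental
# options need to be unlocked in addition:
java -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI \
     -XX:+UseJVMCICompiler -jar benchmark.jar
```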
07:00
Even for this very traditional workload, Graal performance is, of course, better for modern workloads. For example, if you run Scala, that's what Twitter does a lot. And I should also mention that the OpenJDK only contains the community edition of Graal.
07:24
There's also an enterprise version available, which contains more optimizations. So you will get better performance with the enterprise edition. And you can also see that the startup takes longer. It takes a couple of seconds, until here, 4.5 seconds,
07:43
roughly, to get peak performance. And that's due to the fact that Graal itself is written in Java. So the Graal compiler itself gets interpreted at the beginning. And then later on, hot methods get compiled by C1. And later on, they get compiled by which compiler?
08:04
The Graal compiler itself. So Graal compiles itself. And that takes a few seconds. So this may be okay for large server applications, where you can afford spending a few seconds. But there's also a possibility to fix that if you need a quicker startup.
08:23
And that's available with GraalVM. So the JVM has a flag called -XX:+PrintFlagsFinal. If you enable that, you will see all the flag values the VM sets for itself. And there you can also find UseJVMCINativeLibrary.
08:46
And with GraalVM, that one is true by default. And that means the JVM is using the pre-compiled shared library. So the Graal compiler is already pre-compiled, and you get a pretty good startup with that.
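The flag inspection described here can be done like this:

```shell
# Show all final flag values the VM selects for itself:
java -XX:+PrintFlagsFinal -version

# Check whether the pre-compiled Graal shared library is in use
# (true by default on GraalVM):
java -XX:+PrintFlagsFinal -version | grep UseJVMCINativeLibrary
```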
09:04
So next, I've promised to explain tiered compilation a little bit. So tiered compilation is basically the answer to the question of how these different compilers work together. As already mentioned at the beginning, everything starts at the interpreter, which is tier zero.
09:23
And then we have three different tiers for the C1 compiler. Tier one is C1 without any profiling. That is used only for trivial methods, when C1 believes it's not worth optimizing further, so we will stick with this trivial compilation.
09:44
And then there's tier two, where C1 uses reduced profiling. It does that when it thinks there's too much work to do, so we should just make it quick. And the default tier for C1 is tier three.
10:02
And you get the full profiling code compiled into the compiled method. And then finally, tier four is for the highest optimization level. It uses the C2 compiler by default in the OpenJDK. And you can replace it by Graal
10:21
if you enable it explicitly. You can also see the tiers when you enable -XX:+PrintCompilation. You can see which method gets compiled at which tier. And typically, most methods get started at tier three.
10:42
Then you get also tier four methods compiled by C2 in this case. But here's also a picture to explain that a little bit more in detail. Everything starts in the interpreter as already mentioned at the beginning. And the interpreter performs invocation counting.
11:02
And once the invocation count of a method reaches a certain level, then a compile task gets generated in the C1 compile queue. A C1 compiler thread can pick it up and create a C1 compiled method, which is a tier three method in this example.
11:21
And as already mentioned, tier three also does profiling, which includes invocation counting. So this compiled code still does invocation counting. And once a compiled method reaches this level, then a compile task gets generated in the C2 compile queue.
11:45
And similar to C1, a C2 compiler thread can pick it up and create the fastest version of the method. And this is how it works for method invocations.
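As a small sketch of this flow (the class and method names are made up for illustration, not from the talk): a method that gets invoked often enough first trips the interpreter's invocation counter, gets queued for C1, and later, via the tier-3 profile counters, gets queued for C2.

```java
public class HotMethod {
    // Invoked often enough that the interpreter's invocation counter
    // triggers a tier-3 compile (C1 with profiling), and the counters
    // in the tier-3 code later trigger a tier-4 compile (C2).
    static long mix(long x) {
        return x * 0x9E3779B97F4A7C15L + 1;
    }

    public static void main(String[] args) {
        long acc = 0;
        for (int i = 0; i < 1_000_000; i++) {
            acc = mix(acc + i);
        }
        System.out.println(acc); // keep the result alive
    }
}
```

Running this with `java -XX:+PrintCompilation HotMethod` typically shows `HotMethod::mix` first at tier 3 and later at tier 4 (the exact output format varies between JDK versions).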
12:00
But there may be long-running loops without any method invocations, and obviously the invocation counting will not help in that case. That's why there's also back edge counting, which works similarly. So it's almost the same slide, but here with back edge counters instead of the invocation counters, with different limits.
12:24
And what happens here is that the compilers generate so-called OSR methods, where OSR stands for on-stack replacement. They are special methods which have an entry point for the loop.
12:42
And on-stack replacement is called this way because an interpreted stack frame gets removed from the stack, and it gets replaced by a compiled stack frame. That's why we call it on-stack replacement. So I've already talked about compiler threads.
13:03
How many compiler threads are we using? Well, that depends on the machine we are running on. In the office, I have a 40 CPU Linux machine. And when using -XX:+PrintFlagsFinal, I can see that the VM sets CICompilerCount to 15.
13:22
That is computed by a fancy formula. And one third of them are reserved for C1, and the remaining 10 in this case are reserved for C2 threads. And similar to compiler threads,
13:43
the VM also decides on how many GC threads to use, which is 28 on my machine. And obviously, these numbers are pretty high for simple workloads. When you just do trivial things with your JVM, you don't need so many threads.
14:00
We already heard this morning that threads are expensive, so we usually don't want that. And that's why we have implemented a new feature that was contributed by us. It's called dynamic number of compiler threads. We already shipped it with JDK 11, so it's not brand new, but it's the first time
14:20
it is shown at a conference, I believe. And what we do by this new feature, we interpret these numbers as maximum numbers. So we start up to 15 compiler threads. And I'll get back to that later,
14:42
but we start one thread of each type at startup and additional threads only on demand. There's a similar feature called dynamic number of GC threads, which was already implemented by Oracle, which has switched it on by default with JDK 11.
15:03
And with that, you get, of course, much, much lower resource usage. It's still possible to switch these features off to get the old behavior. So all compiler and GC threads get started
15:20
at the VM startup. I have tuned all the memory settings to very low sizes. So the JVM should actually not use a lot of memory, but you can see virtual memory is pretty high here. And that's because of the threads, they reserve a lot of virtual memory
15:42
or they occupy virtual memory on Linux due to the glibc. And if you don't switch off these new features, you can see we get a much lower virtual memory usage: it goes down from six gigabytes to 1.5.
16:03
But it's not only about virtual memory. We, of course, also save other resources. But you can also trace compiler threads with this flag, TraceCompilerThreads. It's a diagnostic flag, so you need to unlock
16:22
diagnostic options to enable it. And as already mentioned, you can see that the JVM starts initially one compiler thread of each type, so which is one C2 and one C1 thread. And they get kept alive for the whole lifetime
16:41
of the JVM. And the other threads only get added on demand. That depends on the compile queue length and also on the available memory and code cache space, which is available because we don't want to mess up things when the memory is already full.
17:00
We don't want to start any further threads. And once these compiler threads don't have any work left to do, they will die after some time. And they die in the reverse order they were generated. So we don't have any gaps in the compiler list.
17:21
So that's the feature we are already using. And one remark on the memory usage of the compilers. C1 and C2 compilers, of course, use native memory. And in comparison to that,
17:41
the Graal compiler uses Java heap. So that may be an issue because your Java application uses the same heap, and you may need to configure a larger heap with the -Xmx flag. Otherwise, you may get out-of-memory issues.
18:03
And it is also solved by using this shared library because that uses a separate heap, which is part of the native image technology. So it doesn't use the regular Java heap,
18:21
which you want to use for your application. So that's already it. What I wanted to tell, maybe a few remarks. It is also possible to configure the compiler threads to use lower memory. For example, you can tune inlining. But of course, that may have performance implications.
18:43
And it is also possible to set a node limit for the C2 compiler that will make it smaller or will limit the memory it uses. But of course, that always has side effects. So I wouldn't recommend that in general.
19:00
So I'm sure we have time for questions left. Excellent. Any questions? Yeah. We need a microphone. I see, still here.
19:24
I was just wondering what the compiler thread count and heap size, or virtual memory size, or whatever sizes look like when you force tier one, when you only run with C1.
19:42
I would assume it's fewer threads and less heap, but you didn't cover that. The virtual memory issue is due to the malloc arenas from glibc. And the first allocation already occupies a 128-megabyte block.
20:01
It's not really used. It's only virtual memory. So in most cases, it's not really a problem. But that is independent of which compiler it is or which thread it is. It also happens with Java threads or with any other thread. There's also another way to fix that. You can configure glibc to use fewer malloc arenas.
20:26
There's a MALLOC_ARENA_MAX environment variable, and you can limit the memory by using that. That may have an impact on other performance aspects, because if you have many native threads
20:42
which perform a lot of concurrent mallocs, you may get issues with that. But for the JVM itself, it works pretty well. We have tried that. We have experimented with using only one malloc arena, and the JVM itself still works quite okay because it has its own memory management,
21:02
and we don't perform so many small concurrent mallocs. Good question, actually. Thanks. Further questions? Yes.
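The glibc workaround mentioned in this answer can be sketched as follows (the arena count of 1 is the experimental setting described above, not a general recommendation):

```shell
# Limit the number of glibc malloc arenas, which reduces the
# virtual memory reserved per native thread:
MALLOC_ARENA_MAX=1 java -jar benchmark.jar
```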
21:32
So I have a question. For the server compiler and the client compiler, the code cache is managed by the sweeper.
21:41
Is the same mechanism implemented for the Graal VM? The sweeper has a separate thread, so it's no longer a part of the compiler threads. So I'm not aware of any relationship between Graal compiler and the sweeper.
22:00
Maybe Andrew has a few thoughts about that. I'm not sure. No. So the short answer is?
22:20
I'm not absolutely sure about that. So I wouldn't want to say unless I could be sure, but I do know that there's a change made to allow external code segments not in the original code cache, and they're wrapped with a stub that points to them. So Graal is managing some of its own memory, I think,
22:41
and I'm not sure how that gets reclaimed, but Graal does know about deoptimization events. Maybe also there's a way they can find out about the fact that something has been released, and there's a release protocol. I just don't know for sure. Okay, thanks. I'm not really a Graal expert. I've worked a lot on C1 and C2, but not so much on Graal.
23:03
But related to the code cache, there was a significant change back in the past. We only had the sweeper run by the compiler threads, and in the meantime, we have a dedicated sweeper thread.
23:23
More questions? So I think we're done. Thanks, everyone, for your attention. Thank you.