Project Loom: Fibres, Continuations, and Structured Concurrency for the JVM
Formal Metadata

Title: Project Loom: Fibres, Continuations, and Structured Concurrency for the JVM
Series: FOSDEM 2020, Part 99 of 490
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46956 (DOI)
Transcript: English (auto-generated)
00:05
Thank you, Mark, who gave an admirable introduction to some of the motivation behind Project Loom. I'll go over that a little bit just to add a little bit more, but I'll also talk a bit about how this stuff actually works.
00:27
So, one of the marvelous things about Java, well, there were two marvelous things about Java in its early incarnation. Firstly, that there was essentially no innovation in Java. There was nothing new. It was just a practical synthesis of a whole bunch of stuff that already existed in other programming languages.
00:47
A lot of them were just research languages, though. The genius of Java was to come up with a synthesis of some of the best ideas in computer science in a practical language that could be used by normal human beings.
01:00
That was a tremendous achievement. And one of the things that Java did was concurrency: multithreading. To actually get a portable definition of what a thread actually was, one that didn't really depend on any particular hardware implementation of threads, was a considerable achievement at the time.
01:24
But as Mark said, we're 25 years down the road and the cracks are beginning to show. Servers in particular, but all sorts of Java applications, are running a lot of threads. And manual use of threads, i.e. using java.lang.Thread by hand, is clumsy.
01:45
It's error-prone. You can end up with a whole load of threads hanging around, not doing much. And they're also very heavyweight, because they rely on kernel lightweight processes. Now, a kernel lightweight process in Linux is not really our idea of what lightweight actually means.
02:05
Even on a 32-bit system, it can sometimes allocate a megabyte of address space just for the thread-local heap. So what we actually need, again, as Mark said, is some more lightweight representation.
02:25
We don't want the kernel to have to preempt threads. We want to do it in user space. Because those of you who have measured it will know that on any kind of Unixy Linuxy system, it takes an ungodly long time to get into the kernel and back out again,
02:43
usually about a microsecond or so. And preempting threads is quite expensive. All right, I should give Ron credit at this point. These are a couple of the slides that I stole from you, so thank you. I'm sure you recognize them. And Java threads don't need a lot of this stuff.
03:02
So again, as Mark said, programmers have responded to this by using reactive programming. So you have lots and lots of little tasks. These all run asynchronously, and they don't block. Whenever you would need to block, what you actually do is send a message out. So you will have your code, and it will send a message to the database server saying,
03:25
can I have a lookup for this, please? And the database server would then send a message to somebody else who would, in turn, respond to the result of the query. There are people who like doing this stuff.
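To make that concrete, here is a minimal, self-contained sketch of the callback style being described, using CompletableFuture; the lookup method is a hypothetical stand-in for the database query.

    import java.util.concurrent.CompletableFuture;

    class ReactiveSketch {
        // Hypothetical stand-in for "send a query to the database server".
        static String lookup(String key) { return "value-for-" + key; }

        public static void main(String[] args) {
            // Nothing blocks the calling thread: the result, or the error,
            // arrives later as a callback on some shared pool thread.
            CompletableFuture<String> result =
                    CompletableFuture.supplyAsync(() -> lookup("user:42"))
                                     .thenApply(value -> "handled " + value)
                                     .whenComplete((value, error) -> {
                                         if (error != null) System.err.println("lookup failed: " + error);
                                         else System.out.println(value);
                                     });
            result.join();  // only so this demo waits before the JVM exits
        }
    }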
03:42
But it's quite difficult to write. It's very difficult to understand. Debugging it is hilarious. But this was done a while back. This was done in the Smalltalk-80 user interfaces back in, well, 40 years ago, where the system would be forever sending you messages,
04:03
which you would have to handle some way. But it's difficult to do, despite it being pretty efficient. We don't need to have so many threads. We just schedule everything onto thread pools. Now, if your thread pool is the same size as the number of actual cores you have
04:20
in your physical machine, that works beautifully, because the operating system never needs to preempt any threads. So I've got a bit ahead of myself now. Oh, yes, the unit tests are difficult to maintain. So why not reduce the weight of instances of Thread?
04:41
Java threads don't need very much context, really; certainly an awful lot less than kernel threads do. And we should be able to get the total footprint of a thread down to a few hundred bytes. This has been done in the past. I mean, there were systems going back even as far as the 1960s, I believe, with super-lightweight threads to do this sort of thing.
05:02
So there is a fair bit of prior art here. But how can we bring this to Java? So, yes, there are OS threads: heavyweight, megabytes. I've said all of that. This is what we can do. We can get user-space threads down to a few hundred bytes.
05:22
The important thing here is that the stack for a user-space thread is just going to be a few hundred bytes, whereas the smallest stack that a kernel thread can possibly have is one kernel page.
05:40
And these are allocated lazily as you grow your stack. Clearly, if you can fit several thread stacks into a single page, which you certainly can (pages are typically 4K, but they may be as big as 64K), you really are winning big time. Right.
06:03
But thread locals and Thread.currentThread() are used all over the place. So there are some parts of the API that are no longer a match for this lightweight way of working. And Thread could be cleaned up by removing long-deprecated methods.
06:23
But as we all know, it's extremely difficult to remove anything from Java. So we really need a better abstraction for all of this. So here we are, virtual threads. Let's implement just what we need. Let's switch between threads in user space. Let's only keep as much stack as we need.
06:42
Now, we won't be able to block in native code, because that requires a full kernel task switch, but there is a way around this. There have been a few changes in the project recently, but most particularly, a virtual thread is now a subclass of java.lang.Thread.
07:02
Now, when a virtual thread is running, it is mounted on a carrier thread. This is very important. So your carrier thread will just be a thread out of your usual thread pool. And when we switch virtual threads, we have to dismount the virtual thread from its carrier thread.
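As a minimal sketch, creating and joining a virtual thread looks like this; note that Thread.startVirtualThread is the name that eventually shipped, and the builder API was still changing at the time of this talk.

    public class VirtualThreadSketch {
        public static void main(String[] args) throws InterruptedException {
            // A virtual thread is a java.lang.Thread; while it runs it is
            // mounted on a carrier thread from an internal pool.
            Thread vt = Thread.startVirtualThread(() ->
                    System.out.println("running on " + Thread.currentThread()));
            vt.join();  // wait for the virtual thread to finish
        }
    }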
07:21
Yeah, these slides are in a slightly funny order. The problem is you've got your stack which is sitting on a carrier thread. This is the actual stack that's been provided to us by the kernel, and the stack is the record of execution of your Java program.
07:41
What we actually need to do is to detach the stack frames from this stack, move them into the Java heap somewhere, and then mount another virtual thread onto the same carrier thread. And to do that, we have to copy the stack. Now, when I first saw this, I was appalled.
08:04
You mean every time we dismount a thread for any blocking call at all, we're having to copy the entire stack around? How can this possibly make any sense? Shouldn't we just allocate stacks for virtual threads on the heap instead,
08:22
rather than, I mean, that's the obvious other way of doing it, rather than allocating them in the stack space? Now, it sounds like copying this stack is going to be a fabulously expensive operation, but it actually isn't. There's a couple of reasons for this.
08:41
One is that our computers are tremendously good at bulk copying. Anybody who's spent a while observing the computer programs that actually run on our computers every day will have observed that they spend most of their time just moving stuff around. That's the nature of computers. That's the nature of how they're used.
09:02
And therefore, the people who design the computers that we use have gone to extraordinary lengths to make just moving stuff around very fast, particularly with caches and particularly the accessing dynamic RAM sequentially
09:20
is very, very quick, and it does prefetching and so on. But we don't actually have to copy the entire stack when a virtual thread is unmounted. Now, all we actually have to copy are the frames that have been altered since last time we unmounted the virtual thread. The details of precisely how this works are kind of gnarly,
09:40
and this is return barriers, thank you, if you want to give it a name. But at that point, you're starting to see why this is a cheap operation. The other thing to observe is that Java stack frames are small, and they're small because you don't have local strings,
10:02
you don't have local arrays, you don't have any of this stuff. All that is in a Java stack frame are your local variables, and your local variables are always either scalars or references to an object somewhere else. So the frames are small. Copying them on and off when we unmount is pretty cheap.
10:22
But we can't get away with that with native code. We can't unmount stacks with native frames. The reasons for this are quite complicated, but the problem is that native stacks often contain pointers that are pointing into the stack frame,
10:41
and we'd have to do some really fancy footwork of relocating the stack frame if we wanted to unmount a virtual thread from a stack over here and copy it to a carrier thread over there. All right, so let's say we've got
11:01
an unmounted thread over here somewhere. What do we do about object pointers? Because we know that saved on the stack there will be a whole bunch of object pointers, and the garbage collector, this is Java, is moving stuff around all the time.
11:21
How will the garbage collector be able to do that? Because if you look at the structure of this class here, java.lang.Continuation, I should just explain. A continuation is basically just the running context of a Java program. A virtual thread is composed of its continuation
11:41
plus a bit more stuff. So when we save the stack, we just copy it into this Java array here as just an array of ints. But the garbage collector is going to want to continue to run, and it's going to move objects around, which is going to invalidate some of the pointers
12:05
in that int array that's the copy of the stack. And what we used to do was to scan the whole stack, find out all of the words in the stack that were in fact object pointers, and copy those into a separate object array,
12:25
and we'd expose that to the garbage collector, and then when we remounted the virtual thread, it would copy them all back. Now the problem with this is that actually finding out which words in the stack are object pointers
12:42
and which words are just integers is really quite an expensive operation. You have to trawl through the metadata of all the methods that are on the stack. Granted, the result is just a bitmap: this word is either an object pointer or it's not. But actually all that scanning and un-scanning and so on
13:01
was really quite painful, considerably more painful, I have to say, than the business of just copying the data into and out of the array. But we have a new algorithm, which I think Ron implemented two weeks ago or something, where the garbage collector can actually scan what's in there as long as it stays in the new generation.
13:24
If it stays in the old generation, sorry, if the virtual thread gets promoted into the old generation, I think then we have to do the whole thing of scanning the stack and setting up the pointers and so on. I think that's Ron nodding or not.
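Stepping back, here is a rough sketch of the continuation primitive being described. This is internal machinery, not a supported API: the class started life as java.lang.Continuation and later moved to jdk.internal.vm (so running it needs --add-exports), and the exact names here are an assumption based on the Loom prototype.

    import jdk.internal.vm.Continuation;
    import jdk.internal.vm.ContinuationScope;

    public class ContinuationSketch {
        public static void main(String[] args) {
            ContinuationScope scope = new ContinuationScope("demo");
            Continuation cont = new Continuation(scope, () -> {
                System.out.println("part one");
                Continuation.yield(scope);  // unmount: live frames are copied into the heap
                System.out.println("part two");
            });
            cont.run();                     // runs until the yield, then returns to the caller
            System.out.println("parked");   // a scheduler could run other work here
            cont.run();                     // resume: frames are copied back onto the carrier's stack
        }
    }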
13:40
Good enough, right? Okay, synchronized blocks. Now, those of you unfortunate enough to have actually written a Java VM at the very, very low level will know that the way that synchronized blocks work is some very, very hairy handwritten assembly code,
14:03
which makes assumptions that this really is running on the native stack and you can block and you can call into the operating system and so on. We can't do anything about this with virtual threads. If you actually say synchronized and the synchronized block has to block,
14:20
then you are going to block the carrier thread, which you really, really don't want to do, because now you've got one fewer thread that you can use to do some work. But people are more and more these days using the locks from java.util.concurrent rather than just synchronized blocks, and these work perfectly well with Loom virtual threads.
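In other words, the guidance at the time of this talk was to prefer a java.util.concurrent lock over synchronized for anything that may block under virtual threads; a minimal sketch:

    import java.util.concurrent.locks.ReentrantLock;

    class Counter {
        private final ReentrantLock lock = new ReentrantLock();
        private long count;

        // Blocking on a j.u.c lock unmounts the virtual thread, freeing its
        // carrier thread to run something else.
        void increment() {
            lock.lock();
            try {
                count++;
            } finally {
                lock.unlock();
            }
        }

        // Blocking on a monitor inside synchronized pins the carrier thread
        // (as described above; this was the state of things when the talk was given).
        synchronized void incrementPinned() {
            count++;
        }
    }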
14:42
They unmount the virtual thread if they block, so that works fine. So we've had to go through the Java I/O library replacing these synchronized blocks; it only has to be done once. And Thread.yield hands off to Continuation.yield, unmounting the virtual thread. All good. Now, the next bit is,
15:02
it's called possible futures here. That's not really right. The first one is structured concurrency, which I think is definitely going to happen. The second one is scope locals, which may or may not happen, but I want to talk about that because it's mine,
15:20
because I did it, and I think it's interesting. Okay. So let's think a bit about structured programming. In traditional structured programming, all your control structures have one way in at the top and one way out at the bottom. You can reason about programs much more easily if you use structured programming. Everything nests nicely.
15:44
When you think about what's actually going on with threaded programming, it is the most gloriously unstructured way of programming you can possibly imagine. Not only is it not one in, one out: there's one in and there are many out, and threads spawn over here,
16:01
and they start running over there, and then they send messages to each other. And if, with Project Loom, you're going to have tens of thousands of threads, or hundreds of thousands of threads, which we can do because they're only a few hundred bytes, we somehow have to find a way to constrain that complexity
16:20
in such a way that we can predict how a program is going to work, we can analyze it, and so on. So simply firing off thousands and thousands of threads and passing messages back and forth and so on probably isn't going to work all that well. Yes, old spaghetti Fortran programs have got nothing on this.
16:41
So here we are, structured concurrency. This is the idea, and it's a very, very simple idea, that if you have a thread and it splits into a whole bunch of other threads, then we want to have a join at the bottom when all of the other threads terminate, and we carry on.
17:01
Whoop-de-doo, this is structured. And what's more, if the threads have just done some purely functional computation for you and they all join together at the end, what you've actually got there is a function. You can analyze it, you can reason about it. It has no side effects, but you've been able to use concurrency
17:21
to make it more efficient. So here we are. Here is an example of a structured concurrency construct. This is your executor service here. You submit two tasks to execute,
17:43
and this is a try-with-resources, so when you get to the end, they both join together, and it won't terminate until they're both finished. The executor's submit returns a future that can be queried for the results of whatever computation you were doing there.
18:02
I don't handle it there, but you would need to assign it to something. Error handling and cancellation work much better because everybody joins. Therefore, it's the responsibility of the join point to handle anything that went wrong in any of the threads that you just launched. And also, thread cancellation is pretty cool as well
18:23
because you can either cancel one of the child threads or you can cancel the parent thread, in which case it'll all get propagated, and so on and so on. Now, this doesn't require virtual threads. This structured concurrency works perfectly well with any kind of thread, but it's very nice for this kind of work.
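A minimal sketch of the construct just described. The factory name is an assumption: the Loom prototype and the API that eventually shipped differ (later JDKs spell it Executors.newVirtualThreadPerTaskExecutor(), and structured concurrency later grew a dedicated StructuredTaskScope API); queryUserService and queryOrderService are hypothetical child tasks.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class StructuredSketch {
        String handleRequest() throws Exception {
            // try-with-resources: when the block exits, close() waits until
            // both submitted tasks have finished (the join at the bottom).
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                Future<String> user  = executor.submit(this::queryUserService);
                Future<String> order = executor.submit(this::queryOrderService);
                return user.get() + " / " + order.get();  // results and errors surface here
            }
        }

        // Hypothetical child tasks, each running in its own virtual thread.
        String queryUserService()  { return "user";  }
        String queryOrderService() { return "order"; }
    }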
18:41
And so I've been looking at how far back this goes, and I'm fairly sure that the Burroughs large systems of the 1960s worked this way. So, yeah, it's good. Okay, now, thread locals. Java's ThreadLocal is kind of heavyweight, it's slow, and all the rest of it.
19:01
Now, I'm going to try to open a link. Let's see.
19:22
Here's one I opened earlier. This is an analysis that I did a couple of years ago of what actually happens when you say thread local get. And there's a tradition of my talks at FOSDEM, none of them are complete without some assembly language.
19:42
Okay, thank you. This is what ThreadLocal.get actually does. It does a whole bunch of reads from thread metadata: it reads the thread's threadLocals table, its length field, the ThreadLocal's hash code, looks it up in the table, does the garbage-collector magic because it's a weak reference,
20:02
we now have our thread-local value, and we're done. So ThreadLocal.get is 12 field loads and five conditional branches. This is not by any standard a lightweight operation. So this was a couple of years ago, and the question is: can we actually do any better than this?
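For reference, here is a simplified, self-contained model of the lookup path just described; it only mirrors the shape of the real code, which lives in java.lang.ThreadLocal and its package-private ThreadLocalMap.

    import java.lang.ref.WeakReference;

    final class ThreadLocalGetModel {
        // Entries hold the value and weakly reference the ThreadLocal key,
        // which is where the "garbage collector magic" comes from.
        static final class Entry extends WeakReference<Object> {
            final Object value;
            Entry(Object key, Object value) { super(key); this.value = value; }
        }

        // Stand-in for Thread.currentThread().threadLocals and its table.
        static Object get(Object key, Entry[] table) {
            if (table == null) return null;                                 // branch
            int index = System.identityHashCode(key) & (table.length - 1);  // load the length, mask the hash
            Entry e = table[index];                                         // load the slot
            if (e == null) return null;                                     // branch
            if (e.get() != key) return null;                                // branch: weak-reference check
            return e.value;                                                 // finally, the value
        }

        public static void main(String[] args) {
            Object key = new Object();
            Entry[] table = new Entry[16];
            table[System.identityHashCode(key) & 15] = new Entry(key, "hello");
            System.out.println(get(key, table));  // prints: hello
        }
    }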
20:21
I kind of hope we can. So this is the proposal for scope locals, and the idea here is to do something rather similar to structured concurrency. You will declare, sorry, you will bind a scope local at some point in your program's nesting.
20:43
It will then be visible to everything you call by doing your scope local get, and it will disappear when you exit the same scope. They will also be inherited by your child threads
21:00
in your structured concurrency. Now, it makes sense for them to inherit it. It doesn't make very much sense for these scope locals to be mutable. I don't think it makes very much sense for thread locals to be mutable either, frankly, but I don't think anybody understood that when they were done 15 years ago or whatever it was.
21:25
Okay, and here's what it looks like. You would launch your child tasks, but you would bind a value to your integer there, and then it executes foo and bar,
21:41
and then in foo you would say myInt.get. So it's like a thread local, but we've made it better by making it less powerful. This is a crucial observation about a lot of this stuff: we've got interfaces and constructs and so on that are tremendously powerful, but in many ways they are too powerful,
22:01
and the sheer versatility of these constructs gets in the way of doing really efficient implementations, and it also gets in the way of the programmer being able to reason about invariants and so on.
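Here is a sketch of the usage just described, written with the ScopedValue API that this scope-locals proposal eventually became (a preview API in recent JDKs; at the time of the talk the class name and methods were different):

    public class ScopeLocalSketch {
        // Declared once; each binding is immutable within its scope.
        static final ScopedValue<Integer> MY_INT = ScopedValue.newInstance();

        public static void main(String[] args) {
            // Bind a value for the dynamic extent of run(): it is visible to
            // everything called from inside, and gone when the scope exits.
            // Bindings are also inherited by child threads in structured scopes.
            ScopedValue.where(MY_INT, 42).run(() -> {
                foo();
                bar();
            });
        }

        static void foo() {
            int v = MY_INT.get();  // like ThreadLocal.get(), but scoped and immutable
            System.out.println(v);
        }

        static void bar() { }
    }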
22:21
So, scope locals: a fixed-size local cache is the magic. What I've done is that every carrier thread has a fixed-size cache of 16 entries where the most recently used scope locals are stored, and from the carrier thread we load a pointer to that cache. C2 is plenty clever enough to hoist scope locals into registers.
22:42
Now that means the 12 loads and five conditional branches that I showed you for ThreadLocal can be reduced to just a couple of loads; and if scope locals are used in a loop, they will be hoisted into registers at the start of the loop and they will stay there. So this is what it looks like.
23:03
You've got a carrier thread here which has this 16 entry cache for scope locals. You've got virtual threads. Every one of the virtual threads will have a scope local hash map for its local bindings. When your program queries a scope local,
23:20
it will search through the virtual thread's bindings, through its parents, through the structured concurrency. Essentially it looks like a cactus, if you can imagine, with multiple branches. It loads the value into the cache, and the next time that scope local is queried, it will find the value in the cache and lift it,
23:44
and that's very, very fast. Worst case is basically just a couple of instructions if you've got a cache hit for your scope local. This works because scope locals are immutable. This is absolutely crucial.
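A conceptual model of that lookup, assuming nothing about the real HotSpot code: a small per-carrier cache sits in front of the chain of immutable bindings, and immutability is what makes it safe to cache them at all.

    final class ScopeLocalCacheModel {
        // An immutable binding plus a link to the enclosing scope's bindings
        // (the "cactus" of parents mentioned above).
        record Binding(Object key, Object value, Binding parent) {}

        static final int CACHE_SIZE = 16;  // the per-carrier-thread cache from the talk
        final Object[] cachedKeys   = new Object[CACHE_SIZE];
        final Object[] cachedValues = new Object[CACHE_SIZE];

        Object get(Object key, Binding bindings) {
            int slot = System.identityHashCode(key) & (CACHE_SIZE - 1);
            if (cachedKeys[slot] == key) {
                return cachedValues[slot];              // fast path: a couple of loads
            }
            // Slow path: walk up through the bindings, then fill the cache.
            for (Binding b = bindings; b != null; b = b.parent()) {
                if (b.key() == key) {
                    cachedKeys[slot] = key;
                    cachedValues[slot] = b.value();
                    return b.value();
                }
            }
            throw new IllegalStateException("scope local not bound in this scope");
        }

        public static void main(String[] args) {
            Object myInt = new Object();                    // stands in for a declared scope local
            Binding scope = new Binding(myInt, 42, null);   // bound somewhere up the call chain
            ScopeLocalCacheModel carrier = new ScopeLocalCacheModel();
            System.out.println(carrier.get(myInt, scope));  // slow path, fills the cache
            System.out.println(carrier.get(myInt, scope));  // cache hit
        }
    }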
24:01
Immutability is fantastic because once you've guaranteed that something is immutable, you can copy it, you can cache it, you can do all of these wonderful things. So by making thread locals less powerful, giving them a fixed lifetime, doing it in the simplest possible way,
24:20
what we've got is tremendously improved performance. So effectively the cost of a scope local get is about the same as the cost of a load of a field from an object. And I'm done.