
Nim Concurrency & Parallelism


Formal Metadata

Title
Nim Concurrency & Parallelism
Subtitle
Past, Present and Future
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
This is a talk about Nim's concurrency mechanisms, how the old things worked, how the current things work and what the future holds.
Transcript: English (auto-generated)
Hello and welcome to my talk, Nim Concurrency and Parallelism. I'm Andreas Rumpf, the inventor and lead developer of the Nim programming language. This talk is about Nim's parallelism and concurrency story:
the past, the present and the future. I assume that you have some basic familiarity with the Nim programming language, but even if you don't, the slides should be easy enough to understand. Okay. Already for Nim version 1 we added a mechanism for parallel programming in the form of a thread pool.
But before I go into the thread pool, let's look at a concrete example. Say I want to count words: for every word in a file, how often does it occur?
I make extensive use of Nim's standard library: the tables module already has a CountTable that can do the counting for us. Basically, I read the file into a single string and run split on it to compute the word boundaries.
Then I tell the count table to inc the word, which means to count it. I have four different text files and I want to combine the word counts of every single one of them; that is what the merge operation does.
So I have a global count table and a separate table for every file. I merge the per-file results into the global one, then I sort the table and output the largest entry, which is the word used most often across all four text files.
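(The slide code itself is not part of the transcript; the following is a minimal sketch of the sequential version being described. The file names are placeholders.)

```nim
import std/[tables, strutils]

proc countWords(filename: string): CountTable[string] =
  # count every word in one file
  result = initCountTable[string]()
  for word in readFile(filename).splitWhitespace():
    result.inc word

proc main() =
  var total = initCountTable[string]()
  for f in ["a.txt", "b.txt", "c.txt", "d.txt"]:  # placeholder file names
    total.merge(countWords(f))                    # combine the per-file counts
  total.sort()                                    # sort descending by count
  echo total.largest                              # the most common word overall

main()
```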
The goal now is to make this run in parallel: we have four files, so every CPU core could work on a single file. For this we have the thread pool: you import threadpool and it gives you the spawn macro.
It takes a function call and produces a flow var. A flow var is basically the same as a future, but in Nim, Future is already reserved for asynchronous programming, so we had to pick a new term; FlowVar is short for data flow variable.
Data flow variables have some nice properties, which I will get into. The spawn macro transforms the call into something like: create a task from f and its arguments,
run this task on the thread pool and, not wait for, but take the result handle out of the task, so that you can read from it eventually.
Flow vars can be assigned only once, which by design prevents data races. And there is a read operation, spelled ^x in Nim (the "roof" operator), which blocks until the result is available. Now we can use these flow vars and the thread pool to parallelize our code.
In this case we need to watch out that we don't read from the results too early. First we spawn all the tasks, and only once we have spawned all of them do we combine their results.
That combining step is a blocking read, and this is all we had to do to parallelize it. It is basically a three-line change and the code stays easy to understand.
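(Again a sketch rather than the original slide: one spawn per file, then blocking reads to merge. It assumes the classic std/threadpool and a build with --threads:on.)

```nim
import std/[tables, strutils, threadpool]

proc countWords(filename: string): CountTable[string] =
  result = initCountTable[string]()
  for word in readFile(filename).splitWhitespace():
    result.inc word

proc main() =
  var futures: seq[FlowVar[CountTable[string]]]
  for f in ["a.txt", "b.txt", "c.txt", "d.txt"]:
    futures.add(spawn countWords(f))   # schedule one task per file
  var total = initCountTable[string]()
  for fut in futures:
    total.merge(^fut)                  # blocking read of each FlowVar
  total.sort()
  echo total.largest

main()
```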
Here I have a different example: computing pi with these flow vars. There is a mathematical series you need to sum in order to approximate pi. Again I call spawn, which stores a flow var; and since a flow var is a future, the spawn itself does not block. The blocking happens afterwards, once all the tasks have been spawned and we read the results.
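(A sketch of the FlowVar version, essentially the classic pi-approximation example; the number of terms is arbitrary.)

```nim
import std/[threadpool, math, strutils]

proc term(k: float): float = 4 * pow(-1, k) / (2*k + 1)

proc piApprox(n: int): float =
  var futures = newSeq[FlowVar[float]](n)
  for k in 0 ..< n:
    futures[k] = spawn term(float(k))   # spawning does not block
  for k in 0 ..< n:
    result += ^futures[k]               # the blocking reads happen here

echo formatFloat(piApprox(5000))
```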
Now there is a different form, also provided by the thread pool, which is called the parallel statement.
It is much more complex to use, but it has the benefit of not introducing these flow var objects, because a flow var is actually a heap-allocated object and can be quite expensive.
Here is how to compute the same thing with the parallel statement. Inside the parallel statement there is again a spawn, but this time it doesn't produce a FlowVar[T]; instead the result is stored into the target memory location directly.
The compiler does a complex analysis to ensure this is still safe: that you do not access this memory location until the result is ready, which basically means you cannot read from it until the parallel block has finished.
And here the roof operator is gone: it is just a sequence of floats, there is no flow var involved, so we can take the floating point value directly.
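(The corresponding sketch with the parallel statement, close to the pi example in the Nim manual; the experimental switch is required.)

```nim
import std/[threadpool, math, strutils]
{.experimental: "parallel".}

proc term(k: float): float = 4 * pow(-1, k) / (2*k + 1)

proc piApprox(n: int): float =
  var partials = newSeq[float](n)         # plain floats, no FlowVars
  parallel:
    for k in 0 .. partials.high:
      partials[k] = spawn term(float(k))  # result is written in place
  for k in 0 .. partials.high:
    result += partials[k]

echo formatFloat(piApprox(5000))
```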
Comparing the two implementations, on my machine at least the parallel version is roughly two times faster. Okay, here is another thing you might want to do with this parallel construct: splitting work into chunks and having each CPU work on a separate, disjoint piece of data.
On the slide there is a linear find operation. It simply traverses the array and looks for x, and if we don't find it, we return the highest integer value, which means "not found".
Then we have a huge sequence; well, "huge", there are 1000 elements. The idea is that we take this linear find operation and run it four times in parallel, and we combine the results with a minimum, so that we get the first index at which the element was found.
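(A rough sketch of that idea; the exact form the old disjointness checker accepted may differ from this, so treat it as illustrative.)

```nim
import std/threadpool
{.experimental: "parallel".}

proc linearFind(a: seq[int]; x, offset: int): int =
  # index of x in the original sequence, or high(int) for "not found"
  for i, y in a:
    if y == x: return i + offset
  result = high(int)

proc parFind(data: seq[int]; x: int): int =
  var results: array[4, int]
  let chunk = data.len div 4
  parallel:
    results[0] = spawn linearFind(data[0 ..< chunk], x, 0)
    results[1] = spawn linearFind(data[chunk ..< 2*chunk], x, chunk)
    results[2] = spawn linearFind(data[2*chunk ..< 3*chunk], x, 2*chunk)
    results[3] = spawn linearFind(data[3*chunk ..< data.len], x, 3*chunk)
  result = min(results)   # the smallest index wins; high(int) means "not found"
```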
As you can see there is some arithmetic involved here, and the idea is that Nim proves that these are disjoint memory accesses.
So there is no need for locks or atomics at all. This is basically how GPUs work, so the idea was to take this parallel construct and map the sub-language it supports to GPUs.
However, unfortunately, we never found the time to really do that. OK, one more example, to show how this can get complex pretty quickly. Here I have an array with 10 elements and I call f on the slice 0..2.
Then I start a counter at the value 3, which is disjoint from 2, and go until the end, spawning f(a[i]) and f(a[i+1]).
This is disjoint because I then increment i by 2, and the compiler understands that and accepts it. If you move the inc between these spawn calls, there will be an error, because then it really is incorrect. So this is genuinely impressive machinery.
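(Roughly along these lines; the exact loop shape the old checker accepted may have differed, so this is only a sketch.)

```nim
import std/threadpool
{.experimental: "parallel".}

proc f(a: openArray[int]) = discard   # placeholder workers
proc f(a: int) = discard

proc main() =
  var a: array[0..9, int]
  parallel:
    spawn f(a[0..2])
    var i = 3
    while i <= 8:
      spawn f(a[i])
      spawn f(a[i+1])
      inc i, 2   # moving this `inc` between the two spawns is rejected:
                 # the accesses would no longer be provably disjoint

main()
```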
So why did we never develop this much further? Well, this is how it worked: there is a proof engine in the compiler that understands integer arithmetic and understands what it means for things to be disjoint, with these inequalities involved.
Basically everything is translated into a set of constraints built from less-than-or-equal comparisons, and there are axioms and rules for computing whether things are still disjoint or not.
The primary idea is that two slices a[x..y] and a[c..d] overlap if and only if x <= d and c <= y. So you have this less-than-or-equal operation, and you need axioms to reason about what it means.
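(The overlap rule written out as ordinary code, just for illustration.)

```nim
# slices a[x..y] and a[c..d] overlap if and only if x <= d and c <= y
proc overlaps(x, y, c, d: int): bool =
  x <= d and c <= y

assert overlaps(0, 4, 3, 7)       # [0..4] and [3..7] share 3 and 4
assert not overlaps(0, 4, 5, 9)   # [0..4] and [5..9] are disjoint
```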
The problem is that this is really hard to teach to people. It is also hard to ensure that the compiler gets it exactly right, and that it supports exactly the idioms you want to use in your parallel program.
Even worse, many, many people want this spawn; they see spawn, think it is the same as Go's go keyword, and use it for concurrency. But our thread pool wasn't designed for concurrency,
it was designed for parallelism. So they would get deadlocks. They want to say: yes, I spawned this task, but it is a long-running one and maybe I want to cancel it eventually. And there was no support for that, unfortunately.
We then tried to patch it in order to support that, but that never works well. OK, but there is an even worse problem for Nim, and that is that the old Nim had thread-local heaps. Here I modified the example and said: look, just spawn this countWords operation and access the table, with a lock, inside it.
So just share the count table. And yes, even though it is not a concurrent hash table, that doesn't matter, I'm going to use a lock.
It will be fine. Except the problem is that it is not fine. It is fine in Java and C#; it is not fine in Nim, or rather it wasn't, because inside countWords a string would be allocated. The string would be local to the heap of the thread that runs it, and then it would be stored into the table, and the table is owned by the main thread instead.
That would lead to corrupted memory. However, it wasn't that bad, because the compiler actually prevented you from doing that at compile time.
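(A sketch of that rejected pattern, with illustrative names. With thread-local heaps the compiler refuses to spawn countWords because it is not GC-safe, which is exactly the point being made.)

```nim
import std/[tables, strutils, locks, threadpool]

var
  tabLock: Lock
  total: CountTable[string]      # lives on the main thread's heap

proc countWords(filename: string) =
  # allocates strings on the worker thread's heap...
  let words = readFile(filename).splitWhitespace()
  withLock tabLock:
    for w in words:
      total.inc w                # ...and stores them into a foreign heap

initLock(tabLock)
# rejected: countWords touches a global containing GC'd memory,
# so it is not GC-safe and cannot run on the thread pool
spawn countWords("a.txt")
```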
And here is how it did it. Nim has an effect system which is orthogonal to its type system: every function in Nim not only has a return type, but also a list of effects. And thankfully these effects are inferred most of the time, so that you don't have to think about them.
There are a couple of different effects: we can track side effects, we can track what kind of exceptions are raised, and we can track tags. There is, for example, an ExecIOEffect tag which you could use in your game: if you want to ensure at compile time that certain subsystems of your game engine
do not use the scripting layer, you could use that mechanism to prevent it. The system also tracks locking levels, and the lock levels prevent you from introducing deadlocks. And finally, it tracks the GC safety property, which means that you
cannot accidentally produce a collection that consists of memory owned by different threads. We call a proc p gcsafe when it doesn't access any global variable that
contains GC'd memory, either directly or indirectly through a call to a GC-unsafe proc. To override the compiler's GC safety analysis, a cast pragma block can be used. Here I have an example: there is some global string,
and there is a thread-local string, and all I really want to do is initialize it, that is, copy the global string into the thread-local one. And here I need the GC safety; I don't need the annotation as such, but the proc needs to be GC-safe so that it can run on the thread pool.
But the compiler wouldn't let me, because I access this global here, which means the proc is not GC-safe. So what I need to do is cast this GC safety problem away. Then it compiles, and it actually works, because I used a deep copy here.
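(A minimal sketch of that pattern; it mirrors the shape described in the talk and the similar example in the Nim manual.)

```nim
var someGlobal = "some string here"
var perThread {.threadvar.}: string

proc setPerThread() =
  # accessing `someGlobal` makes this proc non-GC-safe, so we override the
  # analysis and take responsibility; the deepCopy is what makes it actually
  # safe with thread-local heaps
  {.cast(gcsafe).}:
    deepCopy(perThread, someGlobal)
```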
So when you have thread-local heaps, in order to pass information around, you need to perform these deep copies. In the new Nim, with the ARC/ORC mechanism, we finally get a shared heap, so these deep copies do not have to be performed anymore.
The old Nim also had some ideas about how to avoid data races when working with locks. Usually you could just use the flow vars, but for low-level code it is sometimes unavoidable to use guards and locks.
A guard annotation can be applied to a global variable, and then the semantics are that every access to this variable needs to happen inside a lock environment. So let's have a look: here I have this gdata variable and it is protected by this lock.
That is what the guard says: gdata is guarded. The compiler then tells you: you cannot simply use gdata here, you need to use it inside a lock section.
With that, it compiles. However, this is very ugly syntax that you should never write directly, only inside a template. The thing is, the locks block is simply an annotation that makes the compiler happy; it has no runtime cost.
It is called "locks" as a pragma, but it doesn't actually lock; it does nothing at runtime. What you are supposed to do is write your own lock template, or use one from a library, where the actual acquire and release happen.
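(A sketch of the guard plus lock-template pattern, modeled on the guards-and-locks examples in Nim's experimental manual; the template name is arbitrary.)

```nim
import std/locks

var
  glock: Lock
  gdata {.guard: glock.}: int    # every access must sit inside a locks section

template lockedAccess(a: Lock; body: untyped) =
  # acquire/release do the real work at runtime; the {.locks.} block only
  # tells the compiler that variables guarded by `a` may be accessed in `body`
  acquire(a)
  {.locks: [a].}:
    try:
      body
    finally:
      release(a)

initLock(glock)
lockedAccess(glock):
  gdata = 42     # fine: we are inside a locks section for glock
# gdata = 0      # rejected by the compiler: gdata is guarded by glock
```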
The compiler doesn't understand acquire and release; it only looks for this annotation to ensure that inside the body you may access what is guarded by the a here. And this distinction between what the compiler checks and what the runtime does is very handy when you have lock-free data structures,
because you might as well say: hey, I have an atomic counter and it is guarded by this dummy lock here, and it's not a lock, it's just of type int, and it exists only at compile time. So this variable doesn't even exist at runtime.
And I have an atomic read instruction for it: the locks section is there to make the compiler happy, and at runtime all I need to ensure is a read barrier before I access the x. This all predated C++'s memory model, by the way.
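(A sketch of that dummy-lock idea; the barrier itself is left as a comment because the concrete instruction depends on the platform and memory model.)

```nim
var dummyLock {.compileTime.}: int          # exists only at compile time
var atomicCounter {.guard: dummyLock.}: int

template atomicRead(x): untyped =
  {.locks: [dummyLock].}:
    # a real implementation would issue the required read barrier or use an
    # atomic load here; nothing is locked at runtime
    x

echo atomicRead(atomicCounter)
```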
So there is some work to do here to bring this into the new century. OK. And finally, this locks pragma doesn't just support a single argument, but actually a couple of arguments.
You can use that to say: actually, I have a multi-lock, it takes two of these, and in order to prevent deadlocks from happening I impose an ordering at runtime. To impose this order, I simply compare the machine addresses of the two pointers.
Since this order is guaranteed, there cannot be a deadlock here. And again, the locks pragma notation tells the compiler what this code actually does.
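(A sketch of such a multi-lock template, using std/locks; the original slide may have used raw mutex calls instead.)

```nim
import std/locks

template multilock(a, b: ptr Lock; body: untyped) =
  # always take the two locks in address order, so two threads grabbing the
  # same pair can never deadlock; the {.locks.} block then tells the compiler
  # that both guarded regions may be accessed inside `body`
  if cast[uint](a) < cast[uint](b):
    acquire(a[]); acquire(b[])
  else:
    acquire(b[]); acquire(a[])
  {.locks: [a, b].}:
    try:
      body
    finally:
      release(a[]); release(b[])
```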
OK. As I said, this was the past. We now have ARC and ORC memory management, based on highly optimized, highly tuned reference counting. So we now offer a shared heap, and we give you a new channel implementation that doesn't perform these deep copies anymore;
it can simply move data around. Here is a simple example. This uses the threading/channels library; it is available as a Nimble package, but it is officially supported. The example doesn't really do anything: I have a receiver thread,
it receives strings; maybe it gets the stop token, which is the empty string, and then it stops working, otherwise it says: hey, I'm thread whatever, and this is the message I've received.
Then there is some thread creation code: I have two receiving threads here, and the main thread sends the tasks as strings. At the end I need to send two stop tokens, so that every single thread gets its own stop token.
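(A condensed sketch of such an example; it assumes the threading package from Nimble and a build with --threads:on --mm:orc. The task strings are placeholders.)

```nim
import threading/channels

var chan = newChan[string]()

proc receiver(id: int) {.thread.} =
  while true:
    let msg = chan.recv()
    if msg.len == 0: break               # the empty string is the stop token
    echo "hey, I'm thread ", id, " and I received: ", msg

var threads: array[2, Thread[int]]
for i in 0 ..< threads.len:
  createThread(threads[i], receiver, i)

for i in 1 .. 4:
  chan.send("task number " & $i)         # work items are moved, not deep-copied

for _ in 0 ..< threads.len:              # one stop token per receiver thread
  chan.send("")

joinThreads(threads)
```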
And then I can join the threads. Here is a better example, which is more code: a web crawler. I have a download thread that takes a URL; unless it is the stop token,
it downloads the URL and parses the content as HTML. Then I iterate over the HTML, finding all the links, and each link is sent to a response channel for a different thread to pick up.
That is the collectLinks thread, and this collectLinks thread is the only thread that may access the global hash set uniqueLinks. And again, we need to tell the compiler that this is really acceptable, even though uniqueLinks is a global variable.
But this is the only thing I need to do; I don't need locks, because collectLinks is the only thread that is allowed to access uniqueLinks.
And then finally, I start three threads: two download threads and one collectLinks thread. I send some tasks, like "crawl these two URLs", then I tell the download threads to stop, and I wait for them.
At that point I know the downloads have finished, and then I can send the stop token on the response channel, which stops the collectLinks thread, and now I can join that third thread, the collectLinks thread.
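(A condensed, illustrative version of such a crawler; the URLs are placeholders, and the build needs --threads:on --mm:orc, the threading package and -d:ssl for https.)

```nim
import threading/channels
import std/[httpclient, htmlparser, xmltree, sets]

var
  urls = newChan[string]()               # work queue for the download threads
  responses = newChan[string]()          # links found, sent to the collector
  uniqueLinks = initHashSet[string]()    # only the collector thread touches this

proc download() {.thread.} =
  let client = newHttpClient()
  while true:
    let url = urls.recv()
    if url.len == 0: break                     # stop token
    let html = parseHtml(client.getContent(url))
    for a in html.findAll("a"):
      responses.send(a.attr("href"))

proc collectLinks() {.thread.} =
  while true:
    let link = responses.recv()
    if link.len == 0: break                    # stop token
    {.cast(gcsafe).}:                          # our promise: only this thread
      uniqueLinks.incl link                    # ever accesses uniqueLinks

proc main() =
  var downloaders: array[2, Thread[void]]
  var collector: Thread[void]
  for t in downloaders.mitems: createThread(t, download)
  createThread(collector, collectLinks)

  urls.send("https://example.org")             # placeholder URLs
  urls.send("https://example.com")
  for _ in 0 ..< downloaders.len:              # stop the downloaders...
    urls.send("")
  joinThreads(downloaders)                     # ...and wait for them
  responses.send("")                           # now stop the collector
  joinThread(collector)
  echo uniqueLinks                             # safe: no other thread is left

main()
```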
So now I know that no other thread can access uniqueLinks, and I can echo it. And this works. The nice thing here is that this is finally allowed: maybe sometimes you need a lock, if it is not just a single thread accessing the data, but you can access shared memory, and you can send data between threads by moving it instead of deep copying it.
Okay, and now the future. Now that we have a shared heap, we need a memory model. Right now we simply use C++'s memory model, which is fine, but it is quite conservative: in C++ basically every memory location might be shared between threads, and then the
compiler must prove that this cannot happen, for instance because we never take the address of that memory, or something like that. We could have a different rule instead: only ref T heap pointers can be shared between threads, and you cannot just take the address of a stack location and pass it to a different thread; we could prevent that at compile time,
which might allow the optimizer to do a better job. I'm not sure; it needs to be researched. Now that we have a much better channel API, note that a channel is also a flow var, by the way:
you can use the channel as a flow var by storing one element, and only one element, inside the channel. Maybe it is a good idea to wrap this in a FlowVar type to enforce that in the API; well, that is probably the best idea, so we should do that.
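(A tiny sketch of the idea of a channel used as a one-shot future; a real FlowVar wrapper would additionally enforce the single assignment.)

```nim
import threading/channels

# a channel with capacity 1, used as a one-shot future
var answer = newChan[int](elements = 1)

proc producer() {.thread.} =
  answer.send(6 * 7)          # "complete" the future exactly once

var t: Thread[void]
createThread(t, producer)
echo answer.recv()            # blocks until the value is available: 42
joinThread(t)
```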
Also, now that we have this channel implementation, the thread creation API really needs an overhaul. Or maybe we say: no, createThread is fine, it is a low-level thing, it doesn't need a nice API. But this spawn and FlowVar, these things that we did in the past, really
were the completely right ideas. We want a re-implementation, a better implementation, for ARC and ORC. And there is no need for the parallel statement, because you can also think about optimizing the FlowVar implementation so that it has less overhead.
And detecting these disjoint-slices computations is actually not required when you have a good FlowVar implementation, because the disjointness property is then basically checked at runtime: you would get an exception
when you try to store into a FlowVar more than once, because your flow vars were not disjoint. I think that is good enough, so we could remove the parallel statement and all the complex logic around it and focus on spawn and FlowVar.
FlowVars also combine very well with async programming and the async event loop. So this is what we will do in the future. Thanks for listening; I'm now available for questions.
I think. So, do I have to switch channels or no?
No. We are live now, I think. I think so. Yeah, I have the talk open in another tab so I will see it... okay, I see us now. Okay, everyone, please ask your questions.
We have some early questions, so we can probably discuss the first question that we got, which is: how does Nim compare to Go? Well, these questions... It is completely different in my opinion: the syntax is different, the type system is different.
The interoperability story with C and C++ is different. The concurrency model is different as we have, hopefully, got across with this talk a little bit. You can say that Go is ahead in the concurrency topic, but we are catching up quickly.
So, yeah, it is never too late to bet on Nim. But what I really don't like about these questions is the focus
on this one other language. You could compare Nim to Elixir, for instance, and it would be an interesting question, or compare it to Swift or C#; there are so many programming languages, what is so special about Go? I think people compare Nim and Go so much because superficially they target the same things: it is native code.
It's focused on procedural programming idioms, rather than, say, class based OOP-ish things or functional programming things.
So I think that is why. But there are so many programming languages out there; so please, go ahead, compare Nim to Go, but compare it to others as well. The programming landscape is really vast.
Okay, thank you Andreas for that. I'm just looking at the other questions; I think we've got one from Zigoi: are there any general guidelines for which mechanisms you should use as a go-to when you want to add concurrency or parallelism to your code?
Um, yeah, I think, once we are there in the future, you really want to say spawn, like let someFlowVar = spawn someCall(), and that is how you should do it.
And you need to watch out that what you do in this function call doesn't violate the constraints of your system. One constraint is that your function must not access global memory in an uncontrolled way.
So if you have a function you call via spawn and the function accesses global data, global variables, it really should have a lock around these accesses. But when you consider that, the crucial idea of spawn is
really that you can return something from this function with this future or flow-var based mechanism, because then you know when the function that you run concurrently has really completed.
Right. So there you can compare this to, what is it called, structured concurrency: there is some kind of supervisor object that wants you to ensure that
you eventually join all your tasks. So it is basically also fork and join; that is the keyword here.
And when you compare this to the structured concurrency mechanism, it is really the same thing: you call a function, eventually it gives you something back, and when it does, you know the task has finished. So the structure is preserved; it remains, how to say, a very
easy to understand, it remains easy to understand once you do that. And now you can of course again compare it to Go: you have the go keyword and you run something in there. What does the go keyword return? Nothing. So you then need these
wait groups, I think they are called in Go, and you need to pass them via the closure to the function that you call inside the go statement. And once you do that, it is harder for the compiler and for the programmer to
understand what was really meant. You say go f, but it never comes back to you. So that is very, very unstructured. So yes, this spawn with flow vars is the go-to mechanism for concurrency
in 2022, hopefully. Nice. I have asked people for more questions; I see people are typing.
Fair enough. No worries. Well, we only have one minute left anyway. Oh, Pietro Peter Longo asks: what are your thoughts on CPS?
Yeah, that is a good question, and it is related to this. When you spawn a function, maybe you want this function to be resumable: you produce a future, but not just one future, rather a kind of stream of futures.
So the function that you are about to spawn should really be resumable, and that is where CPS comes into the picture, helping us transform a function into a resumable function. That is how this relationship should work out, in my opinion. And yes, we have something like that already, in the form
of closure iterators, but I think the problem currently is that they are too tied to the single-threaded model of our existing
async model. And when we start with CPS from scratch, maybe we will get something better.