
ParallelFx, bringing Mono applications in the multicore era


Formal Metadata

Title
ParallelFx, bringing Mono applications in the multicore era
Number of Parts
97
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Multicore computers are now part of our everyday life. Most desktop and laptop machines out there bundle a dual-core, quad-core or even 8-core processor by default. Multiplying the number of cores on the same chip has become the way for manufacturers to remain competitive. However, developers were a bit left out of this process: having written sequential programs for ages, they are now required to parallelize their programs to make them efficient, which isn't an easy step to take. That's why we now see the appearance of frameworks designed to help programmers take advantage of this new processor architecture by hiding the difficulty of parallelism under primitives they are used to. ParallelFx is one such framework for the Mono and .NET world. By providing several new parallel constructs and concurrent data structures, it allows Mono applications to enter this new multicore era painlessly. This talk will cover several points: * What options Mono provides to speed up applications today * A bit of background on the framework * The components ParallelFx provides * Some how-tos and usage of the framework * Status of ParallelFx in Mono
Transcript: English (auto-generated)
So, hello everybody. So, as Ruben said, sorry again, it's again a talk about improving the performance of applications in the Mono world. So, actually, that's a good thing. My name is Jérémie Laval, and I implemented ParallelFx for Mono. I would like to say that this work was part of Google Summer of Code, so first of all I would like to thank Google and Mono for allowing me to work on such a cool project with such cool people.
So, in my talk, I'm going to first talk about why we actually have to bother with all this multi-threading stuff. Then I'm going to talk about how we can use that to improve performance in a whole range of applications. And then I will focus on ParallelFx itself, showing what you can use in ParallelFx: the tasks, the scheduler, PLINQ, which is a very nice part of ParallelFx. And finally, what the state of things actually is in Mono, so how much of it is implemented. So, first of all, why are we bothering with all that parallelization stuff? The thing is that everyone loves a single-threaded application, because we have been taught to program like this. Pretty much every language works on a single thread when you code. But, as a wise man said, the ideal number of threads you should use is one. I agree with that, but we are going to see that there is a special twist to it. And, again, we are bothering with parallelization today because the free lunch is over. I will explain what exactly the free lunch is, but basically the thing is that we need to change the way we code. We have to throw away a bit of our serial-programming point of view.
So, what exactly is the free lunch? From about the 70s to 2005, chip manufacturers were basically fighting over the clock speed of their CPUs. That was really cool, because a programmer could write a program, say, in 1995, and it would run pretty well, and five years later, by Moore's Law, the clock speed of the CPU would have doubled, so the application would be about twice as fast as the original. That was pretty cool because you had to do pretty much nothing, and as time went on, your application just became faster. Well, of course, it's not paradise, because at some point you hit the physical limits of the chip: you can't put in too many transistors. We arrived there around 2005, and you can actually see it on the curve: from about the 70s to pretty much 2005 we have a pretty much straight line, the clock speed increasing each year, but we have reached a point where we can't really accelerate the CPU anymore. So manufacturers had to find another way to remain competitive in the field of CPU manufacturing. The bottom line is that right now we can't count on that free lunch anymore, and we have to change our ways. The clock speed war is a no-go. So, the idea the Intel folks and the AMD folks had was: well, we can't scale vertically anymore, so let's just add more cores and scale horizontally. They decided to do that first by putting two cores in one CPU. So, at first we had, sorry, I'm using Intel names, but it would be the same for AMD, we had the Pentium line, which was only one core. Then they introduced the Core Duo, which was the first dual-core processor. Then they moved on to quad-core processors, and now, recently, they have released the i7, which is an 8-ish core: it's really 4 cores, but with hyper-threading, so for the system it's like there are 8 cores. And now they are even designing prototypes which have 80 cores. The thing is that we can't take advantage of all that stuff the way we used to with clock speed, because your application isn't going to get faster as the number of cores increases. You have to do something in your application: you have to break up the work into separate tasks, so that each of the cores in your processor is doing something. So, the solution to
remain competitive in software as well is to break up work and share it among cores. But, as you can guess, it's not really that easy. As you can see, even a cat has problems doing multi-threaded code. So, yeah, the truth is that parallelization is a difficult thing. It's hard. As Alan said, you have to manage your threads, you have to manage your locks, you have to be careful: can that part of the code be called by multiple threads? Am I taking the locks in the right way? Am I releasing them in the right way? It's tedious to do. I suppose everyone who has ever done multi-threaded programming at some point had to debug deadlocks, and I'm pretty sure we agree that it's not a really nice business. And finally, it's also most of the time inefficient to manage the threads yourself, because the question is: how many threads should you use? If you use too few threads, say you have a four-core machine but you only use two threads, you aren't going to use all your cores, so it's under-efficient. But if you use too many, say four threads on a dual-core machine, the operating system can only run two threads at a time, so at some point it's going to switch which thread is executing, and that creates a lot of context switching. If you use too many threads, the cost of context switching actually costs more than the gain from running in parallel.
And, again, that brings up this point: what if the number of cores changes? You could develop your application on a really big box with eight cores, but if you ship it to a guy who only has a small dual-core, or even a single core, those still exist, it's going to perform very badly. So, that's getting complicated, and we need some key principles here. We need something different, something that automatically regulates thread usage at runtime. Basically, just like we went from compiled code to interpreted code, or JIT compilation, because we want to be able to optimize the code at runtime for a specific machine, we have to find something analogous to that for threaded programming. The second point is that it should be as straightforward as possible, because, as I said, the way we are taught to program is really sequential. So the shift from sequential programming to parallel programming should be as painless as possible if we want people to adopt this kind of programming. It should either mimic familiar constructs, that is to say, in ParallelFx,
for example, we have the parallel loops, which act like any for loop you have been using for ages. Or it should reuse existing code with only a slight modification, like just one added method call, for example. So, enter ParallelFx. ParallelFx was developed by Microsoft two or three years ago. It was at first supposed to be bundled as a separate library, but it actually seems it's going to be part of the whole .NET 4 platform, and ParallelFx is integrated in corlib, the very heart of .NET. So, an overview of ParallelFx. Basically, you have this part, which is the actual primitives you can use to do parallel programming inside your application. And the smaller, really lower-level part is the scheduler, which is actually a really nice piece of technology; I'm going to talk about it just after. Then you have three building blocks. The low-level building block is the task: a task is used basically like you would use a normal thread. Futures are just built upon that, and I'm going to talk about them later. Then we have parallel loops. Parallel LINQ, well, I hope everyone is familiar with LINQ; anyway, if you have seen Miguel's talk you should be familiar with LINQ, so I will just assume that you've seen Miguel's talk, is a parallelized version of LINQ. And then, of course, we can't do all this without any helper classes, so we have a whole new set of collections that are thread-safe, and also a whole new set of coordination data structures: locks, barriers, and all that sort of stuff, anything you could use in a multi-threaded environment. But, again, they are really specific, because, for example,
the concurrent collections try not to use locks, because locking is really inefficient. They try to be lock-free, what we call lock-free code. So, let's speak a bit about the scheduler, which is the heart of the whole ParallelFx library. The scheduler is a really nice piece of technology because, instead of creating a thread for each task, as you might expect, it builds on what was said earlier with Alan: the number of threads you should have in your application should be equal to the number of cores the computer has. This is really true. ParallelFx builds upon that, and it always manages a number of threads equal to the number of cores you have on your computer. But, of course, it should also be able to handle, say, 1,000 tasks at the same time. So you have a trade-off: you have to run 1,000 tasks on one hand, but you also have to keep only, say, two threads. So the scheduler is built on the work-stealing principle. Basically, a Mono application feeds the scheduler some tasks, which first go into a shared work pool. Then the thread workers, a thread worker is simply a worker which wraps an OS thread, a System.Threading.Thread, take up that work. And really the cool part is this: let's say you feed 100 tasks to the scheduler and you have two thread workers, so each thread worker processes, say, 50 tasks. The thing is that most of the time tasks do not take the same time to execute; that's the whole point. So if the thread workers just took some work and executed what they had, you would end up with one thread worker which, at some point, runs out of tasks. You would have one thread worker which continues executing, and one thread which is wasted because it does nothing. The cool thing is that, in ParallelFx, when a thread worker has no more work to do, it will try to steal work from other thread workers. That's the stealing part. That way, every thread, at any time, should still be processing something, and you get the maximum performance out of your code.
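To make the idea concrete, here is a purely conceptual sketch of a work-stealing loop. This is not the actual Mono/ParallelFx scheduler code; the names (ThreadWorker, sharedPool, localWork) and the exact stealing strategy are made up for illustration only.

```csharp
// Conceptual sketch of work stealing: prefer local work, then the shared pool,
// and only steal from other workers as a last resort. Not the real scheduler.
using System;
using System.Collections.Concurrent;
using System.Threading;

class ThreadWorker
{
    // Work local to this worker: no locking needed on the common path.
    readonly ConcurrentStack<Action> localWork = new ConcurrentStack<Action> ();

    public void Run (ConcurrentQueue<Action> sharedPool, ThreadWorker[] others)
    {
        Action task;
        while (true) {
            if (localWork.TryPop (out task))            // 1. prefer local work
                task ();
            else if (sharedPool.TryDequeue (out task))  // 2. then the shared pool
                task ();
            else if (TrySteal (others, out task))       // 3. finally steal from others
                task ();
            else
                Thread.Yield ();                        // nothing to do right now
        }
    }

    bool TrySteal (ThreadWorker[] others, out Action task)
    {
        foreach (var other in others)
            if (other != this && other.localWork.TryPop (out task))
                return true;
        task = null;
        return false;
    }
}
```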
Q: What's the reasoning behind keeping a local work pool? Couldn't each thread worker just pull one job at a time from the shared pool? So, as I said, you could do your work with only a shared work pool, but a shared work pool is basically a normal collection with a lock. Q: So, that's taking a lock, then? Yes, and that's a bottleneck, because, as I said, ParallelFx should be able to handle, say, 1,000 tasks at a time. So if you go one at a time, acquire your lock, release the lock, you are going to kill your performance. What we must ensure is maximum locality: a thread worker, in normal operation, shouldn't be doing any locking, any interlocked method, any synchronization stuff. So each thread worker should first only work on what is local to it. Only the stealing process actually does some locking, or rather it's not locking, it's using interlocked CAS, compare-and-swap, if anyone knows what CAS stands for. So that scheduler is really the heart of ParallelFx. And it's so efficient that the thread pool in .NET 4 is based on that scheduler. Before .NET 4, if you fed a task to the thread pool, it would create a thread for each task, and that was pretty inefficient, because if you got a lot, lot, lot of tasks, you would just blow up your machine. Now, in .NET 4, when you queue up a task with the thread pool, it actually creates a ParallelFx task and just feeds it to the normal scheduler. That's not the case at the moment in Mono, but we are working on it, we are trying to integrate that, so hopefully, in some time, we should also have in Mono a thread pool which works on that scheduler. Now, let's speak about each way you could actually parallelize
your things. First, tasks. A task you should use like any normal thread: basically, a task is a small unit of work in your application. This is the traditional way to use a task. You may create what is called a cancellation token source, which basically underlies all the cancellation stuff in ParallelFx. In ParallelFx, cancellation is at the heart of the framework, because every operation, be it a task, a wait, any kind of operation, can be cancelled at some point in time. So instead of having, say, a Cancel method on the task, they made a more generic framework built around the cancellation token source. You always have a source of cancellation, and then you pass around a token, and you can pass the same token to multiple method calls. That way, when you actually call Cancel on the source, it is able to cancel multiple pieces of work at a time. And you can chain together multiple cancellation token sources, so you can handle really pretty neat stuff around cancellation. So, basically, a task is created pretty much the same way you would use the thread pool: you just call a factory object with StartNew, you fill it with a function, and it will try to execute that on a separate thread. The method call is non-blocking, so you just continue your execution like a normal thread. Also available is the whole continuation philosophy. A continuation is like a callback: when you finish this task, I want you to run that thing too. And again, it's really just a task, so it can be scheduled on another thread, it can even be scheduled on a completely different thread. And of course you also have methods to wait on a task.
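To make that concrete, here is a minimal sketch using the final .NET 4 names (Task.Factory.StartNew, CancellationTokenSource, ContinueWith); the Mono 2.6 release mentioned later shipped the Beta 1 API shape, so details may differ slightly, and the work inside the lambdas is made up.

```csharp
// Minimal sketch of tasks, cancellation and continuations with the .NET 4 API.
using System;
using System.Threading;
using System.Threading.Tasks;

class TaskSample
{
    static void Main ()
    {
        var cts = new CancellationTokenSource ();

        // Create a task through the factory; the call returns immediately.
        Task work = Task.Factory.StartNew (() => {
            for (int i = 0; i < 100; i++) {
                cts.Token.ThrowIfCancellationRequested ();
                Console.WriteLine ("working... {0}", i);
            }
        }, cts.Token);

        // A continuation: just another task, run when the first one finishes.
        Task followUp = work.ContinueWith (t => Console.WriteLine ("done, status: {0}", t.Status));

        // The same token could be passed to several operations;
        // cts.Cancel () would then cancel all of them at once.

        followUp.Wait ();   // block until the continuation has run
    }
}
```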
Then you have the future, which is a really interesting way to program. The future is actually a shift of mind, because a future should be used when you have some sort of calculation going on. Instead of actually calculating the result of your computation, you say to your code: okay, I'm doing the computation, the thing is I don't have the result yet, so instead of giving you an integer, I'm going to give you a Task&lt;int&gt;, which basically says: I'm computing the value, I don't have it yet, but just use what I give you and I promise that, at some point, you will have a value. And this is very neat stuff, because of course the future is like a task, so what I said about delayed execution applies: it gets executed on another thread. And the cool thing is that you can chain futures together, so you never actually process anything yourself, but at the end, when you need the value, it gets processed automatically on several threads. For example, here you have a tree of integers, and a method which tries to calculate the sum of the elements of the tree. The method is recursive, and you can see that the right part of the tree is done normally, sequentially, but the left part of the tree is computed with a future. So all the left part of your tree will be chaining up futures together, and you call Value later. Value is a blocking call, but by the time you are finished calculating the right side of your tree, the left side should already have been computed, because the future is executed on another thread. And the cool thing is that you have a relationship between the futures, because one future depends on another future, which depends on another future, et cetera. That's handled by the runtime, so that when you request the value of the topmost future, it just walks all the way down and gets the results from the lower-level futures.
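A rough reconstruction of that kind of tree-sum example, as a hedged sketch: the Tree type and the method shape are my own, not the slide's code, and I use Task&lt;int&gt; with .Result as in the final .NET 4 API (the CTP-era API the slides may show exposed a Value property on futures instead).

```csharp
// Hedged sketch of summing a binary tree with futures: the right subtree is
// computed synchronously while the left subtree is computed as a Task<int>
// (a future) on another thread.
using System.Threading.Tasks;

class Tree
{
    public int Value;
    public Tree Left, Right;
}

static class TreeSum
{
    public static int Sum (Tree node)
    {
        if (node == null)
            return 0;

        // Future for the left part: starts computing right away on the scheduler.
        Task<int> left = Task.Factory.StartNew (() => Sum (node.Left));

        // The right part is computed sequentially on the current thread.
        int right = Sum (node.Right);

        // .Result blocks, but by now the left side has usually been computed already.
        return node.Value + left.Result + right;
    }
}
```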
So, you do have to change your code to make use of futures, but it's a really nice way to program. Then you have one construct which is used a lot, which is the parallel for. Again, when I said earlier that ParallelFx should be able to mimic existing constructs, this is one of those constructs: everybody has written a for loop, and a for loop is something you can really easily parallelize. If you have a loop which goes from zero to, I don't know, 1,000, and you have ten threads, you just say thread one does the work from 0 to 100, the next thread does the work from 101 to 200, et cetera. So it's very easy to parallelize, and parallel for does just that: it tries to partition the range of values into smaller ranges that each task can process individually. And the really cool thing about parallel for is that, as you can see, it's almost the same as a traditional for loop. You just say Parallel.For, where you want to start, where you want to end, and what you want to execute with the value. That's all. Behind the scenes, it's going to partition the stuff and parallelize your stuff, but you don't know about it. That's actually where I agree with Alan saying that developers should really use one thread: the twist is that developers should think they are using one thread, and this is one of the ways to ensure that. The programmer doesn't care; he just writes that up and thinks it's a normal for loop, but behind the scenes, without him knowing, it's doing parallel stuff and it's actually improving his application. So, demo time. Parallel for is really useful for image processing, for example, because when you do pixel-based processing you just iterate over the whole range of pixels you have in your picture. So an easy way to parallelize that sort of stuff is to use parallel for on the outermost for loop you have in your program. So, actually, sorry.
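For reference, here is a sketch of the kind of two-line change being described: the outermost pixel loop turned into a Parallel.For. This is not the demo's actual source; ProcessPixel and the image dimensions are placeholders.

```csharp
// Sketch of parallelizing the outermost loop of a pixel-based routine.
using System.Threading.Tasks;

class PixelLoop
{
    const int Width = 1024, Height = 768;

    static void ProcessPixel (int x, int y) { /* per-pixel work goes here */ }

    static void Sequential ()
    {
        for (int y = 0; y < Height; y++)
            for (int x = 0; x < Width; x++)
                ProcessPixel (x, y);
    }

    static void Parallelized ()
    {
        // Only the outer loop changes: the runtime partitions [0, Height)
        // into ranges and runs them as tasks on the scheduler.
        Parallel.For (0, Height, y => {
            for (int x = 0; x < Width; x++)
                ProcessPixel (x, y);
        });
    }
}
```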
Anyway. So, here I have a little piece of code, which is basically a fractal drawer. That's all the code, simple stuff. The only difference between the parallel version and the sequential version is that the for loop you have here in the sequential version is just changed to a parallel for. So the only difference is modifying this line and this line: a two-line change, and, since I have a dual-core processor here, you get twice as much performance as the sequential version. Don't take my word for it, I'm going to show you. So, here, I'm going to run the sequential version. Of course, it's a heavy workload; parallelization is very good when you have a very large data set, or a method which takes a lot of time, and I have such a method here, for example. So, if everything goes well, here you go: about 26 seconds to actually produce that stuff. Of course, you can criticize my art skills, I'm actually not an artist. This is a Mandelbrot fractal, but it's just for the sake of having a heavy calculation. So, as you have seen, processing that took around 30 seconds. Now, let's run it in parallel. Again, it's a two-line change; now I'm using a parallel for loop. Still taking some time. Here you go: 15 seconds. Two lines changed, and you have a program that is roughly 10 or more seconds faster, with exactly the same output. So, it's going to take some time
to go back to where I was. Sorry for this. [Audience] It sounds like there wasn't much coordination between the two talks, yours and Alan's. Yeah.
Maybe a little. Maybe for the cat. That way, if you add an MS to anything, you can just see everything back.
A speedier presentation. That was around here. So, here I spoke about the lower-level stuff of ParallelFx, which is the tasks, the futures, and the parallel loops. I showed parallel for, but you also have parallel foreach, so you can also parallelize any foreach loop, which most of the time iterates over a collection or a LINQ query. And speaking about LINQ, you also have a very neat thing called Parallel LINQ, or PLINQ. Parallel LINQ basically takes up your LINQ query. I hope everyone is familiar with LINQ, I hope. A LINQ query is where you put up some operators and say: I have a data source somewhere and I want to apply some operations on it. What Parallel LINQ does is take the query, look into it to see which parts could be parallelized, because it's running slowly, or there's a big data set, that sort of thing, and actually try to parallelize it. The whole point of Parallel LINQ is that, if I remove that part, it's standard LINQ, and the only modification I did to parallelize my query was adding this one line. So, on any LINQ query, either you add this small operator, which is AsParallel, or you use the ParallelEnumerable class instead of the Enumerable class, for example to create a range or repeat enumerable.
That's all. One line change, or even less than one line change, and you have your parallelized query. So, again, demo time. A very famous example: if you follow the Microsoft guys, you have probably stumbled on a guy called Luke Hoban.
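Before the demo, here is a minimal sketch of what that one-line change looks like, using a made-up prime-counting query rather than the demo's ray tracer; IsPrime is a placeholder predicate.

```csharp
// Minimal PLINQ sketch: the only difference between the two queries below is
// the AsParallel() call (ParallelEnumerable.Range would work as well).
using System;
using System.Linq;

class PlinqSample
{
    static bool IsPrime (int n)
    {
        if (n < 2) return false;
        for (int i = 2; i * i <= n; i++)
            if (n % i == 0) return false;
        return true;
    }

    static void Main ()
    {
        // Standard LINQ, runs sequentially.
        int sequential = Enumerable.Range (1, 1000000).Count (IsPrime);

        // Same query parallelized with one extra operator.
        int parallel = Enumerable.Range (1, 1000000).AsParallel ().Count (IsPrime);

        Console.WriteLine ("{0} primes / {1} primes", sequential, parallel);
    }
}
```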
And basically, he likes to do crazy stuff, and the latest crazy stuff he has done is this. I showed you a LINQ query, but it was very small. This is a whole ray tracer application written only in LINQ, in one big LINQ query. All of that stuff, and it's not even finished here, you can scroll down, all of that is one LINQ query which does ray tracing. That's what I said: when I said that C# was now going functional, that's what I meant. One of the things he's been saying is that the idea is no longer to have side effects, that you don't mutate global state; everything is local, so that you can parallelize these things. This is one of the things that you can do. Yeah, of course. I should say that functional-style programming really lends itself to parallelization, because when you parallelize stuff, you really don't want to have any side effects or shared state. When I said that the programmer basically has to change one line, well, it's not totally true: you also have to be careful about what you're actually doing in your function, because if you are doing any work on shared state or a traditional collection, for example, you have to protect it with a lock or something. But to moderate that, as I showed you, we have concurrent collections which are very efficient in a multi-threaded environment. Instead of using your traditional collections, just use those and you will have no problem.
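As a small illustration of that point, here is a hedged sketch using one of the .NET 4 concurrent collections instead of a locked Dictionary; the "expensive computation" is a stand-in for real work.

```csharp
// Several loop iterations may write results at the same time, but with a
// concurrent collection no explicit lock is needed.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ConcurrentCollectionSample
{
    static void Main ()
    {
        var results = new ConcurrentDictionary<int, double> ();

        Parallel.For (0, 1000, i => {
            double value = Math.Sqrt (i) * 42.0;   // stand-in for an expensive computation
            results.TryAdd (i, value);
        });

        Console.WriteLine ("computed {0} results", results.Count);
    }
}
```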
So, again, you have this enormous query, and what it takes to parallelize it is basically just changing Enumerable to ParallelEnumerable, which is a one, two, three, four, five, six, seven, eight character change. Actually, the PLINQ stuff is not really finished, I'm working on it; it should hopefully be ready by Mono 2.8. But anyway, I'm going to show you what it looks like at the moment.
So, here I can select whether I want to process the thing in a sequential manner or a parallel manner. I will first do it in a sequential way. So, it's all black at first. Okay, it's coming. Notice how the thing is processed: it's basically like a for loop which, for each pixel, calculates what the color should be. As you can see, it's pretty much going in a linear fashion, and that takes some time, a lot of time. The thing is that in that part you have all the reflections going on, which is a lot of recursive function calls, so that's the hard part of the picture. Well, I'm going to stop it here; it's really just for you to see how ParallelFx, or PLINQ, works. So now, if I select the parallel version and render it, what you are going to see after some time is that the image is not being processed as a whole. We have some black lines here, which show that there are actually two threads doing the work. What PLINQ does is take the whole query, see that it can partition the range of pixels of the screen, and then fire tasks to process each part of the screen alternately. So this thing is twice as fast, again on my dual-core computer, because I have two cores; it's twice as fast as the other way, and you can actually see it is quite a bit faster than the other.
Again, it takes some time. So, yeah, let's speak about the state of things. ParallelFx, as I said, is quite an old project now, because I spent my second Summer of Code on it this summer, so it's actually a two-year project. The first version of Mono to include some ParallelFx bits was 2.6, which was released some months ago, I'm not really sure about the date anymore. It shipped with the .NET 4 Beta 1 API, and you get all the tasks, futures, parallel loops, concurrent collections and coordination data structures. So pretty much, I would say, 80% of the framework; the only part missing is the PLINQ stuff I'm still working on. And in Mono trunk, about a week ago, I completed the update: between .NET 4 Beta 1 and .NET 4 Beta 2 there were some changes in the API, and in Mono trunk the API is now fully Beta 2 compliant. At this point, between .NET 4 Beta 2 and .NET 4 Final there shouldn't be a lot of changes, so if you want to start using ParallelFx already, you're probably safe using the trunk version, the API of the trunk version. And hopefully, by Mono 2.8, we should have a fully .NET 4 compliant API with PLINQ. So, thanks for your attention, and if you have any questions, I'll be happy to answer.
Q: For every for loop in our program that doesn't have side effects, can we just sprinkle this on it? Or is there an overhead penalty that we should take care of? The thing is that parallel programming isn't magic. Sometimes the cost of scheduling the tasks is actually more expensive than the gain from parallelization. So I would say that you are safe if you are processing big chunks of data, like image processing, where there is a lot of it. For example, we were talking with Bertrand the other day: in Banshee you have an extension called Mirage, and what Mirage does is analyze all the songs you have in your music library and then compute the similarity between each of your songs. In this kind of application, a fairly standard desktop application, each song's processing could be represented by either a task or a future, I would say rather a future. So even if you have 10,000 songs in your library, you have 10,000 futures. You couldn't just create a thread per song, because creating 10,000 threads would probably blow up your machine. But if you use ParallelFx futures, you are only running on, say, two threads, and you are still processing your 10,000 songs. That's a neat usage of futures in a desktop application.
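A hedged sketch of that scenario, one future per song on a small number of worker threads; the Song type and AnalyzeSong are hypothetical placeholders, not Mirage's real API.

```csharp
// 10,000 songs means 10,000 futures, not 10,000 threads: the scheduler runs
// them on roughly one thread per core.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Song { public string Path; }

class SongAnalysis
{
    static double AnalyzeSong (Song s)
    {
        // stand-in for the expensive audio analysis
        return s.Path.Length * 0.42;
    }

    static void AnalyzeLibrary (List<Song> library)
    {
        List<Task<double>> futures = library
            .Select (song => Task.Factory.StartNew (() => AnalyzeSong (song)))
            .ToList ();

        // Later, when a result is needed, .Result blocks only if it isn't ready yet.
        foreach (var future in futures)
            Console.WriteLine (future.Result);
    }
}
```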
So, actually, don't take my word for it, see for yourself, try it in your application. The best way to see if it's a good fit is to try it in your application. As I hope I have shown you, it's really easy to plug parallel behavior into your existing application. So test it and report any bugs; that's important. Yep.
Yeah. So, actually, there are several heuristics in Parallel LINQ. Basically, if you feed it a collection which can be accessed in an indexed fashion, it's going to split the whole data range into small chunks that each task can access directly. That is the fastest way to use Parallel LINQ: if you are using it with an array or a list, or basically anything you can index into, it's going to be really fast, because we can access each element of the collection without any overhead. When we do this kind of partitioning there is no lock, there is no CAS, there is nothing; each task is cleanly separated in its own environment, you could say, and has its own work to do. But, again, if you use a standard enumerable, there you must have some locking, because you can't know what the user is actually doing in their enumerable. It could be another LINQ query, for example.
So maybe it's not thread-safe, and in that case we have to use, that's a shame, but we have to use locking. Q: You said it splits it into small chunks. How do you decide what is small? Well, the thing is that, at first, for example in the ray tracer I showed, a naive way to split up the data would be to have one task do the upper part of the image and the other task do the lower part of the image. But as you saw on the ray tracer example, the upper part is really fast because there are no reflections in it, while the lower part has more reflections, so it takes longer to process. So at some point you would have one task which runs out of pixels to process and dies. So the thing is that we have to make smaller chunks, so that, as you saw, it processes a chunk at a time, because in the end it will hopefully stabilize on the same amount of work between the two. The thing is that there is no precise way to know in advance how much time a given function is actually going to take.
So we have to make some guesses here, and it's not perfect. One thing that is worth noting is that the heuristics built into the system will take into account things like the load average: if your machine is overloaded, it will not spin up a lot of threads until the load comes down. So it will take those things into consideration. The second thing is that the final version of the API allows you to plug in your own heuristics. You know your data better, so you can give the scheduler a hint and say: well, I actually know that the beginning is going to be really hard, so at the beginning, hand out items one by one, but when you get to number 100, start giving chunks of 20, for example. So you can plug that into it if you want.
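For the curious, the final .NET 4 API exposes this partitioning hook through System.Collections.Concurrent.Partitioner; fully custom heuristics can subclass Partitioner&lt;T&gt;, while the sketch below only shows the built-in range partitioner with a fixed chunk size. The chunk size and the work are made up.

```csharp
// Sketch of controlling partitioning with the .NET 4 Partitioner API:
// Partitioner.Create splits the range into fixed-size chunks, and each chunk
// is handed to the loop body as a (fromInclusive, toExclusive) pair.
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PartitionerSample
{
    static void Main ()
    {
        // Split [0, 10000) into chunks of 20 indices each.
        var chunks = Partitioner.Create (0, 10000, 20);

        Parallel.ForEach (chunks, range => {
            for (int i = range.Item1; i < range.Item2; i++)
                DoWork (i);   // stand-in for the real per-item work
        });
    }

    static void DoWork (int i) { /* ... */ }
}
```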
And actually, as I said, PLINQ is not just about blindly executing the query. There is a step of checking what the query looks like, because, for example, you have some LINQ operators, like Take, which are only interested in the first part of the data set that is processed, so they don't care about the part at the end. In that kind of case, if the analysis sees that there is a Take operator in the query, it will just use chunk partitioning, but with one element at a time. That way Take will probably be happier with what it's seeking, and you won't waste time on what it isn't really interested in.