
Writing Shared Memory Parallel Programs in Ada


Formal Metadata

Title
Writing Shared Memory Parallel Programs in Ada
Subtitle
Multitasked Newton's Method for Power Series
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Tasks in Ada are effective in speeding up computations on multicore processors. In writing parallel programs we determine the granularity of the parallelism with respect to the memory management: we have to decide on the size of each job, the mapping of jobs to tasks, and the location of the input and output data for each job. A multitasked Newton's method will show the effectiveness of Ada in speeding up the computation of power series. This application belongs to the free and open source package PHCpack, a package to solve polynomial systems by polynomial homotopy continuation.
Transcript: English (auto-generated)
Our next speaker is Jan Verschelde. I don't know if I pronounce it the English way, but it's okay. He's coming from the USA, from the University of Illinois at Chicago.
Thank you very much. I would like to thank the organizers for having me here again. This is a talk on a mathematical application of Ada, in particular of multitasking.
So, I'm the only one in this session with a subtitle, and I'm glad that didn't scare you off. I'm glad you came. Here is my outline. The main point of my talk is that Ada works extremely well for writing shared memory parallel programs: you really get to do what you want to do. There is some mathematics in there; I will use the mathematics mainly to indicate how you can tune your application. Computers are getting faster and faster, but that is actually creating a problem for us. It is not that things don't go fast enough; making computers work efficiently is quite the challenge these days. So, here is the motivation. The picture is a bit misleading: what we are really always looking at are polynomial equations. We have two polynomials in three unknowns, and they define a space curve, this bent figure eight. You are positioned at the top, at the point (0, 0, 2), and that point satisfies both equations.
You know one point on the curve and you want to continue: you want to see what lies ahead. So you compute power series expansions. The red one is the trivial one: if you take only one term in the expansion, you think the curve is kind of a parabola. But as you take more and more terms in the power series expansion, already the next one is very interesting: you see the crossing point, and that is where interesting things start to happen. As you take more and more terms in your power series, that crossing point starts to approximate the real crossing point. Now, this is a misleading picture; we only see the equations, and the crossing point is typically where your power series will no longer converge.
But here they actually allow you to say something about the behavior of the curve. Now, the point of the talk is to compute this efficiently, and we run Newton's method all the time. It is essentially the Newton's method you have seen in high school, applied to power series: working symbolically, as if you were doing it by hand, you can compute the series term by term. What you need to do is evaluate the polynomials over and over again, you need all the partial derivatives, and then you solve a linear system. Those are the three things you have to do: evaluate, differentiate, and solve a linear system. And it converges quadratically, so Newton's method is a very promising method.
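In formulas (the notation here is mine, not taken from the slides), one Newton step on a polynomial system f(x) = 0, applied to a vector of truncated power series in t, reads:

```latex
\[
  J_f\bigl(x^{(k)}(t)\bigr)\,\Delta x(t) \;=\; -\,f\bigl(x^{(k)}(t)\bigr),
  \qquad
  x^{(k+1)}(t) \;=\; x^{(k)}(t) + \Delta x(t),
\]
```

where J_f is the matrix of all partial derivatives and every entry of x(t) and Delta x(t) is a power series truncated at a chosen degree. Each step thus needs exactly the three ingredients just listed, and the quadratic convergence means that the number of correct terms roughly doubles with every step.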
But there are a lot of computational difficulties. If you do this by hand, you quickly give up; even for one variable it can already be hard. And our curves live in any dimension, and the degrees grow. I should have pointed this out earlier: the example is a quartic, a curve of degree four, defined by two equations of degree two. With three quadratic equations in four variables the degree is eight; with four equations in five variables it is 16. By Bézout's theorem the degree is the product of the degrees of the equations, so every extra quadratic equation doubles the complexity of the problem, and with ten equations in eleven variables you may already have a curve of degree about a thousand. And even if you fix the dimension, say I will not go past ten, you still have the power series themselves, and they go on forever; there is no limit there. Some power series converge rapidly, and then we are lucky, but some converge very, very slowly, so we may not know beforehand how far we need to go.
That was actually the motivation for this study as well: we want something that runs fast and lets us compute very large degrees. And even if you say you will only compute eight terms, there is the multiprecision arithmetic that you may need: at high degrees the round-off starts to creep in, and double precision is unlikely to be sufficient. Those are the three motivations we have for using parallel computations. The code was developed on three different platforms; the laptop I have here with me is the middle one, roughly in between the other two configurations. First of all, the obvious point: you cannot get single core processors anymore, all processors are multicore. The other point I would like to make is that while it looks very appealing to have a big workstation like that sitting on your desk, keeping it fully busy is another matter. That workstation is also the machine serving our web server; perhaps I should have said that right at the beginning. We also run a web server, so that machine hosts the web interface to our software. It has 44 cores, and I get the best results with 88 threads.
But I will restrict myself to the middle configuration. And here is the Ada code, the Ada code that I have been using for a very long time now, and that other people have used as well. I saw a recent paper published on the lamination design of composite structures. If you want to know where polynomial systems occur in practical applications: if you have to design a robot arm, for example, the arm can take several possible configurations. You can already see this with my elbow: you want to reach a certain position, and you want to compute all possible angles of your arm so that you can reach that position, without twisting your elbow. The polynomial equations express that the lengths of your mechanism have to stay fixed. So the methods I am using are quite well known in mechanical design. And my users typically simply download the binary version.
If you give the option -t, it uses multitasking, and it runs this very simple procedure: it launches tasks, and every task gets a unique identification number. What you see here is a generic procedure, the body of a generic procedure, and the generic parameter is the procedure Job. Job is what I provide for each application, and I can tell each task specifically what to do based on its identification number. So the tasks actually work on job queues.
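To give an idea of what such a launcher looks like, here is a minimal sketch; it is not the PHCpack source, and the names Run_Workers, Worker and Job are illustrative assumptions.

```ada
--  A minimal sketch of the launcher described above, not the PHCpack
--  source; the names Run_Workers, Worker and Job are illustrative.
generic
   with procedure Job (Id, Nb_Tasks : in Positive);
procedure Run_Workers (Nb_Tasks : in Positive);

procedure Run_Workers (Nb_Tasks : in Positive) is

   task type Worker (Id : Positive);

   task body Worker is
   begin
      --  Each task decides from its identification number which
      --  jobs in the queue are its own.
      Job (Id, Nb_Tasks);
   end Worker;

   type Worker_Ptr is access Worker;
   Crew : array (1 .. Nb_Tasks) of Worker_Ptr;

begin
   for I in Crew'Range loop
      Crew (I) := new Worker (Id => I);   --  launch worker number I
   end loop;
   --  Run_Workers is the master of the allocated tasks, so it returns
   --  only after every Worker has terminated; relaunching a fresh batch
   --  of workers afterwards is then a simple way to synchronize.
end Run_Workers;
```

An application instantiates it with its own job procedure, for instance procedure Run is new Run_Workers (Job => My_Job); and then calls Run (8); to spread the job queue over eight tasks (My_Job being a hypothetical name).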
Okay, so if you want to write multitasking code, there are two main issues to consider. The first one is memory. Every task has its own stack, but they all share the same heap, so if you allocate and deallocate, you had better do that outside the task routines. For everything you do on these power series, you allocate the auxiliary vectors outside and pass pointers to those auxiliary data structures in the arguments of the jobs. The second thing to worry about is the granularity of your computations: there should be enough parallelism. Typically you have a limited number of tasks, but the tasks need sufficiently many jobs to do. Sometimes you need to synchronize; synchronization means that there is a point up to which all the tasks have to wait before they continue. The easiest way is to stop the tasks and then relaunch them, and that is how we do the synchronization right now. So those are the two main issues. I actually have three implementations of my power series library already, so don't give up: first you get the functional correctness, but then you have to think about it in a whole different way once you turn it into a shared memory parallel computation.
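Coming back to the first issue, the memory discipline, here is a minimal sketch of that pattern, assuming the Run_Workers launcher sketched earlier is compiled as a library unit; the names Demo_Workspace, Series and Workspace are mine, not PHCpack's. All allocation and deallocation happens outside the task bodies, and each job only touches the auxiliary vector reserved for its own identification number.

```ada
with Ada.Unchecked_Deallocation;
with Run_Workers;   --  the generic launcher sketched earlier

procedure Demo_Workspace is

   type Series is array (0 .. 15) of Long_Float;  --  stand-in for a truncated power series
   type Series_Ptr is access Series;
   type Workspace is array (Positive range <>) of Series_Ptr;

   procedure Free is new Ada.Unchecked_Deallocation (Series, Series_Ptr);

   Nb_Tasks : constant Positive := 4;
   Work     : Workspace (1 .. Nb_Tasks);   --  one auxiliary vector per task

   --  The job never allocates: it only reads and writes Work (Id),
   --  the slot reserved for its own identification number.
   procedure Job (Id, Nb_Tasks : in Positive) is
      S : Series renames Work (Id).all;
   begin
      for K in S'Range loop
         S (K) := Long_Float (Id * (K + 1)) / Long_Float (Nb_Tasks);
      end loop;
   end Job;

   procedure Run is new Run_Workers (Job);

begin
   for I in Work'Range loop            --  allocate before launching the tasks
      Work (I) := new Series'(others => 0.0);
   end loop;
   Run (Nb_Tasks);                     --  returns only after all tasks terminated
   for I in Work'Range loop            --  deallocate after the join
      Free (Work (I));
   end loop;
end Demo_Workspace;
```

Because Run returns only after all workers have terminated, freeing the work space after the call is safe.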
And then, of course, there are always other issues, and I will point out some of them. Okay, so what do we do? We evaluate and differentiate.
This is something I only learned when I was working on the parallelism. You may have seen the rules to differentiate, to compute all partial derivatives. The cool thing is that for a product of n variables, doing this symbolically costs on the order of n squared multiplications, but you can actually do it in linear time, linear in the number of variables. You can see it with this simple product of variables: on the left are the names of the variables, with their funny names, and on the right you see all the star operations. The point is also that the star is not a multiplication of plain numbers: we multiply power series with each other, and the coefficients are typically multiprecision numbers, double doubles or quad doubles, so there is a lot of arithmetic overhead in every multiplication. You really want to save on the number of multiplications, and this is the saving step that makes everything run so well in our parallel computations.
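Here is a minimal, self-contained sketch of that linear-time trick, using plain floating-point numbers instead of power series (the names are mine): all partial derivatives of the product x(1)*x(2)*...*x(n) follow from forward and backward partial products, at a cost of roughly 3n multiplications instead of the n squared of the naive symbolic approach.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Demo_Gradient is

   type Vector is array (Positive range <>) of Long_Float;

   --  All partial derivatives of x(1)*x(2)*..*x(n) in linear time:
   --  forward products F(k) = x(1)*..*x(k), backward products
   --  B(k) = x(k)*..*x(n), and then D(k) = F(k-1)*B(k+1).
   --  Assumes X and D both range over 1 .. n with n >= 2.
   procedure Gradient_Of_Product (X : in Vector; D : out Vector) is
      N : constant Positive := X'Last;
      F, B : Vector (1 .. N);
   begin
      F (1) := X (1);
      for K in 2 .. N loop
         F (K) := F (K - 1) * X (K);
      end loop;
      B (N) := X (N);
      for K in reverse 1 .. N - 1 loop
         B (K) := B (K + 1) * X (K);
      end loop;
      D (1) := B (2);
      D (N) := F (N - 1);
      for K in 2 .. N - 1 loop
         D (K) := F (K - 1) * B (K + 1);
      end loop;
   end Gradient_Of_Product;

   X : constant Vector (1 .. 4) := (2.0, 3.0, 5.0, 7.0);
   D : Vector (1 .. 4);

begin
   Gradient_Of_Product (X, D);
   for K in D'Range loop   --  expect 105, 70, 42, 30
      Put_Line (Long_Float'Image (D (K)));
   end loop;
end Demo_Gradient;
```

In the actual application every multiplication is a product of truncated power series with double double or quad double coefficients, which is exactly why saving multiplications pays off so much.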
On top of that there is a straightforward parallelism: a polynomial, as the name says, has many, many monomials, and you can compute all the monomials independently. So that is the first mathematical idea. The second one is that you can work with a matrix of power series and invert that, which is all very fun, but what you should actually do is look at a power series whose coefficients are matrices. I worked out the simplest example here: we are going to solve a linear system whose matrix collects the partial derivatives we have computed, so its entries are power series again. If you linearize it, you arrive at a block triangular system. And note that you can invert power series: polynomials have no inverse, but power series do, which is kind of the cool thing, and this slide actually shows that as well.
You are really after the updates in your Newton's method, so you have to invert the matrix A naught. Once you have inverted A naught, you adjust the right-hand sides, and this is where the pipelining comes in: you compute the first update, delta x naught, and then you update the right-hand sides, b one and b two, and those updates can happen independently. This is where the parallelism comes in: you can have two tasks working independently in that second step.
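To make the block structure concrete, here is the simplest case in formulas (my notation, truncating at degree two): write the matrix of partial derivatives, the update, and the right-hand side as series with matrix and vector coefficients,

```latex
\[
  (A_0 + A_1 t + A_2 t^2)\,
  (\Delta x_0 + \Delta x_1 t + \Delta x_2 t^2)
  \;=\; b_0 + b_1 t + b_2 t^2 .
\]
Matching the powers of $t$ gives a block lower triangular system:
\[
\begin{aligned}
  A_0\,\Delta x_0 &= b_0,\\
  A_0\,\Delta x_1 &= b_1 - A_1\,\Delta x_0,\\
  A_0\,\Delta x_2 &= b_2 - A_1\,\Delta x_1 - A_2\,\Delta x_0 .
\end{aligned}
\]
```

Only A naught has to be inverted or factored; once delta x naught is known, the corrections to b one and b two are independent of each other, which is exactly why two tasks can work on them at the same time.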
At this coarse granularity you have a maximal speedup of two, no matter how large your matrices are. Of course, once the matrices get larger, you had better use a multitasked QR factorization, and we have also played with that.
So this is kind of the bottleneck at this point, and it is still work in progress. These are timings that I ran yesterday in my hotel room on this laptop, to see how far I can go with the degrees of the truncated power series. This was done in quad double arithmetic, on a 10-dimensional benchmark system; if you are really curious, you can find via my website what the polynomials look like. With degree 16, doubling the number of tasks at each step, the best speedup I get is close to four; the polynomial system is very, very mild as far as nonlinearities go. You can read these columns from top to bottom, but I also like to read them diagonally. If you go from degree 32 to degree 64, you roughly double the size of the problem, but the multiplication of power series is not a linear operation, so you get almost a tenfold increase in work. If you then use 16 tasks, that is kind of the sweet spot where you still get a little bit more speedup, and the time roughly doubles, from 2.3 to 5.2 seconds. At that point, well, this is ongoing work: at degree 64 the fan of this laptop starts blowing, and I don't want to exhaust it.
On a bigger server it is actually much more challenging, because there the precision, the double precision, already starts to deteriorate a little, so we are still working on getting the precision right. But as far as the parallelism goes, it is fairly simple to implement, and you can focus on the mathematical difficulties, on what really matters for your application. We have five minutes left for questions. Thank you very much.
Yes, so the question is: how do we actually get the best speedup with 88 tasks? The processors indeed support hyper-threading. But then I don't get a speedup of 88; the best I get is close to 60 somehow. And this is really for the polynomial evaluation, where every monomial can be evaluated independently. The second question is: how much do we use of the vector instructions? That is a very good question, and I don't know exactly. I do know that when I do my linear algebra column-wise, like Fortran does, I get better performance, but the performance deteriorates when I use complex numbers. If I do plain floating-point, column by column, the compiler can still do the vectorization correctly, but I still have to figure out how the compiler handles it for complex numbers. And then there are also complex double doubles and complex quad doubles.
So that is another challenge, yes. Thank you. We have a few minutes before you leave. I remind you that this whole devroom has been organized by Dirk Craeynest, who unfortunately is not here. For those who were not here at the beginning: he broke his leg, so I took over his duties. But he did a terrific job in organizing things, getting the speakers together and getting the room again. And I got the message that he is watching us on the live stream, so please, a big applause for Dirk.