Performance Python for Numerical Algorithms
Formal Metadata
Title: Performance Python for Numerical Algorithms
Part Number: 62
Number of Parts: 119
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/20050 (DOI)
Production Place: Berlin
Transcript: English (auto-generated)
00:15
Welcome to the Performance Python talk for numerical algorithms. My name is Yves Hilpisch.
00:21
I'm founder and managing director of The Python Quants. As the name suggests, we are mainly doing work in the financial industry, so my examples will be primarily financial, but I think they apply to many other areas. So if you're not from finance, you won't have trouble translating what I present today to your specific domain area.
00:41
Before I come to the talk, a few words about me and us. As said, I'm founder and managing partner of The Python Quants. I was a lecturer for mathematical finance at Saarland University. I'm co-organizer of a couple of conferences and organizer of a couple of meetups, actually all centered around Python and quant topics. I've written the book Python for Finance,
01:02
which will come out at O'Reilly this autumn. It's already out as an e-book; I will show it later on. And another book, Derivatives Analytics with Python, which will be published by Wiley Finance next year, probably. Apart from Python and finance, I like to do martial arts, actually. This is the book, and today's talk is based on chapter 9 of the Python for Finance O'Reilly book.
01:22
As I said, it's already out as an early release e-book, and the printed version will probably come out at, well, let's say mid-November is kind of the date. I'm finished with my editing; I hope O'Reilly will come up with the printed, final version pretty soon. There's also a course out right now
01:43
on Quantshub, actually, which also covers the topics that are presented today. It's a completely online-based one; maybe you want to have a look. If you come from the finance industry, I think the benefits are the highest in this area. Otherwise, we are at the moment mainly working on what we call the Python Quant Platform.
02:03
We want to provide a web-based infrastructure for quants working with Python, applying all the nice things that I present today. I will show it quickly later on, maybe with a couple of examples. We have integrated the IPython Notebook there; you have a Python shell, easy file management, Vim editing,
02:23
so anything you need in addition. We also provide our proprietary analytics suites, DEXISION and DX Analytics, on this platform. That's enough about us. Now to the talk: what is this talk about, actually? When it comes to performance-critical applications, two things should always be checked from my point of view:
02:42
am I using the right implementation paradigm (sometimes this boils down to what is typically called idioms), and am I using the right performance libraries? Many of you might have heard the prejudice that Python is slow, and of course Python can be slow, but I think if you're doing it right, Python can be pretty fast, and
03:02
one of the major means in this regard are the performance libraries that are available in the Python world. I can only briefly touch upon all of these that are listed here; I think yesterday there was the talk by Stefan about Cython. About any topic that you see here you could have complete talks or even complete tutorials for a day, or even for a week for some, so it's a complex topic.
03:25
But my talk is more about showing what can be done. The main approach will be to say: this is what it was before, it was a little bit slow; then we applied this and that, and afterwards we see these improvements. We don't go that much behind the scenes, and we don't do any profiling during this talk.
03:41
But you will see that in many cases, when it comes to numerics, Python and these libraries can help in improving performance substantially. Let me come to the first topic, Python paradigms and performance. As I said, what I call a paradigm here in this context is usually called an idiom. For example, this is just a function that you see here; don't try to get the details.
04:04
This is just a function I will use regularly; I have provided it here in the slides so that you can use it afterwards as well. It's just a function to compare different implementation approaches a little bit more systematically and to compare performance a little bit more rigorously, but there's nothing special about that.
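For reference, a sketch of such a comparison helper, along the lines of the book's perf_comp_data function (the names and output format here are assumptions, not necessarily the talk's exact code):

    from timeit import repeat

    def perf_comp_data(func_list, data_list, rep=3, number=1):
        # compare average execution times of functions, each applied to
        # its respective data set (both passed by name as strings)
        res_list = {}
        for name, data in zip(func_list, data_list):
            stmt = name + '(' + data + ')'
            setup = 'from __main__ import ' + name + ', ' + data
            results = repeat(stmt=stmt, setup=setup, repeat=rep, number=number)
            res_list[name] = sum(results) / rep
        best = min(res_list.values())
        for name in sorted(res_list, key=res_list.get):
            print('function: %s, av. time sec: %9.5f, relative: %6.1f'
                  % (name, res_list[name], res_list[name] / best))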
04:20
Let me come to the first use case: a mathematical expression that doesn't make too much sense in itself. We have a square root, an absolute value, transcendental functions in there and a couple of other things happening. You might encounter these or similar expressions in many areas; as I mentioned before, in finance and mathematical finance you have these, in physics, and, you name it, in almost any science as of today
04:44
you find such or similar numerical expressions. We can implement this pretty easily as a Python function. It's a single equation, so let me translate it into what is mainly a single-line function. Nothing special about that.
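As a sketch (the exact constants are assumptions based on the book's chapter 9), the expression and its one-line Python translation could look like this:

    from math import sin, cos

    def f(x):
        # square root (via ** 0.5), absolute value and transcendental functions
        return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)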
05:01
What we want to do, however, is to apply this particular function to a larger array, to a list object in the first case, with 500,000 numbers actually. This is where the computational burden usually comes in: you have a huge data set and you want to apply this expression to that huge data set. It's not that the single equation is complex;
05:21
it's in the end the mass of the computations that typically makes it slow. So, to start with, we generate a list object, simply using range with 500,000 numbers. What we then do is implement another function which uses our original function f, implementing the numerical expression with simple looping. This is the first implementation out of six that I present. So this is a
05:47
pretty simple, straightforward function where there is a for loop in there: we have another list object, we just calculate the single values and append the results to that other list object, and the function then returns the list with the results.
06:03
The second paradigm, or idiom if you like, is to use a list comprehension. Actually, the same thing is happening as before, but it's a much more compact way to write the same thing: we generate, in a single line of code, the result list by iterating over our list object a and collecting the values that the function f returns. A little bit more compact, and maybe better readable; if you're a Python coder you might prefer this one.
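A minimal sketch of these first two idioms, using the f from the sketch above and a list a with 500,000 numbers:

    a = list(range(500000))    # the list object with 500,000 numbers

    def f1(a):
        # explicit for loop, appending result by result
        res = []
        for x in a:
            res.append(f(x))
        return res

    def f2(a):
        # list comprehension: the same thing in a single line
        return [f(x) for x in a]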
06:25
Collect the results given the values that the function f returns a little bit more compact Maybe better readable if your pies encoder you might prefer this one We Can also do it this is quite flexible. I wouldn't suggest to do it in that case We will see this will be the slowest one, but it's very flexible when we are working for example with
06:44
classes and objects where we value derivatives: derivatives have complex payoffs and so forth, and you can describe these in a string format. It makes it pretty flexible to provide different payoffs for these classes, and this is for example one area where we use it. But typically, when we use it, it's only once that we have to evaluate the expression.
07:03
In this case, however, you might notice that the expression is evaluated once per iteration of the list comprehension. As we will see, this is a very compute-intensive, or rather interpreter-intensive, way to do it: every time we iterate, the expression has to be evaluated again, and this will make it pretty slow, as we will see.
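A sketch of this eval-based variant; the expression string is an assumption consistent with the f above:

    def f3(a):
        # the expression as a string; eval re-parses and evaluates it
        # for every single element of the list, which is flexible but slow
        ex = 'abs(cos(x)) ** 0.5 + sin(2 + 3 * x)'
        return [eval(ex) for x in a]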
07:22
Of course, if you're working in numerical science, you are used to the vectorization approach of NumPy. What we can do is implement the same thing, this time on a NumPy ndarray object, which is especially designed to handle such data sets and data structures, and
07:45
with a single line of code and using vectorized expressions we can accomplish the same thing. So now we are working on NumPy ndarray objects and using NumPy universal functions to calculate what we are interested in. The syntax is similar to the list comprehension, but in the end we would hope for a speed improvement, because NumPy is especially designed to do exactly these kinds of operations.
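A sketch of the NumPy-based variant, with the data now held in an ndarray:

    import numpy as np

    a_np = np.arange(500000)    # ndarray counterpart of the list object

    def f4(a):
        # NumPy universal functions evaluate the expression on the whole array at once
        return np.abs(np.cos(a)) ** 0.5 + np.sin(2 + 3 * a)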
08:02
Then we can also use a dedicated, specific library, which is called numexpr, for numerical expressions. Here we again provide the whole expression as a string object, but in this case
08:23
what actually happens is that this string object, this expression, is compiled only once and then reused afterwards. And here again we are working on a NumPy ndarray object; numexpr is especially designed to work on NumPy ndarray objects. So in this case you would also hopefully see some kind of improvement, because it's a dedicated, specialized library to attack these kinds of problems.
08:42
You might have noticed that in the first numexpr example I have set the number of threads to one, to have a benchmark value; we're only using one thread, one core, in that case. Here I increase the number of threads to four, so if you have a four-core machine you would expect a further improvement. But what kind of improvement? Let us have a look.
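A sketch of the two numexpr variants, single-threaded as a benchmark and multi-threaded:

    import numexpr as ne

    def f5(a):
        # numexpr compiles the expression string once; single thread
        ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
        ne.set_num_threads(1)
        return ne.evaluate(ex)

    def f6(a):
        # same expression, but evaluated with four threads
        ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
        ne.set_num_threads(4)
        return ne.evaluate(ex)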
09:07
In summary, again, we have six different paradigms or idioms used with Python, and in the end each of them delivers the same result. As is often the case, when you see people coming from other languages,
09:21
coming to Python, being new to Python and not knowing all the idioms, they are probably using those that they are used to from other languages like C or C++, you name it. Sometimes this can be a pitfall in the sense that they're using maybe the wrong paradigm, the wrong idiom. But let's have a look at what the differences are.
09:41
Now our comparison function comes into play, and we have a clear winner: obviously the multi-threaded version, f6, the last one, where we're using multiple threads to evaluate the numerical expression on the ndarray object. Then we have the single-threaded numexpr one, which is the second fastest, and the third one is the NumPy version,
10:03
and then the pure Python ones follow after that. So, compared to the list comprehension for example, we have a 28 times increase in performance using the multi-threaded numexpr version. And as I mentioned before, f3,
10:20
this was the one which uses the built-in eval function of Python, and you see that against it we have a total speed-up here of 900 times. These numbers can vary, of course, depending on the number of threads you are using and so forth. But the message, I think, should be clear: we have in Python many ways to attack the very same problem, and all of the ways will yield the same results,
10:41
but there might be considerable performance improvements when going the right route, avoiding pitfalls and especially avoiding implementations that are per se compute-intensive. This is, for example, where profiling would come into play. I don't present it here; as I said, my approach is more like: this is before, then we do something, we compare it, and this is what it is afterwards.
11:02
But profiling, of course, would have revealed that eval is a very time-consuming function and that most of the time is spent there, for example with f3 in this type of implementation. Let me come, just briefly, to a rather subtle thing. We've seen that the numerical
11:22
algorithms implemented based on NumPy ndarray objects, be it directly by the use of NumPy universal functions or by using numexpr, have been the fastest. But there are certain circumstances, and I encountered this quite a while ago, where at first I didn't know what was going on,
11:42
but later on it became pretty clear. So it's, from my point of view, worth mentioning: depending on the specific algorithm that you're using, the memory layout can indeed play a role when it comes to performance. Let me start with a typical NumPy ndarray object,
12:00
which we instantiate by providing the dtype, in this case float64, and here the order, or the memory layout, comes into play. We have two choices with NumPy: there's C for C-like layout and F for Fortran-like layout. In this case you don't see any difference, there's nothing special, you just see the numbers printed out. But don't get confused, because this is just a graphical representation of the data that is stored at the moment.
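A small sketch of what this instantiation looks like:

    import numpy as np

    x = np.array([[1., 1., 1.],
                  [2., 2., 2.],
                  [3., 3., 3.]])
    C = np.array(x, dtype=np.float64, order='C')  # C-like (row-major) layout
    F = np.array(x, dtype=np.float64, order='F')  # Fortran-like (column-major) layout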
12:24
But with an array like this you can explain what the memory layout is all about. When we have the C layout, it has row-wise storage: we provided the order C, and this means that the ones, the twos and the threes are stored next to each other.
12:42
This is how it is laid out in memory. I mean, memory is a one-dimensional thing, so everything has to be stored at a unique location in memory. We don't have two-dimensional or n-dimensional things where we can store data; it's a linear thing, so we have to decide how to put multi-dimensional things into memory,
13:00
and this is how it is stored when you use the order C. Using the order F, we instead have column-wise storage, which means that the one-two-three, the next one-two-three and the other one-two-threes are stored next to each other. Working with such small NumPy ndarray objects doesn't make that much of a difference, but when you come to larger ones, and in particular to asymmetric ones like this one, where we have three times
13:27
1.5 million elements in there, then we can expect some differences in performance. We instantiate two different ndarray objects here, the one with the order C and the other one with F, just to compare them.
13:42
We now start calculating sums. For the C order, for example, you already see a difference when you're calculating the sums over the different axes. NumPy is of course aware of the axes; with list objects, when you construct two-dimensional things out of list objects, there is no such awareness, there is no such attribute
14:00
for the axis. But in this case we can calculate the sum row-wise or column-wise, if you like, and you see there's a huge difference here, like a 50% difference, between the two different axes just for the performance of calculating sums. The one delivers back a 1.5-million-element one-dimensional result,
14:20
the other one returns a result which has only three elements in this case, but of course the numerical operations run differently over memory in the two cases. For standard deviations you observe the same thing: according to the results here, going along axis 1, which means the second axis (with zero-based numbering, of course), is much faster than the other way around.
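A sketch of the kind of comparison being described, with an assumed 3 x 1.5 million array (the exact timings on the speaker's machine will of course differ):

    import numpy as np
    from timeit import timeit

    x = np.random.standard_normal((3, 1500000))
    C = np.array(x, order='C')    # row-major layout
    F = np.array(x, order='F')    # column-major layout

    # sums over both axes; the relative performance depends on the memory layout
    for arr, name in [(C, 'C'), (F, 'F')]:
        for axis in (0, 1):
            t = timeit(lambda: arr.sum(axis=axis), number=10)
            print('%s order, sum over axis %d: %.4f s' % (name, axis, t))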
14:45
So if you have these problems and you have to implement something, it might be worth considering whether it really makes sense to have a 3 x 1.5 million array or a 1.5 million x 3 array. You will see considerable performance improvements going one way or the other, depending on what exactly you're
15:01
interested in when it comes to the calculations. Now the sums with the F-order ndarray object: you see these operations are actually both slower, absolutely slower than the C-order operations, but you see different relative performances. In this case, doing the sum along axis 0,
15:22
which means the first axis, is faster relative to the other axis. The same actually holds true (well, not really, it is pretty close) for the standard deviations. The absolute disadvantage might be due to the fact that C is the default and the C world in NumPy is a little bit
15:40
more maintained, or more important. But once you have to interact, for example, with the Fortran world and you are required, so to say, to work with the F order, it might again make sense to consider the question: is 3 x 1.5 million better, or 1.5 million x 3? You will see, in certain cases,
16:00
huge differences. Let me come to another approach which, like all the approaches that I present today, is in a certain sense a low-hanging fruit. There's typically not that much involved when it comes to, for example, the redesign of the algorithm itself, so I don't do any redesigns of algorithms;
16:22
I'm always sticking to the same problem, to the same implementation, and then showing what we can do. Sometimes you will see that, of course, using different libraries you need to rewrite something, but what I will present now is not about, for example, the parallelization of a certain algorithm itself. It's more like: given a certain implementation of an algorithm, what can I do with parallel computing? And
16:43
as I already announced before, I tend to use financial examples, and here is a Monte Carlo problem, which is about the simulation of the Black-Scholes-Merton stochastic differential equation, which is a standard geometric Brownian motion
17:00
and which is also applied in many other areas, in physics and so forth. What we want to do is simulate this model and value a European call option. Don't worry about the details; it's only to say that this is usually a very compute-intensive algorithm to work with, one that usually benefits from parallelization. But first the implementation of the algorithm: what I do here is already an implementation that I wouldn't call optimized,
17:26
but at least I'm using NumPy and vectorization approaches to be faster than, for example, the typical Python looping that we have also seen as an alternative before. I could have done this with pure Python as well, but that is the point here: I want to stick with this NumPy implementation and see what we can do when we parallelize the code.
17:44
You see the import statement here within the function, because when we use IPython parallel, which I will do here, the whole thing will be pickled, and we have to import within the function to get everything to the single workers.
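A sketch of such a Monte Carlo valuation function and the sequential loop over strikes (the parameter values are assumptions in the spirit of the talk, not the exact ones used):

    def bsm_mcs_valuation(strike):
        # Monte Carlo estimator for a European call under Black-Scholes-Merton;
        # NumPy is imported inside so the function pickles cleanly for the workers
        import numpy as np
        S0, T, r, vola = 100., 1.0, 0.05, 0.2
        M, I = 50, 20000                       # time steps, simulated paths
        dt = T / M
        rand = np.random.standard_normal((M + 1, I))
        S = np.zeros((M + 1, I))
        S[0] = S0
        for t in range(1, M + 1):
            S[t] = S[t - 1] * np.exp((r - 0.5 * vola ** 2) * dt
                                     + vola * dt ** 0.5 * rand[t])
        return np.exp(-r * T) * np.sum(np.maximum(S[-1] - strike, 0)) / I

    def seq_value(n):
        # sequential benchmark: value the option for n different strikes
        import numpy as np
        strikes = np.linspace(80, 120, n)
        option_values = [bsm_mcs_valuation(strike) for strike in strikes]
        return strikes, option_values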
18:01
First, as a benchmark, of course, the sequential calculation. This example is only about calling the same function a couple of times and parameterizing the function with different strike prices in this case. But again, you can replace this with any similar function you're aware of from your specific area. What we're doing here is indeed just looping over the different strikes we are interested in and
18:23
collecting the results that we get back from the function. Nothing special: a simple loop collecting results, and we're finished. You see here we do it for a hundred option calculations, and we get back the strikes, the list of strikes, and the results from our function. In this case the hundred calculations take 11.46 seconds.
18:46
Let's visualize the results so that you get a feel: going over the strikes, well, for a European call option the higher the strike, the lower the value. This is what we would expect, so obviously the function works pretty well. Now the parallel calculation.
19:01
We use IPython parallel here. There are many alternatives (I've already seen Celery, and I know that there will be a couple of talks about alternatives), but IPython parallel is, as I said, a low-hanging fruit. Many people are working with the IPython Notebook these days, and the two are well integrated. So we can just import from IPython.parallel our Client class
19:21
and instantiate a client. In the background, for example using the IPython Notebook dashboard, I should already have fired up either a local cluster or, when really working in the cloud or with cloud-based services, you can have huge clusters. The largest ones I've heard of were about 512 nodes; IPython parallel is known to be not that stable when it comes to, say, a thousand nodes.
19:45
So it doesn't really scale beyond a certain point, but still, for example for people doing research or for smaller applications, it's a pretty efficient way. What I'm doing here, once I have a client given my profile (my local cluster, for example), is to generate a load-balanced view, and the code that I need
20:03
to do the same as I've been doing before with the sequential calculation is almost the same; just two differences are actually worth mentioning. In this case I don't directly call the function; I rather apply my function asynchronously, given the parameterization, to my view.
20:22
I append the results and have to wait until all the results have been returned, otherwise the whole thing will break down. So these are only two lines, if you like, attached to the code, and this is not even in the algorithm, this is just how I collect the results. There is not that much overhead: compared to the sequential implementation we might have three or four new lines in total, and one line of code has been changed to implement the different approach.
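A sketch of the parallel version with IPython parallel (assuming a local cluster has already been started, for example via the notebook dashboard):

    from IPython.parallel import Client

    c = Client(profile='default')      # connect to the running cluster
    view = c.load_balanced_view()      # load-balanced view on the workers

    def par_value(n):
        # same loop as before, but the function is applied asynchronously
        import numpy as np
        strikes = np.linspace(80, 120, n)
        option_values = []
        for strike in strikes:
            value = view.apply_async(bsm_mcs_valuation, strike)
            option_values.append(value)
        c.wait(option_values)          # wait until all results have been returned
        return strikes, option_values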
20:48
And now the parallel execution. I'm a little bit surprised: why does it take 29 seconds? Ah, the wall time is not the right one to look at here:
21:04
I've been looking at the wall time, but the total time for the execution was actually five seconds in this case, because we have used multiple cores. So it speeds up by a factor. Where did we start? Let me go back: 11 seconds and a little bit, yes, 11.4 seconds,
21:21
and we end up here on this machine at five seconds total time. But to have a little more rigorous comparison, I come back to the performance comparison by again applying my comparison function. Here you might have noticed that implementing this approach leads to different result objects, actually:
21:41
we don't get back only the number; what we get back is the whole set of results and the metadata which the single jobs are returning. For example, having a look at the metadata, you get back much more information, like when it was completed, when it was submitted and so forth. But we're mainly interested, of course, in the result, and this can be retrieved via an attribute of the result objects.
22:01
Here is the attribute, and in the end I can, via another loop, collect all the single results from the parallel application of the algorithm and compare the sequential and the parallel calculation. Of course, there are numerical differences, because we are working with a numerical algorithm which implements simulations, so we would expect numerical errors or differences even if we're doing the same thing with the parallel or the sequential approach.
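Collecting the actual numbers from the asynchronous result objects might then look like this (the attribute name is assumed from the IPython parallel API of that time):

    strikes, option_values_obj = par_value(100)

    # each element is an asynchronous result object; metadata and the
    # value itself are available as attributes
    option_values_par = [res.result for res in option_values_obj]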
22:25
But what we are most interested in is the performance comparison, and to this end we use the comparison function from before again. You see that working on a machine with four cores in it
22:40
leads to a speed-up of roughly 3.1. Of course you have an overhead for distributing the jobs, for collecting the results and so forth, but in the end you will see that applying this approach typically scales almost linearly in the number of cores, not in the number of threads; hyper-threading for these kinds of operations doesn't bring that much. But you would usually see, as I said,
23:03
almost linear scaling in the number of workers. For example, we have another server with eight cores on which we use these approaches, and there you see speed-ups of seven point something. Again, not that much overhead is involved, and we haven't changed the algorithm at all,
23:21
and by investing maybe an hour of work or whatever, you might improve your numerical computations considerably. If you're only working locally and are not interested in spreading the parallelization over whole clusters or whatever, there is of course the built-in multiprocessing module.
23:42
IPython parallel, as presented, scales over small and medium-sized clusters, but sometimes it is helpful to parallelize code just on local machines. I don't know the percentage, but most machines as of today, even the smallest notebooks, have multiple cores, and even using two cores might already lead to significant speed-ups. When you now think of a larger desktop machine with four or eight cores,
24:03
you will also see significant improvements. And again, the fruits are low-hanging in this case as well. We import multiprocessing as mp, and our example algorithm here is again a Monte Carlo simulation. This one doesn't do the valuation, so it doesn't do exactly the same thing; it only does the simulation,
24:21
so there's not that much of a difference. We have a different parameterization here that we apply, but in the end it's the same core algorithm that we use to compare the performance. What this does is give us back simulated paths; in our case they will be stock prices, but many things in the real world, in physics and so forth, are simulated that way.
24:43
I mean, Brownian motion was invented, so to say, in the first place for describing the movement of particles in water. So it comes from physics, but the finance guys have adopted many of the approaches used in physics. We are simulating paths over time, so to say.
25:03
What we now do here is a more, let's say, rigorous comparison; or not rigorous, but what we do is change the number of threads that we use for the multiprocessing implementation in a kind of test series. It's run on a notebook with a four-core i7, and we use the following parameters: 10,000 paths that we simulate, and the number of time steps is 10.
25:25
And what we want to do is 32 simulations, which translates to the number of tasks that have to be accomplished in this case. It's simple looping over a list object starting from one and ending at eight (so we start with a single thread and end with eight threads), and you see there's not that much code involved.
25:46
It's actually pretty comparable to the IPython parallel example. You just have to define a pool of processes, or pool of workers, that we use for the implementation, and then we map (in this case; there are different approaches, I must say) our function to a set of input parameters.
26:03
Actually, it works pretty much the same as the map functional-programming statement in Python: we map our function to our set of parameters and say, well, please go ahead, and in the end we wait for everything to finish and append the times that it takes for the single runs.
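A sketch of this multiprocessing test series (the simulation function and its parameters are assumptions in the spirit of the talk):

    import time
    import numpy as np
    import multiprocessing as mp

    def simulate_geometric_brownian_motion(p):
        # simulate I paths of a geometric Brownian motion with M time steps
        M, I = p
        S0, r, sigma, T = 100., 0.05, 0.2, 1.0
        dt = T / M
        paths = np.zeros((M + 1, I))
        paths[0] = S0
        for t in range(1, M + 1):
            paths[t] = paths[t - 1] * np.exp((r - 0.5 * sigma ** 2) * dt
                        + sigma * dt ** 0.5 * np.random.standard_normal(I))
        return paths

    if __name__ == '__main__':
        I, M, t = 10000, 10, 32          # paths, time steps, number of tasks
        times = []
        for w in range(1, 9):            # one to eight worker processes
            pool = mp.Pool(processes=w)
            t0 = time.time()
            pool.map(simulate_geometric_brownian_motion, t * [(M, I)])
            times.append(time.time() - t0)
            pool.close()
            pool.join()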
26:21
But as always, a picture says more than a thousand words, and you see here: for the 32 simulations with a single worker we start at a time approaching almost 0.7 seconds, and we come down to,
26:41
well, something about 0.15 seconds. But you see that multi-threading doesn't bring that much in this case: around four or five, in this particular case at five, we have the lowest execution time. Still, the benefits are pretty high; again it almost scales linearly with the number of cores available, not with the number of threads,
27:03
but with the number of cores available, here for our 32 Monte Carlo simulations. And as you have seen, it's mainly only two lines of code that accomplish the whole trick. Let me come to another approach. So far we haven't really touched the code, the implementation; for the last two examples we just took an implementation and parallelized it.
27:23
But more often than not you want to try first to optimize what is actually implemented, and one very efficient approach is dynamic compiling. There is a library available called Numba. This is an open-source, NumPy-aware optimizing compiler for Python code, which is developed and maintained by Continuum Analytics, and it uses the LLVM compiler infrastructure. This makes it,
27:47
for a couple of application areas, a really efficient way to collect the benefits, the low-hanging fruits that I am mentioning so often right now.
28:00
It is sometimes really surprising, because we will see there is not that much effort, not that much overhead involved, but usually you can expect, given a certain type of problem, really high speed-ups. First an introductory example, before I come to a more realistic, real-world example. The example is only about counting the number of loop iterations, but counting in a slightly complex fashion:
28:22
we have the transcendental cosine function here and we also calculate a logarithm, but in the end this nested loop structure doesn't do anything else but count the number of iterations, nothing more than that. But we know that looping on the Python level is typically expensive in terms of performance and time spent,
28:41
and we see it here: when we parameterize this looping structure with 5,000 by 5,000, it takes about 10.4 seconds to execute. In the end we have looping, which shouldn't come as a surprise, over 25 million iterations in this case. The benchmark, again: 10.4 seconds to remember.
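A sketch of such a nested counting loop (the exact form on the slide may differ):

    from math import cos, log

    def f_py(I, J):
        # nested Python loops; cos(log(1)) is always 1, so in the end this
        # just counts the iterations in a deliberately expensive way
        res = 0
        for i in range(I):
            for j in range(J):
                res += int(cos(log(1)))
        return res

    # f_py(5000, 5000) counts 25,000,000 iterations in pure Python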
29:05
We can of course take a NumPy vectorized approach to accomplish the same result. It actually wouldn't make much sense just to count loops, but there are typical numerical and financial algorithms that are based on nested loops and that you can
29:21
easily vectorize with NumPy, so this is a very general, very powerful approach. But we'll see what the negative consequences are in this case. Again, the function is pretty compact, in the sense that we just instantiate our array object, which is symmetric in this particular case, and we just do the calculation: we do the summing over the resulting
29:45
array object to which we have applied the logarithm and the cosine function before. I mean, it's always the same value, one, that comes up, but it is nevertheless compute-intensive, and we already see a huge speed-up here.
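A sketch of the vectorized counterpart:

    import numpy as np

    def f_np(I, J):
        # build the full (I, J) array, apply log and cos, then sum;
        # fast, but the intermediate array alone needs I * J * 8 bytes
        a = np.ones((I, J), dtype=np.float64)
        return int(np.sum(np.cos(np.log(a)))), a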
30:01
The execution time is below one second in this case when using the vectorized approach. NumPy, as we know, is mainly implemented in C, and what we are doing here is delegating the costly looping on the Python level to NumPy, and NumPy does it at the speed of C code, which is quite a bit faster, as we see here. Actually, we have a speed-up of more than 10 times.
30:23
But there's one drawback: instantiating such a huge array leads to memory requirements. Here we see we need an array object which in the end consumes 200 megabytes of memory, and that's not a nice feature: you have an algorithm which doesn't really consume any memory, and here, using NumPy vectorization,
30:42
it leads to a memory burn of 200 megabytes. Now think of larger problems, and you will certainly find some where memory doesn't suffice in the end. So this is nice because it's faster, but it consumes lots of memory. If memory is not an issue, you might go that route, but there is an alternative, actually, and this is Numba, which I mentioned before. Again,
31:03
the overhead is minimal in this case: I just import the library, usually abbreviated as nb, and call the jit function for just-in-time compiling. I mean, it's not really just in time: it's not compiled at runtime, it's compiled at call time, actually, but it's called jit here. And I don't do anything with the Python function at all;
31:25
I just leave the Python function, f_py, as it is, and I generate a compiled version of it using Numba. Now, executing this, we see that when I call it for the first time it's still not that fast because, as I said, it's compiled at call time, so the first call involves a huge overhead.
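The Numba step itself is essentially a one-liner (a sketch, using the f_py from above):

    import numba as nb

    f_nb = nb.jit(f_py)     # compiled version of the unchanged Python function

    f_nb(5000, 5000)        # first call: includes the compilation overhead
    f_nb(5000, 5000)        # second call: runs at compiled speed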
31:44
But when I call it for the second time, you see this is then ridiculously fast compared to the Python implementation. Here we get to speeds where we can say: now we can compare to C implementations, to optimized C implementations, because Numba uses the LLVM infrastructure, and on the LLVM level
32:01
there are all these optimizing compilers that compile it optimally for the given hardware at hand. This also works, as I will show later on with a different example, on the GPU, actually. So here we see huge improvements in speed-up, and again I can only stress the point: there's not that much effort involved.
32:21
It's just applying jit to the original Python function, and here you see huge speed-ups for this implementation. So it might be worth considering using Numba when you have a similar problem, say nested loops where we do this and that and so forth. And the beauty,
32:41
which comes on top, is that the Numba implementation is as memory-efficient as the original one. There's no need to instantiate an ndarray object with 200 megabytes or even more, so the memory efficiency remains, and we get these huge improvements just by compiling it with Numba. Binomial option pricing is a very popular, very important numerical algorithm in the financial world,
33:04
so let's see if it works with that as well. Don't worry about the details again; it's just a parameterization of the model. What we have to do here is simulate something, then we calculate the inner values of an option, and we do discounting. So we have a three-step procedure, if you like, and the three steps are illustrated here.
33:25
I can make this maybe a little bit smaller. Again, the code is not that important, but there are two points worth noting. The first is that I do the whole thing based on NumPy arrays: so I do, if you like, Python looping, but based on NumPy arrays; I'm not working on lists with Python loops.
33:43
I have my NumPy ndarray objects and I do Python looping here over my arrays, and you see we have three nested loops to implement this when I go the looping route. It's not that I say you should do that, by no means, but I will show the effect of going different routes afterwards.
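A condensed sketch of such a loop-based binomial valuation (a Cox-Ross-Rubinstein parameterization with assumed option parameters; the talk's version differs in its details):

    import math
    import numpy as np

    S0, K, T, r, sigma = 100., 100., 1.0, 0.05, 0.2   # assumed option parameters

    def binomial_py(M):
        # European call value in a binomial tree with M time steps,
        # implemented with explicit Python loops over NumPy arrays
        dt = T / M
        df = math.exp(-r * dt)                  # discount factor per step
        u = math.exp(sigma * math.sqrt(dt))     # up movement
        d = 1 / u                               # down movement
        q = (math.exp(r * dt) - d) / (u - d)    # risk-neutral probability

        # step 1: the stock price tree
        S = np.zeros((M + 1, M + 1))
        S[0, 0] = S0
        for t in range(1, M + 1):
            for i in range(t + 1):
                if i == 0:
                    S[0, t] = S[0, t - 1] * u
                else:
                    S[i, t] = S[i - 1, t - 1] * d

        # step 2: inner values of the call at maturity
        V = np.zeros((M + 1, M + 1))
        V[:, M] = np.maximum(S[:, M] - K, 0)

        # step 3: step-wise discounting back through the tree
        for t in range(M - 1, -1, -1):
            for i in range(t + 1):
                V[i, t] = df * (q * V[i, t + 1] + (1 - q) * V[i + 1, t + 1])
        return V[0, 0]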
34:02
So just remember: looping over NumPy ndarray objects, with three nested loops. By now we should know that looping on the Python level is costly. What does costly mean in this case? The execution for a given number of time steps takes
34:21
3.07 seconds. Actually, this binomial option pricing algorithm solves the same problem that we attacked before with the Monte Carlo simulation, so we can compare the results. And you see here that the Monte Carlo simulation, which is usually considered to be the most expensive one when it comes to the computational power that is needed, is even faster in this case; it's not that exact,
34:42
I must say, there are numerical errors in there. But: three seconds for the binomial option pricing model here, compared to the 82 milliseconds for our Monte Carlo simulation from before. You see that we get similar results from the two numerical methods; the point is just to say that these two algorithms solve the same problem, in a sense.
35:01
The first improvement: again, we can go the NumPy vectorization route. I said, well, I don't touch the algorithms themselves, and I wouldn't say that I touch the algorithm here; this is just using different idioms, different paradigms, to implement the same algorithm in Python. And here we can again make use of NumPy vectorization. Two minutes left? Oh, okay.
35:25
We can do the NumPy vectorization, and what you see from the vectorization is that it is again already much, much faster. But we now apply jit from Numba and get back a machine-code-compiled version of our Python one.
35:45
Then you see that we again get a speed-up of three times. Comparing this more rigorously, you see here that the Numba version is 54 times faster in this case, and three times faster than the NumPy one. Let me skip through a couple of slides; there is static compiling with Cython as well.
36:05
At this point, before I forget it: if you go to my Twitter feed, dyjh, I have tweeted the links to two presentations: actually this one, and I also have a slide presentation, so you can read up on all of this. I might do it right now, going to
36:23
twitter.com, and this is dyjh, and I have tweeted the links to it. I'm not able to present everything, but here you find the links to the presentations on my Twitter feed.
36:42
Static compiling with Cython works similarly; here I have examples where you also get huge improvements. I skip through that in order to have a couple of minutes left for questions. But again, we can do a performance comparison in this regard, for example here working with floats; if you have a look at it,
37:01
there's no real need to work with floats, but still, with this rigorous performance comparison, when you go back to the algorithm you see I have an implementation using Cython and another one with Numba, and in this case they're actually pretty similar when it comes to performance. With Cython you usually have to touch the code, and you have to do static declarations and so forth,
37:24
but with Numba, sometimes (I don't say always, don't get me wrong) you can get the same speed-ups by just using the just-in-time compiling approach of jit. The last topic is the generation of random numbers on GPUs. I want to spend the last minute on that,
37:43
because this might be useful in many circumstances, and usually it's considered a very hard thing to get the power of GPUs included in your work. What I'm using here is NumbaPro, which is a commercial library of Continuum Analytics, the sister or brother library of Numba, and what I use are the original, native
38:04
libraries that are provided with CUDA in order to generate random numbers. There are not that many specialties included: we just generate random numbers which are stored in a two-dimensional array. Here's the code for the CUDA function.
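A sketch in the spirit of what is shown, assuming NumbaPro's cuRAND bindings of that time (the API names here are an assumption):

    import numpy as np
    from numbapro.cudalib import curand   # NumbaPro's bindings to cuRAND (assumed API)

    def get_cuda_randoms(x, y):
        rand = np.empty(x * y, dtype=np.float64)       # cuRAND fills a flat buffer
        prng = curand.PRNG(rndtype=curand.PRNG.XORWOW)
        prng.normal(rand, 0, 1)                        # standard normal random numbers
        return rand.reshape((x, y))                    # reshape to two dimensions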
38:22
CUDA only gives back a one-dimensional array, so we have to reshape it afterwards, but this is straightforward. What we want to do here is compare the performance for different sizes of arrays of standard normally distributed random numbers that we want to get back. I skip the first slide, because I have implemented a rigorous
38:42
comparison, and what we see here in this one chart almost says it all already: if you just want to generate a few random numbers, so to say, then you see that the CPU might be the better one, because you have overhead involved when you're moving data, when you're moving code, from the CPU to the GPU; there is this overhead, of course.
39:04
But once you reach a certain size of the random number set, you see that for the CPU (it's not even a linear scaling) the time needed increases visibly, and you see that there's hardly any increase in the time needed on the CUDA device to generate the random numbers.
39:20
Again, the message: if you only have a small set of random numbers, don't go to the GPU; there's too much overhead involved, remain on the CPU. But if you're working with really large random number sets (and here the largest one that I'm generating is 400 megabytes in size per random number set), then you see that, of course, the CUDA approach
39:42
pretty much outperforms the NumPy in-memory version on the CPU. So again, it's only a couple of lines of code, it's a single library that you call, and you get all the benefits from that, and you see there's a huge speed advantage
40:02
for the CUDA device over the NumPy one. The last thing I just want to mention is: how about I/O? Python isn't only good when it comes to numerical operations (I had it included in my abstract); Python is also pretty good when you want to harness the power of today's I/O hardware, and usually it's pretty hard to get to the speed limit of the hardware. Here, working with an
40:23
example array object of 800 megabytes, we just natively save it. You can also use PyTables and the HDF5 format and a couple of other things, but it's already built into NumPy that you can save your arrays to disk.
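A minimal sketch of this native saving and loading (the array size is chosen to match the 800 megabytes mentioned):

    import numpy as np

    data = np.random.standard_normal((10000, 10000))   # roughly 800 MB of float64 values

    np.save('data.npy', data)      # write to disk, close to the hardware's speed
    data = np.load('data.npy')     # read it back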
40:43
SST try if you see for the 800 megabytes This is much much faster to save and to load as you see as it is to generate a memory So the in-memory generation of this 800 megabyte array with the memory allocation and the calculation of the random numbers takes five point three seconds But on this machine it only takes two seconds to write it and two seconds to read it
41:01
So you see how fast you can be with Python, and there's no performance trickery involved; this is just batteries included, and Python typically makes it pretty efficient and pretty easy to harness the power of the hardware available today. This brings me to the end. Thank you for your attention.
41:36
Sorry to stand between you and lunch. Just ask a question; I think I can hear you, and I can repeat it.
41:57
Of course, of course.
42:10
Of course, what am I saying: for this particular algorithm we have data parallelism and code parallelism.
42:22
This is the most simple scenario you can think of, of course; I'm pretty aware of that. Yeah, that is of course one of the central protocols one could use here. We haven't used it, actually, but of course there are bindings and many application examples and pretty good use cases for that.
42:43
Okay, so hi, thanks for the talk, loved it. Small question: suppose you were doing a huge time series analysis. It's not in this scope, but obviously that's something that's kind of hard to do in parallel.
43:01
There are algorithms that work very nicely in parallel and there are algorithms that don't. What's your gist on doing things with algorithms that don't work nicely in parallel? Besides compiling, what are some tricks you maybe use? This is just a question out of curiosity. Yeah, time series analysis is of course one thing we are concerned with all day in finance,
43:21
so it's one of my major things, so to say, but I didn't get the question in the end. So, there are algorithms that you can really easily port to something parallel, and there are algorithms where this is not so easy. I can imagine that certain very advanced time series models, say ARIMA-type models, don't parallelize that nicely.
43:42
What's your gist on this: if the problem is hard to parallelize, what's the best tactic to approach it? I mean, of course not everything is well suited to be parallelized. We are, for example, working heavily with least-squares Monte Carlo, where you need the cross-section. Usually you would say: I have a hundred thousand paths, I can parallelize into two times fifty thousand paths.
44:05
It's the same with time series: you could say, I have a hundred thousand observations, I can apply my algorithm to the first 50,000 and the second 50,000; that is one approach to do it. But not every algorithm is well suited for that, because you need the whole history or whatever built up; you need the cross-section of the information in order to have your algorithm produce the results that it's supposed to.
44:26
Usually, I mean, the approach that I presented, using parallelization for an unoptimized algorithm, is usually not the way to go. What you would do in this case, when you say, well, I don't have an algorithm that can be easily parallelized, is of course in any case go for the optimization of the algorithm, by using Cython and everything.
44:45
But there are other libraries, not only Cython; what I haven't mentioned is Theano, for example. If you have a look at PyMC3, it makes heavy use of Theano and just-in-time compiling, where your objects, your classes, are dynamically optimized on the fly for the problem given at hand, using just-in-time compiling or
45:04
call-time compiling, which is a slight difference. But this is typically, I think, the approach that you would take: well, let's first optimize the algorithms with any means that we have available. I agree, not every algorithm is well suited to be parallelized. But again, if you have two time series to analyze, then you can start thinking about that again; this was my point, actually:
45:27
many similar tasks, and this is of course the trivial case for parallelization. So, if you're starting off with serial Python code, I think it's pretty obvious that using these
45:42
parallel tools will make it go faster. But I think lots of people believe that Python is not what you should be using if you want efficient parallelization, because of the additional overhead, because you have to use multiprocessing due to the GIL, and obviously Python is a higher-level language. So I'm not talking about benchmarks, but about real-world applications, actual applications that you have written:
46:04
what do you say to people who would be tempted to stick with C++ to squeeze out that last bit of performance? Do you find that Python is sufficient for everyday needs for these kinds of simulations? So far, we haven't ever gone the route of going to C++ or C for any of our client projects,
46:23
nor for the things that we've been implementing ourselves. We're using, for example, the multiprocessing module for our analytics library, where we have simple, easy scaling and parallelization on the machine at least. Typically these things are run on larger machines with multiple cores and huge memory;
46:40
this is the scenario that I have in mind. I can understand people that have issues with the scaling, or with clusters and so forth, but of course, and Cython is a very good example here, we can say: even using Cython I can decide whether I have 90% Python and 10% C, so to say, or something that looks like 90% C and a little bit of Python; you can have
47:06
anything in between. This is a beauty of Python: you usually don't have these either-or decisions. You can even, after profiling, say, well, this is the real bottleneck of the whole thing, let's go the C route for that part. I recently met, at our Python for Quant Finance meetup in London, a guy
47:21
who said, well, I did something in assembler because we thought this would be best. I don't know if this is the right thing, but still, you have the flexibility, and that's the beauty of Python: it is not either-or, it integrates pretty well, and of course C and C++ are the two worlds that Python interacts with natively. So whenever you say, well, I have this approach and I can do this better with C++,
47:43
why not go that way? Many people are doing that, and in the financial industry it's kind of a standard to do that. But for the things we have been doing (I can only say it again, for our stuff and for the stuff that we've been implementing for our clients) we've always stuck to the Python world, of course using performance libraries and all that. I mean, under the hood this comes down to C and other stuff,
48:02
but not on the top level where we've been implementing things.