Performance Python for Numerical Algorithms
Formal Metadata
Title 
Performance Python for Numerical Algorithms

Title of Series  
Part Number 
62

Number of Parts 
120

Author 

License 
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2014

Language 
English

Production Place 
Berlin

Content Metadata
Subject Area  
Abstract 
Yves  Performance Python for Numerical Algorithms This talk is about several approaches to implement high performing numerical algorithms and applications in Python. It introduces into approaches like vectorization, multithreading, parallelization (CPU/GPU), dynamic compiling, high throughput IO operations. The approach is a practical one in that every approach is illustrated by specific Python examples. The talk uses, among others, the following libraries: * NumPy * numexpr * IPython.Parallel * Numba * NumbaPro * PyTables

Keywords  EuroPython Conference EP 2014 EuroPython 2014 
00:00
Area
Domain name
Algorithm
Latent heat
Computer animation
Algorithm
Lecture/Conference
3 (number)
Coma Berenices
Correlation and dependence
Right angle
Food energy
00:42
Metropolitan area network
Berlin (carriage)
Web page
Interior (topology)
Maxima and minima
Set (mathematics)
Menu (computing)
Semantics (computer science)
Hand fan
Element (mathematics)
Mathematics
Duality (mathematics)
Word
Data management
Sic
Computer animation
Universe (mathematics)
Selforganization
Right angle
Pattern language
Conditionalaccess module
Wide area network
01:40
Decision theory
Valueadded network
Data management
Gastropod shell
Library (computing)
Computing platform
Surjective function
Area
Addition
Decision theory
Suite (music)
Computer file
Moment (mathematics)
Analytic set
Coma Berenices
Computer programming
CAN bus
Data management
Computer animation
Uniform resource name
Computer cluster
Computing platform
Right angle
Pattern language
Gastropod shell
Laptop
Thomas Bayes
Wide area network
02:34
Axiom of choice
Complex (psychology)
Context awareness
Group action
Thread (computing)
Programmable readonly memory
Covering space
Range (statistics)
Special unitary group
Area
Neuroinformatik
Expected value
Data model
Sign (mathematics)
Roundness (object)
Different (Kate Ryan album)
Singleprecision floatingpoint format
Core dump
Set (mathematics)
Videoconferencing
Row (database)
Social class
Metropolitan area network
Programming paradigm
Constructor (objectoriented programming)
Electronic mailing list
Amsterdam Ordnance Datum
Computer simulation
Mass
Bit
Instance (computer science)
Demoscene
Message passing
Arithmetic mean
Graph coloring
Oval
Uniform resource name
Order (biology)
IRIST
Personal area network
Pattern language
Summierbarkeit
Quicksort
Geometry
Wide area network
Point (geometry)
Module (mathematics)
Algorithm
Network operating system
Maxima and minima
Similarity (geometry)
Data storage device
Mass
Parallel computing
Rule of inference
Valueadded network
Number
Element (mathematics)
Latent heat
Goodness of fit
Performance appraisal
Arithmetic mean
Term (mathematics)
Representation (politics)
Codierung <Programmierung>
Nichtlineares Gleichungssystem
Data structure
Implementation
Conditionalaccess module
Surjective function
Pairwise comparison
Information management
Zoom lens
Standard deviation
Uniqueness quantification
Physical law
Coma Berenices
Core dump
Ultraviolet photoelectron spectroscopy
Line (geometry)
Limit (category theory)
Cartesian coordinate system
Word
Uniform resource locator
Loop (music)
Personal digital assistant
Function (mathematics)
Universe (mathematics)
Iteration
Resolvent formalism
Freezing
Library (computing)
Standard deviation
System call
Code
State of matter
Multiplication sign
View (database)
Decision theory
1 (number)
Icosahedron
Thermische Zustandsgleichung
Variance
Formal language
Derivation (linguistics)
Order (biology)
Mathematics
Bit rate
Semiconductor memory
Cuboid
Process (computing)
Area
Algorithm
Simulation
File format
Data storage device
Physicalism
Parallel port
Range (statistics)
Menu (computing)
Twodimensional space
Statistics
Benchmark
Type theory
Array data structure
Exterior algebra
Computer configuration
Vector space
Computer cluster
Linearization
Compilation album
Right angle
Task (computing)
Resultant
Functional (mathematics)
Implementation
RegulĂ¤rer Ausdruck <Textverarbeitung>
Mapping
Quantum state
Dot product
Artificial neural network
Virtual machine
Emulation
Sequence
Revision control
Funktor
Root
Readonly memory
String (computer science)
Operator (mathematics)
Differential equation
Right angle
Summierbarkeit
Absolute value
Loop (music)
Execution unit
Raw image format
Sine
Forcing (mathematics)
Expression
Numerical analysis
Interactive television
Planning
Ripping
Brownian motion
CAN bus
Plane (geometry)
Computer animation
System on a chip
Calculation
Royal Navy
Synchronization
Object (grammar)
18:48
Group action
System call
Code
View (database)
Multiplication sign
1 (number)
Client (computing)
Function (mathematics)
Parallel port
Special unitary group
Different (Kate Ryan album)
Local ring
Social class
Metropolitan area network
Algorithm
Sampling (statistics)
Parallel port
Bit
Type theory
Exterior algebra
Uniform resource name
Information security
Resultant
Laptop
Wide area network
Point (geometry)
Laptop
Functional (mathematics)
Implementation
Overhead (computing)
Service (economics)
Divisor
Artificial neural network
Virtual machine
Maxima and minima
Denialofservice attack
2 (number)
Profil (magazine)
Term (mathematics)
Summierbarkeit
Pairwise comparison
Execution unit
Multiplication
Percolation
Scaling (geometry)
Projective plane
Code
Core dump
Client (computing)
Total S.A.
Line (geometry)
Cartesian coordinate system
Local Group
Computer animation
Personal digital assistant
Function (mathematics)
Normed vector space
Point cloud
Object (grammar)
21:29
Thread (computing)
Price index
Water vapor
Special unitary group
Computer programming
Expected value
Pointer (computer programming)
Different (Kate Ryan album)
Computer configuration
Object (grammar)
Aerodynamics
Error message
Social class
Metropolitan area network
Arm
Mapping
Valuation (algebra)
Digitizing
Web page
Electronic mailing list
Amsterdam Ordnance Datum
Computer simulation
Parameter (computer programming)
Bit
Maxima and minima
Port scanner
Sequence
Arithmetic mean
Root
Process (computing)
Numeral (linguistics)
Ring (mathematics)
Uniform resource name
Personal area network
Pattern language
Summierbarkeit
Simulation
Programmschleife
Spacetime
Design of experiments
Wide area network
Point (geometry)
Module (mathematics)
Personal identification number
Open source
Maxima and minima
Student's ttest
Plot (narrative)
Valueadded network
Number
Term (mathematics)
Energy level
Data structure
Implementation
Compilation album
Transzendente Zahl
Mobile app
Pairwise comparison
Overhead (computing)
Scaling (geometry)
Information
Lemma (mathematics)
Code
Counting
Coma Berenices
Core dump
Ultraviolet photoelectron spectroscopy
Line (geometry)
Cartesian coordinate system
Limit (category theory)
Compiler
Word
Loop (music)
Integrated development environment
Personal digital assistant
Function (mathematics)
Spherical cap
Statement (computer science)
Numerical analysis
Iteration
Library (computing)
Building
System call
Run time (program lifecycle phase)
JustinTimeCompiler
State of matter
Code
Multiplication sign
Direction (geometry)
Computeraided design
Set (mathematics)
Compiler
Parameter (computer programming)
Mereology
Total S.A.
Thermische Zustandsgleichung
Mathematics
Veryhighbitrate digital subscriber line
Semiconductor memory
Vertex (graph theory)
Endliche Modelltheorie
Series (mathematics)
Pairwise comparison
Point cloud
Area
Algorithm
Simulation
Concentric
Tap (transformer)
Open source
Parallel port
Physicalism
Menu (computing)
Term (mathematics)
Benchmark
Type theory
Compiler
Compilation album
output
Hill differential equation
Energy level
Procedural programming
Parametrische Erregung
Resultant
Laptop
Functional (mathematics)
Implementation
Overhead (computing)
Real number
Artificial neural network
Virtual machine
Trigonometric functions
2 (number)
Revision control
Programmschleife
Causality
Readonly memory
Operator (mathematics)
Software testing
Gamma function
Summierbarkeit
Mathematical optimization
Task (computing)
Module (mathematics)
Rule of inference
Supremum
Context awareness
Execution unit
Raw image format
Online help
Approximation
Performance appraisal
CAN bus
Particle system
Number
Computer animation
System on a chip
Calculation
Royal Navy
Object (grammar)
Routing
Vacuum
33:26
Point (geometry)
Rounding
Slide rule
Presentation of a group
System call
Hecke operator
Divisor
Link (knot theory)
Length
Multiplication sign
Virtual machine
Special unitary group
Graph coloring
Valueadded network
Variance
Number
2 (number)
Twitter
Revision control
Programmschleife
Bit rate
Endliche Modelltheorie
Conditionalaccess module
Summierbarkeit
Error message
Loop (music)
Metropolitan area network
Information management
Algorithm
Simulation
Programming paradigm
Real number
Sound effect
Ultraviolet photoelectron spectroscopy
CAN bus
Array data structure
Arithmetic mean
Word
Process (computing)
Computer animation
Vector space
Personal digital assistant
Website
Object (grammar)
Routing
Form (programming)
Resultant
Wide area network
36:20
JustinTimeCompiler
Code
Multiplication sign
Curve
Set (mathematics)
Coma Berenices
Parallel port
Bound state
Analytic set
Special unitary group
Fluid statics
Array data structure
Bit rate
Semiconductor memory
Befehlsprozessor
Object (grammar)
Moving average
Physical system
Metropolitan area network
Service (economics)
Electric generator
Regulator gene
Menu (computing)
Twitter
Derivation (linguistics)
Message passing
Befehlsprozessor
Workload
Uniform resource name
Order (biology)
Right angle
Modul <Datentyp>
Electric generator
Laptop
Wide area network
Slide rule
Random number
Implementation
Functional (mathematics)
Random number generation
Overhead (computing)
Link (knot theory)
Network operating system
Antimatter
Virtual machine
Maxima and minima
Division (mathematics)
Declarative programming
Valueadded network
2 (number)
Number
Power (physics)
Product (business)
Revision control
Architecture
Natural number
Operator (mathematics)
Computer hardware
Maize
Summierbarkeit
MaĂź <Mathematik>
Operations research
Addition
Pairwise comparison
Raw image format
Scaling (geometry)
Physical law
Memory management
Code
Coma Berenices
Basis <Mathematik>
Grand Unified Theory
Limit (category theory)
Power (physics)
CAN bus
Number
Computer animation
Personal digital assistant
System on a chip
Function (mathematics)
Computer hardware
Object (grammar)
Library (computing)
41:53
Cross section (physics)
Serial port
JustinTimeCompiler
Code
Multiplication sign
System administrator
Client (computing)
Mereology
Semiconductor memory
Different (Kate Ryan album)
Core dump
Square number
Endliche Modelltheorie
Social class
Exception handling
Algorithm
Theory of relativity
Keyboard shortcut
Highlevel programming language
Sampling (statistics)
Parallel port
Bit
Benchmark
Type theory
Arithmetic mean
Process (computing)
Summierbarkeit
Resultant
Point (geometry)
Overhead (computing)
Real number
Virtual machine
Similarity (geometry)
Time series
Machine vision
Goodness of fit
Root
Lecture/Conference
Energy level
Selectivity (electronic)
Mathematical optimization
Task (computing)
Addition
Standard deviation
Multiplication
Scaling (geometry)
Information
Projective plane
Cartesian coordinate system
Subject indexing
Personal digital assistant
Fiber bundle
Object (grammar)
Communications protocol
Routing
Library (computing)
00:00
if
00:15
you what comes to the performance Python talk from America algorithms monitors the Eva
00:20
patient's fallen energy right of the Python once as the name suggests that we may be doing work in financial industry so my examples will be primarily of financial but I think they apply to many many other areas so you not from finance and you won't have trouble translate what i presented a twoyear specific domain area before I come to to talk a few words about
00:44
me and us Thomas and
00:46
funding and former managing partner of the patent once was a lecturer from a finance and so and university and coorganizer of a couple of conferences and organize over a couple of meetups actually all send the wrong triphonic 1 topics I've written a book the finance which become derided this autumn's already out as an evil cult show later on and another book through semantics of Python which will be published by Wiley finance and next year probably apart from finance elected to martial arts actually this is the book and actually today's talk is based on chapter 9 of the pattern findings right book as a set is already out as an early release of the book and the from the words and probably come out at elements in the November is kind of data I'm finished with my editing I hope or right it will it will come along with the principle of final words pretty soon and that's also a the
01:41
course right out now come on 1 side actually which also covers the topic and that the topics are represented a slightly on landbased 1 maybe you wanna have a look when you come from the finance industry I think that the benefits are the highest in this area doing otherwise is at the moment mainly work in what we call the pattern on platform we will provide a based infrastructure for quants working with Python and applying all the the all the
02:13
nice things that presented a should later on maybe with a couple of examples of the different criteria but in the Bay of IPython shell is defined management taking your money needs in addition we also provide all appropriate Terry analytics streets decision and it's analytics on this platform that's but nobody talk what is this
02:35
talk about actually when it comes to performance critical applications to things should always be checked from my point of view of using the right implementation paradigm sometimes this boils down to what is typically called idioms and I'll be using the right performs libraries so many of you might heard prejudice that pythoness slope of course pattern can be slow but I think if you doing it right but Python Python can be pretty fast and I 1 of the major means in this regard outperforms libraries that are available in the Python world on and I can only briefly touch upon all of these are the listed here I think it was yesterday on the talk by from about the price on so about any topic that you see here you can have a complete talks or you could tutorials for daily for week for some so it's a complex topic but my talk is more about showing what can be done the main approach would be to say this is what it was before it was limits law that we applied this and that and after see these improvements we don't go that much behind the scenes we don't do any profiling to in this talk we will see in many many cases when it comes to the merits by phone and these libraries can help in improving performance substantially and candidate for stopping pattern paradigms and performance as a said what I call paradigm here in this context usually is called idioms for example and this is just the function you see here don't try to give detailed just function I will use regularly and have provided to you in then the slice and that you can use it afterward as well justified for compare a little bit more systematically different implementation approaches and compare performance more iteratively but there's nothing special about that that becomes the 1st use case a mathematical expression that doesn't make too much sense with the square root of the absolute value of the transcendental functions and there in a couple of things that are happening they might encounter these or similar expressions in many many areas as amended before and and find the math financing have these of these and 4 6 and you name it and almost any signs as of today you find such a similar numerical expressions we can implement this pretty easily as a Python function see here it's a single equation and translate this mainly in a single line function nothing special about that what we wanna do however is to apply this particular function to a large array to a list object in the 1st case with 5 thousand 500 that was numbers actually and so this is kind of really usually the computational burden comes in when you have a huge dataset and you will apply these expression to the huge datasets is up there the single equation is kind of complex but it's in the and the mass of computations it makes it to be too slow so to start working with regenerate India list object using simply range with 500 thousand number so we then do is to implement another function which uses origin of function f implementing numeric expression but we have a simple looking this is the 1st implementation of 6 that so this is a pretty simple straightforward function where there is a for loop in there we have another list object and just calculate the single the single values and append the results to the other list object and then returns our list with the results a 2nd 1 2nd paradigm idiom if you like is to use a list comprehension actually the same thing is happening as before but it's a much more compact way to write the same thing so we generated in a single line of code list object by iterating over or our list object a and collect results given the values of the function of that it more compact maybe better readable if you applies encoder you might prefer this we can also do it this way flexible I would suggest to do it in that case you see this will be the slowest 1 was and we're working for example with classes objects revalued derivatives and like kind of complex payoffs and so forth and can describe these like in a string format that makes it pretty flexible to provide different payoffs for this class and this is for example 1 area where we use it but typically when we use it as only the ones that we have to evaluate that expression but in this case you might notice that the expression is evaluated per single iteration of the list of each so it's it's and that as you will see this is a very intense compute tens or or interpreted in terms of the way to do it too like every time iterate over the expression to be evaluated this will make it pretty slow as we'll see of course if you're working exercise be used to vectorization of port number what we can do is kind of implement the same things this a this time now on a number like India rate object which is specially designed of course to handle such datasets and such data structures and with a single line of code and using vectorized expressions become accomplish the same thing so now we're working on plane and the array objects and using number universal functions to calculate would be interested similar to the list comprehensions some text but the end we would hope for speed improvement because this is especially designed to exactly do these kind of operations then we can also use dedicated specific library which is called from America expressions he in this case we can provide the whole expression as a string object but in this case by she what happens is that the string object and this expression is compiled only once and then use afterwards and here again we are working on in here a objects on them expression is especially designed to work on a plate and the array objects so in this case also see hopefully some kind of improvement because it's kind of a dedicated specialized library to detect these kind of problems you might have noticed that in the 1st example I've set the number of threats to 1 to have kind of a benchmark ability we only using 1 thread 1 quarter in this case here I increase the number of threats to force so if you have a 4 core machine you would expect kind of improvement but what kinds of rules let us have a look In summary and again we have 6 different paradigms Leighton's use of and in the end this kind of delivering in any case the same result so as is often the case that when you see people coming from other languages coming to Python being new to Python but knowing all the idioms they're using probably those that there used to from other languages like C or C + + you name it and sometimes this can be because for instance that they using maybe the wrong paradigm the wrong medium but that's look will differences are not all comparison function comes into play and we have a clear winner obviously peripheral multithreaded version at 6 was the last 1 using modern threats to evaluating America expression on the array object the greatest singlethreaded 1 which is the 2nd fossils and food 1 is the num highways in the and then the Python ones follow off the so we see eye to this kind of a given the list comprehension for example River 28 times increase in performance using the multithreaded nomics words and as I mentioned already before the A 3 of this was 1 of which uses T a built in evil function of Python to see that we have the a speed of and totally of 900 times so this can vary of course depending on number of friends a user and so forth but the methods and the message I think should be clear that so we haven't Python many many ways to attack the same the very same problem and all the ways will yield the same results but they
10:42
might be considerable performance improvements when going to right foot and so on avoiding pitfalls and especially warning yeah implementations that up see intensive so this is for example where profiling would come into play and other people don't presented here I said my approach is more like this before but we do something you compare and and this is what is afterwards so but profiling of course delivery that you will spend a very time consuming function and most time spent for example with a 3 in this type of some of implementation become just briefly to a rather subtle things uh seeing the American algorithms implemented based on number URI object's speed directly by the use of non applied new was a functions or by using num expression have been the fast uh but this kind of and circumstances I and I encountered that quite a while ago and was 1st like the know what's what was going on in there but later on it became pretty clear what's going on so it's sort of my when worth mentioning depending on the specific algorithm that using the memory layout can play indeed a role when it comes to performance that we start with the 2 because number India rate objects of which we instantiate by providing the D type in this case will 64 and here the order of Morelia comes into play with 2 choices the number the scene for like layout and F 4 4 from like layer in this case you don't see any difference it's nothing special you see just the numbers printed out but don't get confused because this is just perfect representation of what data is stored actually at the moment but if we have a an array like this you can explain what memory layout is all about actually we have to see like is role storage the what you do Odyssey this means that the 1 the tools and the freeze of salt next to each other so this is how I any memories onedimensional thing so we hope that storage given a unique location in memory so we don't have kind of twodimensional ndimensional things we can store to this kind of linear thing so we have to decide how to put multidimensional thing things into memory and this is how is stored when you use the order of C using your and then in this case we have a common by storage which means that the 1 2 3 1 2 3 and the other 1 2 3 is stored next edge of working with such small number and array object's doesn't make that that much of a difference but when you come to larger ones and in particular to asymmetric wants like this 1 we see we have 3 times about 1 . 5 million elements in their then we can expect some differences in performance read associate to different and you're objecting the 1 with the Odyssey of course and other 1 with ever to just compare it and we now start calculating sums with some of this you want to see already kind of the difference when you're calculating the sums over the different axes so numb was aware of the X a list objects and construct like a twodimensional things with this objects there's no random so it's kind of knowledge of you before the axis but in this case we can calculate the sum rowwise or columnwise if if you like and you see this kind of a uge difference yeah like a 50 % difference when it comes to the 2 different axes only performs of calculating the 1 that delivers that kind of 1 . 5 million onedimensional resolve the other 1 returns a result which is only 3 elements in this case but of course the numerical operations are running differently over memory for both annotations you observe the same thing so according to the results here uh going along axes 1 which means the 2nd axis was with your numbering is much much faster than the other way round so if you have these problems and you if implemented might be worth considering does it make sense to have 3 times 1 . 5 million array or 1 . 5 million times 3 right so you will see considered performance improvement going to 1 where you want the other depending on what yeah exactly interests when it comes to calculation on 3rd sums with video for any array objectives to these operations are actually both slower the absolute is slower than the order operations but to see different relative performance so in this case doing this summer according of all along the axis 0 it's the 1st axis is faster relative to the other the same actually holds true but really this is pretty close to the standard deviations and you see that this is also the absolute disadvantage might be due to defective CCD full and the world in an unpleasant more maintained or more important but once they have to for example interact with the force from the world and you like required so to say to work with the for them I'm extends again to answer the question is 3 times 1 . 5 million better or 1 . 5 million 10 3 so you will see certain cases of huge difference but becomes another approach which is usually and I think all the boxes that presented they are like in a certain sense lowhanging fruit this typically not that much involved when it comes to for example at the redesign of the algorithm itself so I don't do any redesigns of algorithms and I'm always sticking to the same problem to the center of annotation then showing what it can do and sometimes you will see that of course is using different libraries you need to rewrite something but it's not about the year for example parallelization of a certain algorithm where represent now it's more like given a certain implementation of an algorithm what can I do with parallel computing action and as I already announced before and used to use these the financial examples and here is a among the color problem which is about the simulation of the principles move differential equation which is kind of a standard Geometric Brownian motion also applied in many many other areas of physics and so forth and when do was kind of simulated small and European culture Dunbar about the details sorry to say this is usually kind of a very compute intensive and algorithms to work with and that might benefit usually from pelletization but 1st implementation of the algorithm but do you know this kind of already a I would say optimized implementation but at least I'm using umpire and user vectorization of courses to be faster than for example the typical Python looking that we have also seen as a as an alternative before I could have done this ultimate Python but this point here on stick with this kind of implementation and see what we can do what we call a lexical decision on the import function here within the function because you use IPython parallel which I will do here where the whole thing would be people then we have to import within the function to get everything to the single word 1st as a benchmark for was a sequential speculation this example is only about like calling for a couple of times the same on the same function and parameterizing the functions by different strike price in this case but again you can replace this with any function your A. of which is similar from Europe specific area it will be doing here's kind of indeed just looking over the phone strikes the interested and collecting the results that we get back from the function nothing special this is simple to collecting results we finished so it's a URI good for 100 of of some calculations we get back the strikes the list of strikes and the results from our function and this case 100 calculations state important forces just results visualized to to get a feel of
18:49
going on strike slowing European culture means the highest like the lower the value the 1 would expect so the function works not growing collection we
19:01
use here and there are many alternatives have seen already in the summary and I know that there will be a couple of talks about alternatives like Python prose use this kind of lowhanging fruit many people are working with IPython notebook these days and others were well integrated so we can just import type of parallel client class object here and instantiated climbed in the back from using for example the output and the potential what should have already kind of a is a local cluster when working really in the cloud or Hankel based services you can you clients the largest ones side for the wearable like 12 notes I present parallel is known to be not that stable when it comes to a thousand nodes for example so doesn't really scale beyond a certain point but still for example people doing research for smaller applications kind of a pretty efficient way but I think here once I have a client given my profile local clusters which samples generated ill buttons view in this case and the coat that 8 to do the same but often going before aristocratic percolation just this just almost the same kind of 2 difference actually worth mentioning in this case I don't directly call the function I'd rather asynchronously my function given the privatization to my view I print the results and have to wait until all the results have been returned otherwise the whole thing the down so the 2nd of only 2 2 lines if you like are attached in the cold and this is not even in the algorithm just collect results so there's not that much overhead given the sequential implementation of you might have had 3 new lines and and OR for new lines and total and 1 line of code has been changed to implement the different EU project in this case and parallel execution but it's surprising is a take 20 seconds on the wall terms of a white 1 we have the that we still have the looking at the world but the total time for execution was 5 seconds the action in this case but because of use multiple multiple cost so it speeds up by a factor here where we are we have started to get back like 11 seconds and little bit yes 1 4 seconds it and appear on this machine at 5 seconds total time but to have from a little bit more edges comparison I come back to the
21:30
performance comparison by again applying my performance comparison function uh but you might have noticed that the implementing this approach leads to different results actually but we don't get that kind of only the number would get back is the whole set of results and amid the data which did this single jobs are returning so for some having a look at me that that he get picked much more information like when it was completed when it was submitted and so forth but maintain a state of course and the result and this can be retrieved whether such a beautiful of this result object is interviewed and in the end I can here when added to collect all the singer results from a parallel applications some of the algorithm and just to compare here the sequential and parallel calculation of course American differences because we're working with numerical algorithm which implements simulations we would expect kind of numerical errors of differences in the you doing the same thing be parallel sequential but what we're most interested in is the performance comparison and for this and we have used a function already and you see you working on a on a machine with 4 courses in that leads to a speedup of roughly to 3 . 1 so of course we have an overhead for distributing jobs and so forth for collecting itself but in the end you will see that applying this approach typically scales almost linearly in the number of course not the number of threats for these kind of operations stomping that much but you would see usually is that almost linear scaling in the number of workers so for example working of another so we use the support of a cost and this like kind of speedups of 7 times point something but again it's not that much overhead involved but we haven't change algorithm at all and but investing maybe an hour for you might say improving numerical computations considerably
23:31
if you only working locally and not interested in like spreading liberalization overhaul classes a whatever that is of course a building market processing modules on again I present various skills small and mediumsized classes but sometimes the simple even to paralyze called on local machines and I mean I don't know the percentage but most machines as of today even a small is no books have multiple courses using 2 called already might lead to significant speedups and in think of a lot of this machine where for each course see also significant improvements and again the fruits all hanging in this case as well so we import processing as M P and example algorithm here is again will become simulation this doesn't do the evaluation but this isn't direction the same thing uh this simulation so so there's not that much of a a difference of different part parametrization here that we applied but in the end is kind of the same quarter algorithm that we use here to compare the performance what this does is kind of gives back simulated paths in our case it will be stock prices but also many many of things in in the real world and physics and so forth a simulated that way the prime motion was invented so to say in the 1st place for describing the movement of particles in water and so I mean it comes from physics but the the finance guys have adopted all the approaches used in physics so we simulating past all the time so to say but we know the yeah it's kind of a the more that's a rigorous comparison on because of what we do is kind of we change the number threats that we use for the multiprocessing implementation ever test series and it's it's implemented on local with 4 cause I 7 and and we use the following parameters 10 thousand possible simulate times of stand and when he was kind of 32 simulations which translates to the number of tasks that have to be accomplished in this case and so as a simple looking all a list object starting from 1 and ending to 8 so we start with a single thread and the threats and seat is not that much of cold involved it's actually pretty compared to the to the IPython parallel example just have to to define pool of processes of pool of workers that we use for implementation and then we met him this is a different approach I must say but here a map all functions to a set of input parameters actually works pretty much the same than the than the map of programming and statement in Python so we met our function to a set of parameters and say well please go ahead and in the end we wait for the finishing and and the time that it takes for the single runs in this case 1 is always a pictures more than a thousand words and this year we stopped for the to processing time approaching almost . 7 seconds and we come down to well something about 1 point of was 0 . 1 5 seconds would assume multithreading doesn't ring that much in this case actually around of 4 5 actually in this particular case 5 with have the lowest execution time would see the and the benefits of pretty high in this case it's again almost scales linearly with the number of courses available not with the number of threads of number of courses available here for all 30 to 1 the concentrations and assisting only mainly 2 lines of code that accomplished the whole trip the but becomes another approach we haven't really touch the cold the implementation just taken in implementation for the last 2 examples and probabilistic but more often than not you want to try 1st to optimize what is actually implement and 1 very efficient approach is that in the compiling there is a way to build a library called the number this is an open source number there optimizing compiler for Python code which is developed and maintained by continuing text and uses the IBM compiler infrastructure and this makes it a couple of the application areas for really efficient yet scary get the collecting of benefits and underlying full set of emotions or now not so that it's almost like sometimes really surprising because you see but of much of another much overhead involved but usually can expect given a certain type of problem and really high speed up 1st introductory example before I come to more realistic realworld examples and the example is only about counting the number of loops but counting in a little bit like complex especially the rift transcendental functions of course I'm here and calculate the algorithm but in the end this nested loop structure doesn't do anything else but counting the number of times it's nothing about that but we know it will be Python level typically is expensive in terms of performance and time spent and we see Europe paramiterized is looking structure with 5 thousand 5 thousand this takes about 10 . 4 seconds to execute in end but looking it shouldn't come as a surprise arms what was there 25 million iterations genders the benchmark against 10 . 4 seconds to remember we cannot course don't not vectorized approach to accomplish the same result as you would make sense to like count only loops but there are a typical American and French algorithms that are based on nested loops that you can easily that he did that using the vectorized non part so this kind of very general and powerful approach but I would see would the negative consequences on in this case again the function is pretty compact in the sense that we just instantiate here our array objective which is symmetric and in this particular case and we just do the calculation just to the summing over all resulting in an for a objective applied before the algorithm and the cosine function and then the summing over results in this case mean is always the same with what's coming up with the 1 but nevertheless is compute intensive and we see this already use execution time is below 1 2nd in this case by using the vectorized approach approximate part wasn't known as many implemented and see where we're doing this kind of feel we like delegating the costly looking at a Python Levitin environment number but those the speed of C code which is don't let this process you see here actually have a speed of more than 10 times better there's 1 instantiating such a huge arrange leads to memory requirements she said we need an array object which in the end consumes 200 megabytes of memory and time is not the kind of a nice feature you have an algorithm which doesn't seem any memory and in this case using on privatization leads to memory going to me about the now think of kind of lot of problems and you would certainly find some lower memory does this efficiently and so this is kind of nice because it's faster but in the end it consumes lots of memory is memory not an issue you might go that route but there is no alternative actually and this is number of the mentioned before and again the overhead is like kind of minimal in this case I just importer library usually provided by a and B and called Ajit function justintime compiler me is not just in time is the compiled runtime is composed call time actually but it's called jitter in this case and I don't do anything with the Python functional so I just let the pattern function as it is the of underscore by and generate a compiled version of it by using numbers so now executing this we see that when I call it for the 1st time it's still not that fast because for the 1st time students compiled it call time there is a kind of a huge overhead involved when I call it for a 2nd time to see this is then ridiculously fast given the Python implementation so here we get 2 spaces so well now we can compare to see implementations optimize the limitation because number uses the BM infrastructure and on PM level they're kind of these all these optimize compilers of compilers optimally to the given problem at hand so this works as well as I would show later on over different example also on the GPU they receive huge improvments and speed up and again I can only stress the point this not that much effort also just a calling of digit to the original Python function and you see kind of you use usage uture speedups given this implementation so it might be worth considering using number when you have a similar problem was on the solution to this and that and so forth and the purely is which comes on top is that the number implementation is as memory the original 1 there's no need to kind of instantiating you may object with 200 megabytes of large and so the beauty of the memory efficiency remains and they get these huge improvements but just 1 compiling it with number the only option price kind of a breakup of the important of America and of the financial world so let's see if it works with that as well but the worry about the details against us like privatization of the model so all we have to do here is a kind of simulated something Gambia take a little bit of use of my option and we do disconsolate kind of 3 steps procedure if you like and 3 steps are like illustrated here can do maybe the small
33:28
again the court is not that important but there are 2 points worth noting that 1 of the the whole based on the high rates if you like a hyphen between but based on the higher rates and working on this buffaloes and the phone number in the array of length and have looking here all my minorities initially we have 3 nested loops to implement this when it when I go on looking at this is not an essay you should do that by no means but I would show the effect of going different routes afterwards on so just remember looping over number in a objects and we have 3 nested loops and but now we should know that looking at Python that there should be costing what is the mean
34:16
costly in this case the execution for a given number of time steps takes 3 points of 0 7 seconds actually this phenomenon comprising algorithm solves the same problem as we have been attacking before them multicolor simulation so we can compare results this year among the color simulation which is usually considered to be the most expensive 1 when it comes to computational policies needed this even faster that case is not that access to a mistake America errors in there but 3 seconds for the binomial surprising models here compared to the 2 milliseconds given all the constellation of before I'm going to see a similar results we get from the 2 American methods but is this is actually the point is just to say what these 2 algorithms solve actually the same problem in the sense the 1st approval again letting go of the number factorization that I said well I don't know touched the algorithms themselves I would say that had she algorithm yeah this just kind of using different mediums using different paradigms to do to implement the same algorithm in Python and you we can make use of course again of number victimization as a tumor is left out it we can do that the number vectorization actually and then what is the from the vectorization of processes and this is again much much faster but we now apply the the ejected from number and get vector compiled a machine called compiled version of Python 1 uh this year we again disputable 3 times compared this more rigorously this year well we have the number words in this 54 times faster in this case and 3 times faster then the number of words let me skip through a couple of slides there's a steady compiling some of the sites as well so at this point before forget if you go to my Twitter feed DY GH I have treated about the links to 2 presentations I should this wonderful SOsub presentation so you can read all the things I to it and I do it right
36:20
now going to twitter . com and this do you want your age and i have
36:29
tweeted links to
36:33
itself and that they would send everything but you find the links to the presentations on on my How much with actually I
36:42
studied by size works similarly here examples we also get kind of future improvements I skip through that are known to have a couple of minutes left for for questions but again if if you do a performance comparison and in this regard for example yeah working with floats so if you have a look at it there's no need to work with floats but still having this had a rigorous performance comparison when you go to the return back to see I have an implementation using 5 songs and another 1 with a number and you in this case there actually pretty similar when it comes to performance the size of the users have to attach to cold and you have to kind of static declarations and so forth but with number sometimes I don't think it always so wrong but sometimes you can get the same speed up so that just using the the justintime compiler approach of actually last August generation of random numbers and you use was the last minutes on that because of this might be useful in many circumstances and is considered kind of a very hard thing to get the Paul to be used included your work when using his not uproar which is a commercial library of antimatter takes this kind of this system or a product a library of number of I will use this kind of the the recently the native libraries that are provided that you don't have in order to generate random numbers but but this is not that much specialties included that we just generate random numbers which are stored in a twodimensional array in that sense that he has a cold for the fully cured function can only use that like a onedimensional array so we have to reshape it afterwards but I mean this is straightforward when doing this kind of compare the performance for the different sizes of erase that we wanna get standard only distributed random numbers back and I skip the 1st slide because they have kind of implemented a rigorous implementation what we see here in this 1 shot and this almost since already that if you just want to generate a few random number so to say but then you see that the CPU might be might be a better 1 because you have overhead involved in a moving data moving code from the CPU to the GPU based on cost but once you reach a certain size of the random number said to see that this if you want so the linear scaling is that when you see the increase in time needed this hardly any increase in the time needed on the on the CUDA device generate a random number it's uh here again the message if you have only a small set of random numbers don't go to to distribute has too much overhead involved remain on the CPU but again if you're working with very large random numbers it's in the largest 1 generating is 400 megabytes in size the random numbers and then you see that of course the the acute crossed the pretty much will it cost of performance of the number in memory version with the CPU based on the basis of the the acute approach in this case so again only a couple of slides coated a single library that you call and get all the benefits from that you see this as a you speed advantage over Q to devise a little number 1 the last thing I just wanna mention is how about Python is a boy when it comes to numerical operations but I hadn't included an abstract Python also pretty good 21 harness the power of 2 days I all hot and use is pretty hard to get to the to the speed limit of the heart where with Python and you're working with an example a object right and biased and just native who say that you can also use your title of nature if forward and a couple of other things but is already built into number can save you raised to this you see is almost the speed of the hardware that is allowed to here writing on the Mac Book of to try later megabytes this is much much faster to save and to load same as it is to generate the memory for the memory generation of this data may about rate with the memory allocation and the regulation of the random numbers takes 5 . 3 seconds but on this machine it only takes 2 seconds to write it in 2 seconds to reach so you see how fast you can be with Python additional kind of performance agreeable this is just like batteries included a Python typically makes it pretty efficient and pretty easy to harness the power of the available hardware as of today and this brings the end and thank you for your attention can we reliably and no also law so it is stated in London the goal of just as the question you know I think you can you can repeat it all you can go on and on
42:00
a so of course it was what's as well as well as a this but again you have parallelism and parallelism that this kind of the most single simple scenario you can think of of wasn't pretty well get it of course it's kind 1 summer protocol the use within we use it actually but of course there are kind of bindings and in many application examples of really good use cases for that the so I faces sample that small question so suppose you're doing you use time series analysis of that's not been discussed but obviously that's something it's kind of hard to do in parallel in the columns of work very nice in parallel and as a result there were various apparel which should just on doing things with over that don't work necessarily like what's the besides compiling where some tricks you may be used to this is a question a curious pleasure and the and the selection of intensive of course 1 thing we elect concerned with all the data and finance and 1 of the major things of the 6 but that didn't get the question in the end models with dozens of so there are algorithms that are that you can really easily report something parallel and there are algorithms where it's not so easy that recognize for certain like very advanced penitentiary model say arena type models and ozone paralyzed that nicely and what which you just find it the problem is hard to parallelize was the best tactic to approach this problem and of course not everything is going well suited to be parallelized there were using for sample long working only with kind of squares the caller when you commit the crosssection you was a 100 thousand has a capitalize into 2 times 50 thousand that this is only the time Susan as do like I have a hundred thousand relations I can implemented on the 1st 50 thousand and 60 thousand is 1 approach to do uh without every algorithms like what's really because you the kind of the whole history or whatever built up so it is the crosssection of the information not to have Europe and produce results that as opposed to the use of this yummy diverted of that is kind of using parallelization for an optimized algorithm is kind of fusion of the with the goal of the world what you would do in this case is it well I don't have yeah within the can be to the easily paralyzed of course you would in any case go for the optimization of the algorithm by using fights on anything but think all the libraries is not only 5 some would have meant is kind of the on for some of you have a look at time t 3 which made this makes heavy use of the unknown justintime compiling kind of your objects of classes like on the fly dynamically optimized problem given at hand using justintime compiler what or call time compiling instead of of a slight difference at this is typically I think the project will take this and what its 1st optimized any given means that we have available only admins but I create not every algorithms that was used to be problem but again if you have 2 time series to analyze you can start to think about that again this is my my point except many similar similar task and this of course trivial case for personalization so if you're starting off with serial Python code things pretty obvious that using these and parallel tools will make it go faster but and I think a lot of people believe that Python is not what you should be using if you want to have efficient harmonization because of the additional overhead because if you're not processing the kill and obviously Python is a highlevel language sum of money benchmarks but just in real world applications like actual applications you everything what you say to people who would be tempted to stick with the 1st class to squeeze out the last bit of performance I do you find the Python is sufficient for his everyday needs please for this nation's so far we haven't ever had gone the route of going to C + + C for all client projects in the north for the things that we've implement so we're using for some processing tool for these politics or library but have simple easy a scaling and parallelization on the machine at least this is typically where things are on on larger machines with multiple cores with the huge memory the sculptor scenario that I have I mean many many I can understand people had a vision the skin like classes and so forth but Of course I need you have this order with by some as kind of the very good example we can say what you using I can decide whether I have 90 % Python % Cecil to say what I have something that looks like 90 % C and a bit of Python you can still notice in between so and this is the beauty of Python that you don't have usually these you know all things you can even go for like after profiling we assume that this is the real bottleneck of the whole thing is still for this year for this reason mentoring are part of a conference meet up in London a guy using well again did something in a similar because we thought this would be I know if this is the right thing but still you have the flexibility and and this if you I thought this is not you on that that integrates pretty well and of course the C and C + was of the 2 worlds apart the index the natively with so whenever I say well I have the support can use this better with C + + why not going to so many people doing that and the financial industry is kind of a standard to do that but we for the things we've been doing I can only send again for all our stuff and for stuff that we've been implementing for all clients we've always stuck to the possible of course using performance libraries and all that I mean under the hood and this comes down to and and other stuff but not on the on the top level implemented it In the beginning the root the