Python Profiling with Intel® VTune™ Amplifier

Video in TIB AV-Portal: Python Profiling with Intel® VTune™ Amplifier

Formal Metadata

Python Profiling with Intel® VTune™ Amplifier
Title of Series
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Python Profiling with Intel® VTune™ Amplifier [EuroPython 2017 - Talk - 2017-07-10 - PythonAnywhere Room] [Rimini, Italy] Python has grown in both significance and popularity in the last years, especially in the field of high performance computing and machine learning. When it comes to performance, there are numerous ways of profiling and measuring code performance—with each analysis tool having its own strengths and weaknesses. In this talk, we will introduce a rich GUI application (Intel® VTune™ Amplifier) which can be used to analyze the runtime performance of one’s Python application, and fully understand where the performance bottlenecks are in one’s code. With this application, one may also analyze the call-stacks and get quick visual clues where one’s Python application is spending time or wasting CPU cycles
I am a technical consulting engineer at Intel, in the Developer Products Division team, and we are based in Munich, Germany. Today's focus will be on performance analysis of Python applications. There is no denying that Python has been getting a lot of traction lately. If you look at what our friends at Stack Overflow have published, Python has indeed grown in popularity over the last years; in 2016 it remained the number one most-used language, and, what is more surprising, Python also remains the number one programming language in hiring demand. So it is a great skill to have in this decade, to be proficient in Python.

When it comes to performance analysis, there are certain fields that are driving the technologies of the future, technologies that are really important right now, and I would say those fields are mathematics and data science. To get my facts straight, to get the numbers correct, I went to Stack Overflow, our favorite website when we have problems, and it shows that Python is indeed the most-used language in the fields of mathematics and data science. You may notice the percentages do not add up to 100: that is because the roughly 50,000 people who responded to the survey could choose several languages, but most of them, over 50 percent, chose Python, which is quite impressive. Mathematics and data science are fields that actually drive high-performance computing, or HPC, and other fields like artificial intelligence, machine learning, and deep learning. Intel realizes that these fields are going to define the future, so we have worked really hard on this.
We released the Intel Distribution for Python, and it comes out of the box with highly optimized libraries, so it is easy to develop high-performance applications with Python. We made it super easy to use and super easy to install: packages can be easily downloaded from Anaconda, among other channels. The Intel Distribution for Python comes with highly optimized libraries like NumPy, SciPy, and scikit-learn, which under the hood leverage Intel MKL, short for Math Kernel Library. A few words about MKL, if you have never heard of it: it is a set of super-optimized mathematical routines that have been designed to make the most out of the Intel architecture. However many cores you have, it will make use of them, and it makes use of the latest instruction set architecture, for instance AVX or AVX-512, whatever you have. Out of the box you get the most out of vectorization, so you do not have to worry about this when using the Intel Distribution for Python. Now, performance is really important, so how do we actually measure the performance of a Python application?
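As a quick sanity check that you are actually getting an MKL-backed build, you can ask NumPy which BLAS/LAPACK backend it was compiled against. The interpretation of the output here is my own note, not from the talk:

```python
import numpy as np

# Show which BLAS/LAPACK backend this NumPy build is linked against.
# An MKL-backed build (e.g. from the Intel Distribution for Python)
# lists MKL libraries here; a stock pip wheel typically shows OpenBLAS.
np.show_config()
```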
Intel VTune Amplifier is a profiler that allows you to find the performance problems in your software. It has been developed over many years, over 15 years, and it is still in active development; we keep getting improvements, our engineers are working on it every day. What is great is that it comes with its own low-overhead sampling technology, which is unrivaled: no other profiler is able to get performance data with overhead as low as VTune Amplifier. There are techniques for how we collect performance data with low overhead, so basically, while big brother is watching, there is no big impact on the performance of your real application.

With VTune Amplifier we are also able to get precise line-level information. Some profilers allow you to do that, but others only give you data at the function level, so you basically have to guess where the problem is if you have a big function. VTune points you right at the source lines where your bottlenecks are. A bottleneck is, like the bottle and the neck, where the performance is capped, and our goal is to find those areas and optimize them, eventually increasing the performance of the application. What is also great is that we can analyze not only the Python performance but also side languages, any native C code that your Python code is calling. Essentially you can analyze your whole system and get data not just about the Python interpreter and the Python code being run, but also about other modules that may be built in C or C++.

In the coming 10 to 15 minutes I will be talking about why Python optimization is important, how we find those bottlenecks, a very short overview of the various profilers available on the market, then a very quick demo of how the GUI looks and what you see in the tool, and a bit about mixed-mode profiling.

So why do we need Python optimization? Well, there is no denying that Python is everywhere. Python is being used in a lot of applications that today need performance. If you look at web frameworks, Django, TurboGears, Flask, all of these require that things be done really, really fast.
There are also build systems like SCons and Buildbot; we use Buildbot at Intel, for instance, to build the packages for VTune Amplifier and other tools. There are scientific calculations, and tools like FreeCAD, a free modeling software that has large sections built in Python; these require high performance too. A lot of Linux tooling is written in Python. There are even games, like Civilization IV, that are partly Python-based, and you want your game to be efficient and run fast, right?

So how do we measure performance? There are a couple of techniques. There is code examination: you open an editor and check the code. This can be very tedious if you do not own the code, or if the code base is super large; how would you check everything? But that is one way. There is another way, logging: you insert pieces of code into your Python script that say, print a timestamp here, and then let me know at the end of my function how long it ran. This is also tedious manual work. And then there is profiling, which is basically how VTune Amplifier works: we gather metrics from the system while the application is running, and at the end we analyze all those metrics and make sense of the data we got. We are going to focus on CPU hotspot profiling: finding places in your code where it is spending a lot of time on the CPU, or wasting time, for instance a threaded application where one thread is waiting on a lock and not doing anything, essentially stalling. Finding those issues and removing them is the way to go. Now, there are a couple of types of profiling.
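The manual logging approach just described, inserting timestamps by hand, can be sketched as a small decorator. The names `timed` and `slow_sum` are my own illustration, not from the talk:

```python
import functools
import time

def timed(func):
    # The manual "logging" technique: record a timestamp before and after
    # the call and report the elapsed time at the end.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result
    return wrapper

@timed
def slow_sum(n):
    return sum(range(n))
```

This works, but as the talk notes, sprinkling such code through a large script quickly becomes tedious, which is what motivates a real profiler.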
There is event-based profiling, which essentially collects data when certain events happen, for instance entering a function, exiting a function, loading a class, things like that; at those events we get performance data. There is also instrumentation, where the target application is modified so that it basically profiles itself. And then there is statistical profiling, and this is how VTune works: VTune is a statistical performance profiler. There are some caveats to bear in mind: obviously, as with any statistical method, the larger the data, the longer the application runs, the more accurate it is. This is why I have written "approximated" on the slide, but I have also put in bold "much less intrusive". With the statistical method we employ to measure the performance of Python applications, we are able to get low-overhead performance profiles, and the longer your application runs, the better the results.
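To make the idea of statistical profiling concrete, here is a toy sampler in pure Python: a background thread periodically inspects the target thread's current frame and counts which function it lands in. This is only my illustration of the principle; VTune's sampling works at a much lower level, with far less overhead:

```python
import collections
import sys
import threading
import time

def _sampler(target_ident, interval, counts, stop):
    # Periodically grab the target thread's current frame and record which
    # function it is executing -- the essence of statistical profiling.
    while not stop.is_set():
        frame = sys._current_frames().get(target_ident)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)

def profile_statistically(func, *args, interval=0.005):
    counts = collections.Counter()
    stop = threading.Event()
    thread = threading.Thread(
        target=_sampler,
        args=(threading.get_ident(), interval, counts, stop),
    )
    thread.start()
    try:
        result = func(*args)
    finally:
        stop.set()
        thread.join()
    return result, counts

def busy_work(n):
    # A deliberately slow pure-Python loop for the sampler to catch.
    total = 0
    for i in range(n):
        total += i * i
    return total
```

Running `profile_statistically(busy_work, 3_000_000)` shows `busy_work` dominating the sample counts, and, as the talk says, the longer the run, the more samples and the more trustworthy the picture.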
This is a short overview of the various profilers you may or may not have seen; there may be others, but these are the most common ones, compared against VTune Amplifier. What is great about VTune is that it comes with a rich, highly customizable GUI, and you can work with it to quickly and visually see where the problems are. What is also nice is the line-level profiling: it points you right at the source line where your problem is. Overhead is very important: in the interpreted world, VTune costs only about a 1.1x performance hit, which is a really low number compared to other line profilers; line_profiler itself has around a 10x performance hit. cProfile is easy to use and widely available, and it gets you data at the function level with a relatively low overhead, but then again its granularity is very coarse. There are also other Python profilers that come bundled in IDEs, again at function level, with their own performance hit.
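For comparison, the standard library's function-level profiler mentioned above can be driven programmatically. This is plain cProfile, not VTune:

```python
import cProfile
import io
import pstats

def work():
    # Something mildly expensive for the profiler to observe.
    return sorted(str(i) for i in range(50_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Print the five most expensive entries by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print(report)
```

The report attributes time to whole functions, which illustrates the granularity point: you learn that `work` is slow, but not which line inside it.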
Our tool works with basically every Python distribution you may be using: the Python distribution supplied by your OS or whatever you have installed, and obviously the Intel Distribution for Python, which is built with ICC, with support for Python 2.7.x and Python 3. There is also remote collection over SSH, so you can be using a Windows machine and remotely profile a Linux machine where your Python code is running, which is really great. You can also attach to a running process: if your Python application cannot be stopped, you can just attach to its PID and get performance data.

Analyzing performance is actually really simple, with three basic steps: create a project in our tool, configure the various settings, then run and interpret the results. I did a small test just to show you how it works.
The Python code I actually wrote is doing something very, very simple; I hope it is not too small to read. The code is short, maybe 20 lines of script, but it does some computation; imagine seeing this in some high-performance kernel. There is a small main script with two parts. One part uses multiprocessing to create two processes and then calls a multiply method, which, as its name says, multiplies two matrices, a times b, and stores the result in c. So we create two processes and do this multiplication in a really badly written way, with three nested loops; if you are doing this at home, don't, it is a really bad implementation. Then there is another method which does the same thing out of the box using NumPy, which is the best way to do it in Python, basically numpy.dot. I ran the code on my Linux machine, collected the results there to save time, and opened them in VTune here on Windows.
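The talk's script itself is not reproduced in the recording, but a minimal reconstruction of the idea might look like this. The function names and the matrix size are my own choices; the fast path in the talk used numpy.dot instead:

```python
import multiprocessing as mp
import random

def naive_matmul(a, b):
    # Deliberately bad: three nested pure-Python loops. Exactly the kind
    # of hotspot that VTune's line-level view pinpoints.
    n, m, p = len(a), len(b[0]), len(b)
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for k in range(p):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c

def worker(size):
    a = [[random.random() for _ in range(size)] for _ in range(size)]
    b = [[random.random() for _ in range(size)] for _ in range(size)]
    naive_matmul(a, b)

if __name__ == "__main__":
    # Two processes multiplying concurrently, as in the demo.
    procs = [mp.Process(target=worker, args=(80,)) for _ in range(2)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
```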
This is how it looks: on the summary page I have an overview of the time the application ran. There is the elapsed time, and there is also the CPU time, which is basically the time summed per CPU core. Here I see 113 seconds, which looks good: it is a dual-core system, the elapsed time was about 57 seconds, and 57 times two is approximately 113, so my code was actually quite parallelized. You can also see in the CPU usage histogram that my CPU concurrency was two, and on a two-core system that is great; both cores were busy across the two processes. That does not mean the code itself was great, because, you know, the three nested loops are not so nice. In the same script I am also measuring the fast NumPy version at the end. One more thing: the top hotspots section orders, right away, where you need to spend time optimizing your code.
If I go into the bottom-up view, it shows all the various methods called in my Python script, and we can see that the aggregation of the two multiply calls contributed most of the time. Because VTune of course also collected the call stack, I can drill down into how my method was called from my Python script: I can double-click on it, and it will open the source file right at the source line where most of the time was spent. So I double-clicked on that call stack line, it automatically opened the source script, and we can see that most of the time was spent doing this matrix multiplication, 26 percent of CPU usage on that line.

Going back to the bottom-up view, we can see the timeline: how active my CPU was over the whole runtime of the application. You can see that for the two processes the multiprocessing package created, the CPU was active in both while they were computing the matrix multiplication, and at the end of my slow multiplication I had the fast one, which can be seen as a very tiny region here. I can zoom in and filter by selection. In this zoomed-in timeline there is a very tiny piece on the main thread, and that was the fast version using NumPy. We can zoom in further, filter in, and zoom again; what this does is take that time range and show me which methods were being called during exactly that slice, so you get even more control over what you see. So for this little part here, for instance, a matrix multiply function was called inside a shared object, a native library coming from NumPy, and you can see the call stack going through NumPy.
Going back to my slides: you are also able to run mixed-mode analyses, so you basically get performance information about your Python code and also about any native libraries being called in your application, written in C or C++. You get all of this; for instance, here one entry is a shared object, a native library, and the other one is Python, your Python script.

To summarize: profiling your application is obviously a good thing that everybody should do, and there are ways to do it. VTune Amplifier is a commercial product, but there are also ways to get it for free. It is free in the beta program: if you sign up for the 2018 beta, it comes with more advanced profiling capabilities that will be tuned further, for instance getting detailed information about threaded applications, and also memory consumption, which is available in the 2018 beta version. It is free for testing and evaluation for a long period of time. It is also free for everyone in academia: students, professors, universities, anybody from academia. Only companies that do commercial work on real projects and generate money require a license. That is just a small word about it; I am an engineer, I do not usually talk about business, but I think it might be relevant for you.

You may get more information in the talk conducted by my colleague David on Wednesday, which is about infrastructure design patterns with Python. But what is probably more relevant to this talk is the workshop on Thursday, which is a hands-on session all about how to tune your application with our tools. Thank you very much for your attention.
Q: Thanks for your talk. If I understand well, you can annotate the Python source and see, line by line, the time of execution. Would it be possible to annotate the Cython source directly, and not the C or C++ source that was generated from it? I mean, just as we saw in the demo, seeing the source lines and the time they took to execute, the cumulative time, but on the Cython lines instead of the generated C source.

A: Yes, you can actually see it directly on the Cython source lines.

Q: And about attaching to a running process: how does it work?

A: The question was asked without a microphone, so to repeat: the application is already running, it has a process ID, so how do we actually attach to it, by what mechanism? You know the PID if it is running, and if it has a PID you can also know the name of the application. In the GUI you can provide either the name or the process ID, and the tool will attach to it.

Q: One other question: if you have C extension modules, do you also need those modules to be compiled with debug symbols so that you can sample them? What if you do not have access to that, if it is just a binary that came from a distribution?

A: That is a very good question. In that case you would basically see frames showing a hex code for functions whose names are unknown. The Python binary provided by the Intel distribution is built with ICC with the debug flag, so essentially you can see the method names being used right away. For an external library you would need it compiled with -g to get the debug information. So the Intel Distribution for Python comes with debug information out of the box, and that is one option.
There are several ways you can install it: you can just add the package repository, or you can do a conda install from Anaconda, which is the preferred way, and there is also a Docker image.

Q: You mentioned VTune is a statistical profiler, and we have seen some results for the matrix multiplication code. I was wondering: are the results we saw from running the code a number of times, say 10,000 times, and taking some averaging, or was that just one run?

A: That is an excellent question. In this case it was run once, so this is what you get right away. But in order to confirm that the data actually makes sense and is true, you can of course run it many times: you can have a shell script that runs the Python script many times. VTune also comes with a command-line interface, so with one command line it does the profiling, the result collection, everything. You can script it and automate running your program many times, with VTune wrapping your application. That is how you can hook it into your own build system or regression-testing system and get data.

Q: If that is the case, is it quite slow to run this kind of analysis multiple times? How much time do you have to spend if you run your code a thousand times? Do you have any metrics?

A: That depends on the resolution of your analysis. In my case I did the analysis with a sampling resolution of 10 ms, which is actually quite coarse. If you want more data, more resolution, you can lower this interval, but the lower the sampling interval, the larger the data, the larger potentially the overhead, and the less accurate your results could be, so it is about playing around with it. In general, anything running longer than two to three seconds is good enough.

Q: Can you attach a profiler to running processes and subprocesses, or do they have to be built in a special way for that? Can I just profile in production?

A: The answer is yes, you can attach to running processes.

Q: Second question: in the demo you had a source line showing the time taken where that line had two calls, something like outer(inner(...)), two function calls in one line. Can it decompose those two function calls in the browser and show the cost of each one? Because you were just showing the sum for that one line.

A: In this case it will aggregate the time and show you, on that one source line, the whole time for that line. I think it is bad practice to write code like that anyway, for readability, but that is my opinion. I would add one more thing: you see the combined time because what you asked is about associating time with a source line in your source code; but in the bottom-up view you will see the two different functions separately, and when you click on either function you will go to the same line, yet you will know the time for each function.
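The command-line automation mentioned in the answer above can be wrapped from Python. Note an assumption here: `amplxe-cl` was the VTune Amplifier CLI name in this era (later renamed `vtune`), and the exact flags should be checked against your installed version; the helper name is my own:

```python
def vtune_hotspots_cmd(script, script_args=()):
    # Assemble an amplxe-cl hotspots collection wrapping a Python script.
    # "--" separates VTune's own options from the target command line.
    # Run the returned list with subprocess.run(...) in an automation loop.
    cmd = ["amplxe-cl", "-collect", "hotspots", "--", "python", script]
    cmd.extend(script_args)
    return cmd
```

For example, `vtune_hotspots_cmd("train.py", ["--iters", "100"])` builds the argv a regression-testing script would invoke repeatedly.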
Q: Which interpreter does your distribution use, and have you changed the interpreter to optimize it, for instance how memory alignment is done?

A: OK, thank you. Our interpreter is built from the standard sources, compiled with ICC. There were some changes; I do not know in detail what has changed, but they were minor changes in the interpreter. However, all the libraries doing heavy mathematics have been redesigned to make use of MKL. This is the benefit of working with our distribution of Python: when you do HPC-style applications with Python, machine learning, deep learning, or when you use frameworks like TensorFlow or Caffe, or the computer vision SDK from Intel that leverages the Python distribution, you get the performance out of the box. You do not have to be a math genius, or a software engineer with great skills in code optimization, to create high-performance code; it is done out of the box.

It may already be lunchtime, but just one thing: if you have really interesting questions that you really want answered, the workshop on Thursday, on exactly this topic, could be very useful for you.

Q: A question about clusters: on my machine I can profile locally, but I have a cluster, and I want to measure the performance over the whole cluster of machines. Is that possible?

A: Yes, it is possible. Are you using MPI? Let me take that as an example: if you have a cluster with several nodes and your Python code is running on all of them, you have the VTune Amplifier sampling driver on all those nodes, and with MPI you just run mpirun with amplxe-cl, the command-line interface of the tool, followed by your Python script, and it will do the job for you and get you the results. It is basically magic. Thank you.