Python Profiling with Intel® VTune™ Amplifier
Formal Metadata
Number of Parts: 160
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifier: 10.5446/33799 (DOI)
Transcript: English(auto-generated)
00:05
I'm Shailen. I am a Technical Consulting Engineer at Intel in the Developer Products Division team, and we are based in Munich, Germany. Today's focus will be performance analysis of Python applications.
00:25
There is no denying that Python is getting a lot of traction, a lot of importance these days. If you look at what our friends at CodeEval have published, indeed, Python has grown in popularity over the last years,
00:43
and in 2016, Python remains the number one most used language. Also, what is more surprising is that Python remains the number one programming language in hiring demand.
01:01
It's a great skill to have in this decade to be proficient in Python. When it comes to performance analysis, there are certain fields that are kind of driving the technologies of the future,
01:20
and technologies that are kind of really important right now. These fields, I would say, would be mathematics and data science. To get my facts straight, to get the numbers correct, I went to Stack Overflow, our favorite website when we have problems. Stack Overflow shows me that indeed Python is the most used language in the fields of mathematics and data science.
01:49
Now, you may think the math doesn't make sense. If you add all the percentages, it doesn't add up to a hundred. Well, that's because out of those approximately 50,000 people who responded to the survey,
02:03
they chose several languages, but most of them chose Python, over 50% of them. So, that's quite impressive. Math and data science, these fields actually drive high-performance computing, or HPC,
02:22
and other fields like artificial intelligence, machine learning, or deep learning. Intel realizes that these fields are going to define the future, and so we have worked really hard to release a distribution of software, which we call the Intel Distribution for Python.
02:43
It comes out of the box with highly optimized sub-libraries to allow you to develop high-performance applications with Python. We made it super easy to use, super easy to install.
03:01
Packages can be easily downloaded through Anaconda, or even Yum, providing the RPMs. Our distribution of Python comes with highly optimized libraries like NumPy, SciPy, Scikit-learn, which actually, at the base, leverages Intel MKL, which is short for Math Kernel Library.
03:27
Now, in itself, MKL, if you've not heard about it, a few words about it, it's made in assembly. It's super optimized. Mathematical routines have been designed to make the most out of the Intel architecture,
03:42
how many cores you have, make use of the latest instruction set architecture, for instance, AVX, AVX 512, whatever you have, out of the box make the most of vectorization, so you don't have to worry about this by using the Intel distribution for Python.
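Since the talk stresses that NumPy in this distribution sits on top of MKL, here is a minimal sketch (an illustration, not from the talk) of how you might check which BLAS backend your NumPy build links against; the exact output format varies by NumPy version:

```python
# Minimal sketch: inspect which BLAS/LAPACK backend this NumPy build uses.
# With the Intel Distribution for Python, the configuration typically
# mentions "mkl"; a stock build may show OpenBLAS or plain LAPACK instead.
import contextlib
import io

import numpy as np

def blas_backend_info():
    """Capture numpy.show_config() output as a string."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()

if __name__ == "__main__":
    info = blas_backend_info()
    print("MKL-backed NumPy:", "mkl" in info.lower())
```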
04:02
Now, performance is really important, so how do we actually measure the performance of a Python application? Enter Intel VTune Amplifier. VTune Amplifier is a code profiler. It is a profiler that lets you know where all the performance problems in your software are.
04:25
It has been developed over many years, over 15 years, and it's still in development. We're getting a lot of improvements every day. Our engineers are working hard, and over the last four years, we have worked on profiling Python.
04:41
And what is great is that it comes with its low overhead sampling technology, which is unrivaled. No other profiler is able to get performance data as good as Intel VTune Amplifier. So there are some techniques for how we are able to get performance data with low overhead. So basically when Big Brother is watching, there is no big impact on the performance of your real application.
05:07
With Intel VTune Amplifier, we are able to get precise line level information.
05:32
Some profilers allow you to do that, but others you may use give you data at the function level. So basically you have to kind of guess where the performance is if you have a big function.
05:43
But with VTune you get right to the source line where there are bottlenecks. Now a bottleneck is basically like the bottle and the neck. This is where the performance is kind of capped. And our goal is to find those errors in your code and optimize on them.
06:01
In order to eventually increase the performance of your application. What is also great is that we can not only analyze the Python performance, but also Cython and, if applicable, any C code that your Python code is calling.
06:22
Essentially you can analyze your whole system and get data about not just the Python binary and the Python files being called, but other modules that can be built in C or C++. So in the coming 10 to 15 minutes, I'll be talking about why Python optimization is important.
06:46
How do we find those bottlenecks? And a very short overview of the various profilers available on the market. And then a very quick demo of what the GUI looks like. What you see in the tool.
07:02
And a few words about mixed-mode profiling. So why do we need Python optimization? Well, there's no denying it. Python is everywhere. Python is being used in a lot of applications that today need a lot of performance. So if you look at web frameworks, Django, TurboGears, Flask.
07:24
So all these require that stuff be done really, really, really fast. There are build systems like SCons, Buildbot. I don't know if you use it in your company, but we use Buildbot, for instance, actually to build the package for Intel VTune Amplifier and other tools across Intel.
07:44
Scientific calculations. There are tools like FreeCAD. It's a 3D modeling software that has large sections built in Python. So these require high performance. There are also other tools; if you know Gentoo Linux, its package manager is made out of Python.
08:02
Games. There are games like Civilization 4, The Sims 4. These are Python-based games. Obviously you want your game to be efficient and run fast, right? So how do we measure the performance? There are a couple of techniques. There is code examination. You can open the editor and check the code.
08:25
This can be very tedious if you don't own the code, you haven't coded it, or if the code is super large, how would you check everything? But that's one way. There is another way, logging. You basically enter pieces of code in your Python script and say,
08:45
OK, print this timestamp here, and then let me know at the end of my function how long that function ran. This is also tedious, manual work. Then there is profiling. Profiling is basically the core of how Intel VTune Amplifier works.
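The logging approach described here can be sketched in a few lines (a generic illustration; `busy_sum` is just a stand-in workload):

```python
# Sketch of the manual "logging" approach: wrap a function with timestamps
# and log how long each call ran. Tedious at scale, which is why a
# profiler is preferable.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def timed(func):
    """Log the wall-clock duration of each call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            logging.info("%s took %.6f s", func.__name__, elapsed)
    return wrapper

@timed
def busy_sum(n):
    return sum(i * i for i in range(n))
```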
09:05
In a sense, what we are going to do is gather metrics from the system as the application is running, and then at the end of the run we are going to analyze all those metrics and make sense out of all the data that we get.
09:20
We are going to focus on CPU hotspot profiling and find places in your code where your code is spending a lot of time on the CPU or wasting a lot of time, or if you have a threaded application where one thread is waiting on a lock and not doing anything, or essentially stalling, finding those issues and removing them is the way to go.
09:48
Now, profiling. There are a couple of types of profiling. There is event-based profiling, which is essentially collecting data
10:02
when certain events happen. For instance, entering a function or exiting a function, or loading a class, unloading a class, so things like that. At those certain events we get performance data. There is also instrumentation, where the target application is modified and basically the application profiles itself.
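A toy version of event-based profiling can be built with Python's own hooks: `sys.setprofile` invokes a callback on every function call event, which illustrates both the idea and why the per-event cost adds up (a sketch of the principle, not how VTune works):

```python
# Toy event-based profiler: count how often each Python function is
# entered. The per-event callback overhead is what makes this approach
# costly on hot code.
import sys
from collections import Counter

call_counts = Counter()

def tracer(frame, event, arg):
    if event == "call":  # fired on entry to every Python function
        call_counts[frame.f_code.co_name] += 1

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

sys.setprofile(tracer)
try:
    fib(10)
finally:
    sys.setprofile(None)

print(call_counts["fib"])  # number of fib() entries observed
```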
10:24
And then there is sampling, statistical profiling. Now, this is how VTune works. VTune is a statistical performance profiler. There are some caveats to bear in mind.
10:40
Obviously, as a statistical method, the larger the data, the larger the time that your application is running, the more accurate it is. So this is why I have underlined approximate there, but I have also put in bold, much less intrusive. So with this statistical method that we employ in order to measure performance of Python applications,
11:02
we are able to get low overhead performance profiles. And the longer your application runs, the better the results. This is a short overview of the various profilers you may have seen, or not.
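The sampling idea described above can be illustrated with a toy Unix-only sketch using interval timers: interrupt the program periodically, record which function is on top of the stack, and accept that the counts are approximate. This shows the principle only; VTune's sampling works at a much lower level:

```python
# Toy statistical (sampling) profiler, Unix-only: a SIGPROF timer fires
# every 10 ms of CPU time and we record the currently executing function.
# Per-sample cost is tiny, which is why sampling profilers have low overhead.
import signal
from collections import Counter

samples = Counter()

def sample_handler(signum, frame):
    # Record the function that was executing when the timer fired.
    samples[frame.f_code.co_name] += 1

def hot_loop():
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

signal.signal(signal.SIGPROF, sample_handler)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)  # sample every 10 ms of CPU time
try:
    result = hot_loop()
finally:
    signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop sampling

print(samples.most_common(3))  # hot_loop should dominate the samples
```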
11:25
There may be others, but these are the most common ones. Intel VTune Amplifier, what is great with it is that it comes with a rich, highly advanced, highly customizable GUI viewer, so you can see quickly and visually where the problems are.
11:43
It works on Linux, Windows, and what is also nice is this line-level profiling, not at the function level, but right at the source line where your problems are. And overhead, very important: in the Python interpreted world, only a 1.1x performance hit,
12:01
and that's a really low number, compared to other line profilers, like line_profiler itself, which has a 10x performance hit. So basically when you use line_profiler, it's unusable. You get bogus data. cProfile gets you data at the function level with a relatively low overhead, but then again, the granularity is very coarse.
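For comparison, cProfile from the standard library gives exactly this function-level view:

```python
# cProfile (standard library) reports time per function -- coarser than
# line-level data, but built in and cheap to use.
import cProfile
import io
import pstats

def slow_part():
    return sum(i * i for i in range(200_000))

def fast_part():
    return sum(range(1_000))

profiler = cProfile.Profile()
profiler.enable()
slow_part()
fast_part()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())  # per-function call counts and cumulative times
```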
12:24
And there are also other Python tools that come bundled in IDEs like Visual Studio. Again, function level, with a worse performance hit.
12:41
Our tool works with basically every Python distribution you may be using, even the Python distribution supplied by Ubuntu or whatever system you're using, or our own, obviously, the Intel Distribution for Python, which is built with ICC. Support for Python 2.7.x and Python 3, and remote collection over SSH,
13:04
so you can be using a Windows machine, and then you can remote profile a Linux machine where your Python code is running, so that's really great. You can attach to a running process if your Python code cannot be stopped. You can just attach to the PID and get performance data.
13:26
And analyzing performance is actually really simple. Three basic steps: create a project in our tool, configure the various settings, run and interpret. Essentially, I did a small test just to show you how it works.
13:46
So I actually have some code in Python. It's doing something very, very simple.
14:01
I will show you this piece of code. I hope it's not too small. Can you see it? Is it good enough? Yeah? Okay, I'll give you a thumbs up, so it's good. So this code is very simple, not a lot of lines of code. That's only one script, but it does some heavy computation.
14:23
So imagine seeing this in some high-performance kernels. What it does is there's a small main script. There are two parts. One is going to use multiprocessing and create two subprocesses.
14:43
And then call multiply, which is essentially going to multiply, as it says, two matrices, A times B, and store it in C. So we're going to create two subprocesses and do this highly, well, quite badly made, three-nested-loop computation here.
15:10
So if you guys do this, don't do it. It's a really bad implementation. And then there is another method, which is out of the box using NumPy.
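The two variants in the demo might look roughly like this (a reconstruction for illustration, not the speaker's actual script):

```python
# Sketch of the demo's two multiply variants: a naive triple-nested-loop
# matrix multiply versus NumPy's BLAS-backed product.
import numpy as np

def multiply_naive(a, b):
    """O(n^3) pure-Python matrix multiply -- the 'bad implementation'."""
    n, m, p = len(a), len(b[0]), len(b)
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(p):
                c[i][j] += a[i][k] * b[k][j]
    return c

def multiply_blas(a, b):
    """Delegate to NumPy (BLAS/MKL underneath): orders of magnitude faster."""
    return np.asarray(a) @ np.asarray(b)
```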
15:21
So this is the BLAS multiply. So basically linear algebra. And then we're going to run the code. I've already run it in my Linux virtual machine. I collected the results in order to save time and opened them in VTune here on Windows. So this is how it looks. I have in my summary page an overview of the time that the application has run.
15:54
There is also the CPU time, which is basically the time summed over CPU cores. Here I see 113 seconds, which looks good because I have a dual-core system.
16:05
And elapsed time: my wall-clock time was 57 seconds, and 57 times 2 is approximately the CPU time. So my code was actually quite parallelized. And you can also see in the CPU usage histogram, my CPU concurrency was 2. And that's great.
16:22
And some collection platform info. But although I spawned two subprocesses because I have a two-core system, that doesn't mean that it was great. Because, you know, it's a three-nested loop. It's not so nice. That's why, in this script, I'm also profiling the performance of this BLAS NumPy code.
16:47
If I go in the bottom up... Oh, actually one more thing. In the top hotspot, it has already listed where you need to spend time to optimize your code. So if I go into the bottom up, it has sorted all the various methods called in your Python script.
17:07
And you can see that multiply, the aggregation of those two multiplies contributed to most of the time. And because I've also collected the call stack, I can go and drill down to how my method was called in my Python script.
17:28
I can double click on it. And it will open the source file and tag at the source line where most time was spent. So this is what I've done. I've double clicked on that call stack line and it has automatically opened the source script.
17:46
I can move that line a bit here. So we can see that most of the time was spent in doing this matrix multiplication.
18:01
26% of CPU usage. And going back to the bottom-up view, we can see the timeline: how active was my CPU over the whole runtime of my application. You can see that for the two subprocesses that the multiprocessing package has created, my CPU was active.
18:25
Both processes were busy computing the matrix multiplication. And then at the end of my stupid multiplication, I had the BLAS one. And this can be seen at the very tiny end here. I can zoom in and filter in by selection.
18:43
This is the zoomed in timeline.
19:07
There is a very tiny little piece on that main thread, which is thread ID 3345. And that was the BLAS version using NumPy. We can even zoom in further.
19:24
Filter in, zoom in and filter in. So what this does is it will take that timeline. I'm zooming in and then filtering on the timeline, telling it: during that time window, show me which methods were being called. So even more control and more power over what you see.
19:42
So I can see that for this little part here, for instance, array matrix product was called. It is a shared object, so NumPy built with C. And the call stack for NumPy.
20:09
Going back to my slides, you are able to also run mixed-mode analysis.
20:31
So basically get performance information about your Python code and also Cython or native code being called in your application, be it C, C++.
20:45
And you get all these, for instance here, shared object. So that's a native library and the other one is Py, so Python script. So as a summary, tuning your application obviously is a good thing.
21:06
Everybody has to do it. There are ways to do it. VTune is a tool for it. Because I've been asked earlier by Thomas, who is sitting in front, maybe that's interesting for you. Intel VTune Amplifier is a commercial product,
21:22
but there are ways also to get it for free. It's for free in the beta program. So if you sign up for the 2018 beta, that comes with more advanced profiling capabilities for VTune. For instance, getting detailed information about threaded applications and also memory consumption.
21:40
It's available in the 2018 beta version. It's for free for testing and evaluation for a long period of time. It's also for free for all people in academics. Students, professors, universities, anybody from academia. It's for free. But for companies that work on real projects
22:03
and generate money, you require a license. Just a small word about it. I'm an engineer. I don't talk about business, but I think that might be relevant for you. You may get more information in two talks conducted by my colleague David Louis. There is infrastructure design patterns with Python on Wednesday,
22:23
but what is more relevant to this talk would be probably the workshop on Thursday, which is all about the hands-on on how to tune your application with our tools. On this, thank you very much for your attention.
22:53
Hi, and thanks for your talk. If I understand well, you can annotate the source of Python and also C
23:02
to see line-by-line the time of execution. Will it be possible also to annotate directly the Cython source and not the C++ or C source that Cython generated? What do you mean by annotate? Also because there is instrumentation,
23:20
but tell me more about annotation in your case. I mean, just as we saw in the diagram that you can see actually the source lines and the time that they took to execute, the cumulative time, this kind of profiling part. If instead of showing the C source that was generated from the Cython, if we can see that directly with the lines of Cython.
23:43
Yeah, actually you will start directly from the line of Cython. Okay.
24:00
Yeah, how does it work? Okay, because the question was asked without the microphone, let me repeat: the application is already running. It has a process ID. How do you actually attach to it? Well, there are mechanisms. So you already know the PID, right? If it's running. Also, if it has a PID, you can also know the name of the application. And then in the GUI,
24:21
you can either provide the name or the process ID, and VTune will attach to it. And one other question. When you have C extension modules, do you also need that module to be compiled with the debug flag so that you can sample from it? Yeah. And if you don't have access to that, like if it's just the binary that came in your distribution?
24:42
Yeah, that's a very good question. Yeah, in this case, you will basically see a func@0xdeadbeef, which is basically hex code for functions whose name is unknown. Our Python binary provided by the distribution
25:00
is built with ICC with a debug flag. So essentially, you can see deep down in Python itself the method names being used. That's for Python. For an external library, obviously you would like to have -g to get the debug information for your code. And your Python distribution comes with the Anaconda distribution
25:21
or you have other channels? This is just one of the ways. You can actually just add the repository and then you can also do yum install. But Anaconda is a preferred way. There is Anaconda, conda and some others.
25:41
Thank you. Hello. So you mentioned vtune is a statistical type profiler and we've seen some results of some of the code that you're running, so the matrix multiplication. I was wondering if the results that we've seen
26:02
are actually the results of running the code maybe like a number of times, 10,000 times, and taking some statistics over those, or was that just a one-off run and we just saw the results of it? Also, that's an excellent question. In this case, it was run once. So this is what you get right away,
26:21
but in order for yourself to confirm that the data that I got actually makes sense and is true, you run it yourself many times. You can have actually another Python script that runs your script many times. And also, our tool, vtune, comes with a command line interface as well, so you can have one line that does the profiling for you,
26:42
saves the results and everything. So you can have your script and automate the running of your program many times and have VTune wrapping your application in its command-line interface. And this is how you can have your own build system or regression testing system and get data.
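Automating repeated collections with the command-line interface could be sketched like this; `amplxe-cl` was the VTune Amplifier CLI name of that generation, and the script path is a placeholder:

```python
# Sketch: build (and optionally run) repeated VTune hotspot collections.
# The subprocess call is commented out so the sketch runs without VTune.
import subprocess  # noqa: F401  (used when the run line is enabled)

def hotspot_commands(script, result_prefix, runs=3):
    """Return one amplxe-cl command line per run, each with its own
    result directory, wrapping `python script`."""
    commands = []
    for i in range(runs):
        commands.append([
            "amplxe-cl", "-collect", "hotspots",
            "-result-dir", f"{result_prefix}_{i:03d}",
            "--", "python", script,
        ])
    return commands

for cmd in hotspot_commands("my_benchmark.py", "r_hotspots", runs=3):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # enable on a machine with VTune installed
```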
27:02
And if that's the case, how do you find the time? Is it quite slow to run this kind of analysis, like running multiple times? This one I didn't get. I was just wondering how much time do you have to spend to run your code 10,000 times
27:21
and draw statistics from it? Do you have any type of metrics on that? Okay. This depends on the resolution of your analysis. So in my case, I did an analysis with a resolution of 10 milliseconds, which is quite big actually. So if you want more data, more resolution,
27:42
you can lower this time. And how many times? Obviously, the lower the sampling interval, the larger the data, the larger potentially the overhead, and the less accurate your result would be. So it's playing around. In general, anything longer than two, three seconds is good enough.
28:10
Hi. A couple of questions. Can you attach the profiler to a running process? And does that process have to be built in a special way for that? So you can do some profiling and production kind of thing.
28:21
I think that question was asked already. So the answer is yes. You can attach to a running process. The second question was, you had an early slide where you showed a presentation of the time taken on a particular line of code. And that line of code had two calls. It was like logging.info brackets template.format.
28:40
So there's two function calls in there. Can you decompose that in the browser to those two function calls and the process time that each one took? Because you're just showing the sum of the total for that line. And when you have these multi... The multi-process. So your question is, you have created two multi-processes? No, no, no. I'm making two function calls in one line,
29:02
two method calls in one line. It's something like logging.info bracket template.format. So you're calling info and format in the one line. Can you decompose that in the browser? Well, in this case, it will aggregate the time and show you on that one source line the whole time for that. I think it's a bad practice to do this for code readability.
29:25
In my opinion. I don't know how you do it. But wait, hold on, hold on. I will add one more thing, by the way. In this case, you will see the source line because we're actually associating time
29:41
to a source line in your source code. But in the bottom-up view, you will see different functions, two functions. But the thing is, when you click on both functions, you will go to the same source line. But you will know the time for each function. Here.
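The readability point can be made concrete: with the calls split across lines, a line-level profile charges formatting and logging to separate lines (`template` and the messages are illustrative):

```python
# Two calls on one line: a line-level profiler aggregates their time.
# Split onto two lines, each call gets its own entry in the profile.
import logging

logging.basicConfig(level=logging.INFO)
template = "processed {} items in {:.2f} s"

def report_one_line(count, seconds):
    logging.info(template.format(count, seconds))  # format() + info() share one line

def report_split(count, seconds):
    message = template.format(count, seconds)  # time for formatting lands here
    logging.info(message)                      # time for logging lands here
```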
30:01
I would like to ask, what interpreter do you use in your distribution? And if you have applied modifications to the interpreter to make it faster. The acoustics are not so great. What I got is that, how is alignment done?
30:24
Memory alignment? No, no, no. What interpreter do you use? And have you made any changes to the interpreter to optimize it? Can anybody rephrase this for me? Please. One second. Oh, okay.
30:41
Thank you, Adam. Well, our interpreter has been made from scratch and compiled with ICC. There were some changes. I don't know in detail what has changed, but there were minor changes in the interpreter. However, all the libraries making use of heavy mathematics,
31:01
these have been redesigned completely, making use of MKL. So this is the benefit we're bringing with our Intel distribution of Python, so that you guys, when you do HPC-based applications made in Python, or machine learning, deep learning, or even using SDKs or frameworks like TensorFlow, Caffe,
31:21
or even the autonomous driving SDK or the computer vision SDK from Intel that leverages the Python distribution, you get the performance out of the box. So you don't have to be a math genius to code properly or a super software engineer with great skills
31:41
in code optimization to create high performance. It's done out of the box for you. Thanks. You're welcome. It may be already lunchtime. Just one thing, if you have really interesting questions that you really want to get answered, our workshop on it, just on this topic,
32:01
could be very useful for you. It's on Thursday. Check it out. I have a question for cluster users, because I see that on my machine it can connect to the process, but if I have a cluster, how can I measure the performance of all the workers' machines, or is it possible?
32:22
Great question. Yes, it is possible. So you're probably using MPI, right? Yeah. No, I'm not using MPI, but I'm using just... Okay, let me take MPI as an example. You have a cluster of several nodes. Your Python code is being run on all of them. You will have VTune Amplifier,
32:40
the sampling driver, on all those nodes. And with MPI's gtool option, for instance, you just pass mpirun, -gtool, amplxe-cl, which is the command-line interface tool, and then your Python script, and it will do the job for you and get you the results.
33:00
It's magic. It's really nice. Very interesting. Thanks. Yeah. Okay, thank you.