Python and PyPy performance (not) for dummies
Formal Metadata

Title: Python and PyPy performance (not) for dummies
Title of Series: EuroPython 2015
Part Number: 161
Number of Parts: 173
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/20085 (DOI)
Production Place: Bilbao, Euskadi, Spain
Transcript: English(auto-generated)
00:00
Okay, so this is Antonio Cuni, and I'm Maciej Fijałkowski, and we're going to talk a bit about Python and PyPy performance today. And who are we? We are PyPy core developers. We also work on a bunch of other projects; maybe the most well-known one is CFFI,
00:21
which is a way to call C, and a bunch of other projects. We also do consulting and run a company, baroque software. So, well, let's start with the usual quote: that premature optimization is the root of all evil, and that you usually end up spending 80% of your time in 20% of your code,
00:45
but it's important to remember that 20% of 1 million lines is still 200,000 lines, so the 20% tends to scale with the size, too, and that can cause trouble if your program is not 10 lines of code.
01:01
We're going to talk a bit about how to identify slow spots and how to optimize them. In the first part, I'll talk about the profiling tools and how you go about finding what's wrong, and in the second part, Antonio will talk about how to address those problems. So yes, the first part is identifying the slow spots.
01:24
What is performance? Like, who ever tried to optimize something? Most people — good. Well, it's a valid question. Usually, it's the time we spend doing task X.
01:43
So the task X might be serving one HTTP request or computing one protein or doing one of those things, but sometimes it's like number of requests per second, sometimes it's the latency, and the interesting question here is
02:01
what sort of statistical property are you actually interested in? Do you care about the average? Do you care about the worst case scenario? Like, I don't know, if you're developing a car brake system, the metric you're optimizing is not the average time it takes to brake. It's the worst case time it takes to brake,
02:20
and this is the metric you're working with. Sometimes you don't care. If you're serving HTTP requests and one in 10,000 requests takes 50 milliseconds more than usual — well, too bad. Somebody has to press Ctrl-R, too bad. So it's important to know what you're trying to measure first
02:42
before actually starting to measure. Once you know that, it's cool to have some means to measure stuff. So benchmarks are very good, and if you don't have benchmarks, you might want to just check stuff in production, how it actually works in the real data,
03:03
and see if Python is your problem or your problem is waiting forever for IO, or the fact that you have 700 microservices, each of them talking HTTP to each other all the time. So it's important to be able to quickly determine whether something that you changed
03:24
actually changed stuff or it didn't. It's the same as debugging. If it takes you a week to go around and try again the next thing, then chances of optimizing anything are really, really low. No, thanks.
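The average-versus-worst-case point can be made concrete with a small sketch (not from the talk — the latency numbers here are made up for illustration): the same data can look fine on average and terrible at the tail, so you must pick the metric before measuring.

```python
# Hypothetical request latencies: 99% fast, 1% very slow.
latencies_ms = [12] * 9900 + [500] * 100

def percentile(data, pct):
    """Return the pct-th percentile of data (simple nearest-rank method)."""
    ordered = sorted(data)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]

average = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(average, p50, p99)  # average and median look fine; the tail does not
```

Optimizing the average here would tell you almost nothing about what those one-in-a-hundred slow requests experience.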
03:49
So we already got to the point where we know Python is our problem. So looking at top or whatever, we see that Python processes just consume CPU time.
04:03
One important thing is that systems these days are way too complicated to guess. You cannot just stare at the code and then rewrite it differently and then hope that, oh, but I know how Python works. Do you really? I don't, for one. And you have to actually measure.
04:21
You have to see, okay, I run this. It gives me five seconds. I run that. It gave me four seconds. This is when, but I actually have a tool to measure that. Again, remember about the bottleneck can be small, but can also be really large and distributed all over the place.
04:40
Profilers. Who used cProfile here? Almost everybody. Who used Plop? No one. So Plop is a tool done by the Dropbox guys, and I think they run it all the time on Dropbox. So the difference is that cProfile is a tool that is event-based.
05:07
So each function is instrumented to report: now I'm entering the function, start a timer; now I'm exiting the function, stop the timer.
05:22
It's a way to do profiling. Yes, you get time spent in functions. You get tracebacks. You get stuff like that. The problem is that it's a relatively high overhead. With cProfile, I think it's like two times, roughly, on CPython. On PyPy, it's worse. And the problem is that the overhead is not evenly spread,
05:41
because overhead is cost per function. That's relatively constant. But then if your function does a lot, then the overhead is small. If your function does little, then the overhead is large. And that's very bad. That's especially bad on PyPy. So it skews the results towards putting more emphasis on small functions than large functions.
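The event-based profiler being described can be shown in a few lines — this is a minimal cProfile run, with a deliberately uneven workload so the cumulative-time sorting has something to show (function names here are just for the example):

```python
# Event-based profiling with cProfile: every function entry/exit is
# instrumented, which is exactly where the per-call overhead comes from.
import cProfile
import io
import pstats

def slow():
    # does a lot of work per call: overhead is relatively small
    return sum(i * i for i in range(100_000))

def fast():
    # does almost nothing per call: overhead dominates
    return 2 + 2

def work():
    for _ in range(10):
        slow()
        fast()

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The report shows per-function call counts and times, which is exactly the information cProfile's instrumentation buys you at the cost of skewing small functions.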
06:08
And vmprof is a tool that me and Anto started working on this year. This year, last year, sometime like that. And it's a statistical profiler. So it will sample your code as it runs and see,
06:23
okay, now I'm in this function. Then wait 30 milliseconds. Where am I now? Wait 30 milliseconds. Where am I now? And try to capture those stacks and give you statistical information. So it won't tell you how many times the function got called, but it will tell you how much time was statistically spent there.
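The sampling idea can be sketched in pure Python — this toy is not how vmprof works internally (vmprof samples the C stack from a signal handler), but it shows the principle: periodically grab the running frame and count where execution landed.

```python
# Toy statistical profiler: a background thread samples the main
# thread's current frame at a fixed interval and tallies function names.
import collections
import sys
import threading
import time

samples = collections.Counter()

def sampler(main_id, interval, stop):
    while not stop.is_set():
        frame = sys._current_frames().get(main_id)
        if frame is not None:
            samples[frame.f_code.co_name] += 1
        time.sleep(interval)

def hot():
    # the function we expect most samples to land in
    total = 0
    for i in range(200_000):
        total += i * i
    return total

stop = threading.Event()
t = threading.Thread(target=sampler, args=(threading.get_ident(), 0.005, stop))
t.start()
deadline = time.time() + 0.5
while time.time() < deadline:
    hot()
stop.set()
t.join()
print(samples.most_common(3))  # hot() should dominate the tally
```

Note what this gives you and what it doesn't, exactly as the talk says: an estimate of where time goes, but no call counts.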
06:45
So vmprof itself is a tool that was inspired by gperftools, which is a similar tool for C. It's based on an interrupt. So it runs, on a Linux machine, about 300 times a second, which is not granular enough if your program runs for 10 milliseconds.
07:08
But if you run a big server that runs for seconds or minutes, you're usually fine. And it samples the C stack, and works on CPython and PyPy, and possibly different virtual machines in the future.
07:21
So the problem, why we didn't just use gperftools, is that the C stack usually does not contain much useful information. If you look at the C stack of CPython anywhere, you would see that 90% of the time is spent in PyEval_EvalFrameEx. Thank you very much.
07:40
That's not a useful piece of information. So we want Python-level functions. The situation is even worse if you have a just-in-time compiler, which Anto will talk about later. But we want to be able to reconstruct the Python stack from the C stack snapshot. Demo.
08:01
I want to know. So we have, say, a small program — let's pick pystone, because everybody loves pystone.
08:23
Where is pystone? It's in Lib/test here. So pystone is a benchmark that was written for Python 1.1 — and who remembers Python 1.1? It was a while ago. I don't even know when. It was translated from C to Python, and thus has a bunch of operations
08:46
in a dated style, like TRUE and FALSE constants written in capitals, for example — it predates booleans in Python. So it's a little dated benchmark, but fine, let's run it. So I would run python -m vmprof.
09:03
It comes as a module. Just run it like this. Well, that was... that probably was not enough. Let's run it slightly longer.
09:21
Let's do 100 passes. More. No, even more. Okay. Now it's running long enough to actually have any useful information. So this will just display statistical information about which function spent how much time.
09:43
So Proc1 was 37%. This is a little useless because we don't see who actually called Proc1, what Proc1 called, and so on. So we have a little web tool for that. I'm running this on my local machine because...
10:02
Well, network here. 8000, I think. So we're gonna wait a second, like rerun the benchmark. URL. Let's hope it renders properly.
10:24
So here we see the main code that called main, which called pystones, which called Proc0, which called Proc1, Func2, and Proc8,
10:42
the ones directly called by Proc0. If we do the same thing on PyPy, then we might want more passes.
11:12
Then it's relatively similar, except here we can see that first of all, the split is slightly different.
11:21
So you can see that PyPy somehow optimize stuff. Remember, those are percentages. So like all of them goes faster, but some of them got much faster than the others. You can see that PyPy is not optimizing equally all the code. And here we might see that it's a number, and that's JIT code, simply. That piece was compiled to assembler and run.
11:42
I'm working actively on taking this information and making sure it ends up in the display here, so you don't see random numbers. It will just say, okay, this was 100% JITted. So this is vmprof right now.
12:00
We fixed a segfault as of yesterday, so maybe you should be slightly careful. But it's good for trying and seeing if it actually works these days. Yes, yes — the runtime overhead of running vmprof is really, really low.
12:22
It's roughly 5% of the time. So we want to — yes, that's what I have on the next slide. We are looking into making it possible to deploy the profiler and run profiling all the time.
12:41
It's not a thing that you can only run benchmarks with, somewhere on an isolated core like cProfile; you can actually deploy it and get the data coming to you. Which is, again, another point: we want to have real-time monitoring of performance, so you can look at the last hour,
13:00
or filtered by request or stuff like this. Something that I should probably mention is that we are trying to run this as a service where you would just take the data, upload it somewhere, and we would do some sort of advanced analysis like what happened. Did the code get compiled correctly?
13:20
Are there unusual patterns to look for? Well, this is work for the future. Multithreading right now is at least not tested, which probably means it doesn't work. But the goal is to make it work. The signal handlers and everything are designed to work with multithreading,
13:42
so we just need to look at how it looks from the Python side. It might accidentally work. Who knows? But it probably doesn't. Yes, so as Maciej said earlier, when you want to optimize something, first
14:05
you have to spot the parts of the code which are slow, and then you have to try to make them fast. Well, there are many ways to make Python code faster, and there is of course no time to explore all of them.
14:20
You can write a C extension, you can use Cython, you can use Numba, and you can rewrite your Python code using tricks, which you can find on the internet —
14:40
most of the web pages you find on the web say that there are these tricks, but most of the time they just don't work. They don't make your code faster. But that's another topic. Or you can use PyPy. PyPy is an alternative implementation — how many of you know what PyPy is? Yes, almost everyone, good. Things changed since ten EuroPythons ago,
15:02
when nobody knew what was PyPy. This is a good thing, I think. And yes, PyPy is a Python implementation with a JIT compiler. I'm going to concentrate on this tool in the next part of the talk, because yes, we wrote it, we are biased,
15:21
and it's an interesting tool. The nice thing about comparing to the other tools I'm talking about is that in theory, PyPy gives you most of the wins for free. You don't have to rewrite your code, you don't have to use another tool, and et cetera.
15:41
You just run your Python code and it goes faster. Currently, we released PyPy 2.6, which has nothing to do with the Python language version of Python, which PyPy 2.6 implements Python 2.7.9.
16:04
There is also a release for Python 3. And if you are interested in knowing more, there are other EuroPython talks during the week. And if you go to speed.pypy.org, you see nice graphs.
16:20
Saying that it's seven times faster than CPython doesn't mean anything. Of course, PyPy is seven times faster than CPython on these benchmarks that we selected. We are not interested in benchmarks in which we are already faster. We try to select benchmarks that show real-world problems and et cetera.
16:40
And on these benchmarks, the average is seven times faster. I said that PyPy contains a JIT, which is the part that makes your code faster. And now I'm briefly going to explain what a JIT is. So suppose you have this piece of Python code,
17:00
which contains function calls and loops and et cetera. And if you interpret your program with CPython or PyPy without the JIT, you see that you will spend some time in the green part, some time in the blue part, a bigger part of your time in the red and orange part and et cetera.
17:25
The idea behind the JIT is that we can optimize the slower spots by compiling them to assembler so that they execute much faster. And then the total time spent in running your program is lower.
17:43
And of course, I cannot go into detail because there is no time — we are a bit in a rush — but how does it work? The key idea is that first, we compile only the spots in which the most time is spent.
18:04
And then how do we make them fast? Well, we do it by specializing code, basically. So if we see that a certain loop or a function is run with integers — like we have an addition with integers — we produce a specialized version of assembler,
18:23
which knows that these variables are integers. And if later, by chance, we see that we have floats or strings or lists or whatever, we produce another specialized version of your Python code, which is fast on these new types.
18:43
And the idea is that you pre-compute as much information as possible during the JIT compilation phase, so that once you have finished, your assembler code does only the interesting things, the ones which are really needed for your code.
19:04
For example, suppose you have this line of code, which is, well, it happens very often in Python programming. Object.foo, it's a method call. And this is a very simplified version of what's happening. So first of all, we look up foo in the dictionary of the object, of the instance.
19:21
Then if it's not found, we look it up in the class. And then if it's not found in the class, we start looking it up in the base class, and the base of the base class, and so on. And finally, when we've found it, we execute it.
19:41
And if you are interpreting, you have to do these steps again and again and again. Suppose you have this object.foo in a loop that you run one million times — you have to do the lookup one million times. And so the idea is that in PyPy, you pre-compute the lookup, so that you know which function code corresponds to foo.
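The lookup chain being described can be written out explicitly. This is a deliberately simplified sketch (real attribute lookup also handles descriptors, `__getattr__`, and more), showing the work an interpreter repeats on every `obj.foo`:

```python
# Simplified sketch of interpreted attribute lookup: instance dict
# first, then each class along the inheritance chain (the MRO).
def lookup(obj, name):
    if name in obj.__dict__:            # instance attribute?
        return obj.__dict__[name]
    for klass in type(obj).__mro__:     # class, base, base-of-base, ...
        if name in klass.__dict__:
            return klass.__dict__[name]
    raise AttributeError(name)

class Base:
    def foo(self):
        return "base foo"

class Child(Base):
    pass

obj = Child()
method = lookup(obj, "foo")             # found two classes up the chain
print(method(obj))                      # prints: base foo
```

Run this a million times in a loop and you repeat the whole walk a million times — which is exactly the cost PyPy's JIT removes by pre-computing the lookup.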
20:05
And so you can jump directly to it. But of course, well, Python is dynamic, so things can change because I could change the class of the object. I could add and remove attributes, either on the object or on the class.
20:21
And I can do all sorts of tricks. And these tricks are done in real-world programs. So the idea is that we compile the code pretending that the object's class is constant and that the class hierarchy is constant. And so we can do inlining, constant propagation, and so on.
20:46
To make sure that our code is still behaving correctly, we insert a guard, which is a quick runtime check that we do, that our assumptions are still true. If the guard fails, then it means that the assembler code we are executing
21:02
is no longer valid or not valid for this case. And so we bail out and we start interpreting things. Yes, it's going to be slower, but it's better to be slower than be incorrect, of course. And eventually, we compile a new version of the assembler for this new assumption and et cetera.
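The specialize-plus-guard idea can be mimicked in a toy sketch (this is only an analogy — the real JIT emits assembler, not Python): we "compile" a fast path for the types we have seen, and the type check plays the role of the guard; when it fails, we fall back and produce a new specialization.

```python
# Toy model of JIT specialization with guards. The cache key is the
# assumption we specialized on (the operand types); a cache miss is
# the guard failing, triggering a new "compilation".
specializations = {}

def specialized_add(x, y):
    key = (type(x), type(y))          # the guarded assumption
    fast_path = specializations.get(key)
    if fast_path is None:             # guard failed / nothing compiled yet
        print("compiling for", key)   # pretend this is expensive JIT work
        fast_path = lambda a, b: a + b
        specializations[key] = fast_path
    return fast_path(x, y)

print(specialized_add(1, 2))          # "compiles" the int/int version
print(specialized_add(3, 4))          # reuses it: no compilation message
print(specialized_add(1.5, 2.5))      # new types: compiles a new version
```

This also illustrates the trade-off mentioned next: every distinct key costs a compilation, so specializing on something too variable means compiling forever.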
21:22
So at the end, we get to a situation in which all the parts of the code which are executed often are going fast, because we have compiled everything. The hard part is: who decides what to specialize on?
21:41
Because for example, as I said, we specialize on the class of the object. But we could specialize on the number of the attributes or if the object starts with O or something like this. Basically, we have to do some heuristics and the PyPy code is written in a way
22:03
which assumes that something is more constant than other things. So we assume that usually classes of objects don't change very often, but the value of the attributes can change. So we promote the class, not the values. And we assume that usually the modules are kind of constant.
22:22
It's not that we add and remove function to the modules at run times. We can do. In that case, yes, we specialize twice or three times. But it's kind of a safe assumption. And sometimes we just have constants in the byte code. So if you write A plus one, one is the constant.
22:42
And then we assume that the constant is constant, yes. And this is usually true. Of course, specializing is a trade-off because if you specialize too much, then we spend most of our time compiling new code and not reusing the one which we already have.
23:00
And so we consume memory and we spend all of the time compiling things. If we don't specialize enough, we produce inefficient code. Because for example, if you don't specialize on the class of the object, we have to do lookup again and again. So yes, it's a trade-off and it's our job to find the best.
23:21
And this brings us to the next point, which is how to write our code in a way that it's friendly to PyPy. Of course, unfortunately, we cannot spend much time on this. It's like a topic which can be one week long, not half an hour. But one point of view is that you should not do the things
23:44
that you have done until now. Usually, to optimize pure Python code without using external tools, you did things like trying to avoid method lookups. So maybe you save the bound method in a variable
24:03
and then call it repeatedly. Yes — this is something that Guido wrote a couple of years ago: be suspicious of function and method calls, because creating a stack frame is expensive.
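The bound-method trick being described looks like this — a classic CPython micro-optimization (timings are machine-dependent, so none are claimed here):

```python
# The old-style trick: hoist the bound method out of the loop so the
# attribute lookup happens once instead of on every iteration.
import timeit

def plain(n):
    result = []
    for i in range(n):
        result.append(i * 2)        # lookup of .append on every iteration
    return result

def hoisted(n):
    result = []
    append = result.append          # bound method cached in a local
    for i in range(n):
        append(i * 2)
    return result

assert plain(1000) == hoisted(1000)
print(timeit.timeit("plain(10_000)", globals=globals(), number=100))
print(timeit.timeit("hoisted(10_000)", globals=globals(), number=100))
```

On CPython the hoisted version can win a little; the talk's point is that on PyPy the JIT does this transformation for you, so the plainer version is both clearer and just as fast.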
24:21
That is completely untrue in PyPy, because functions are inlined. So basically, if you follow this kind of advice, you write worse Python code, because you are trying to optimize it manually. And if you just write nice Python code,
24:41
the PyPy JIT compiler can do it for you. So, a couple of points of general advice on how you should write Python code. Simple is better than complicated, which means that if you write really plain Python code which is self-explanatory and that you can understand well,
25:03
well, probably the JIT compiler can understand it as well. It has a better clue about what's going on, and it has better chances to optimize it. You should avoid doing string concatenation in a loop; this in CPython is usually fast because there is one optimization, which works only
25:22
if your string has a reference count of exactly one. We can't do this optimization. So you should avoid it both on CPython and PyPy. You should try to avoid itertools monsters. Sometimes I see pieces of code which are an itertools call on another iterable,
25:43
on a generator which calls another itertools function, and so on. I have no idea what it's doing — if you do, good for you. But this kind of thing confuses the PyPy JIT. If you just write your nice Python loop, with nested loops, and explain what you are doing,
26:01
there are chances that the JIT can optimize it as well as itertools, or even better. The other usual advice is to write stuff in C. Well, no — this is good for CPython, because pure Python code in CPython is very slow compared to C. But if you write stuff in C,
26:21
then the PyPy JIT cannot know what is happening. So it has to stop optimizing at some point and call into C. If you write everything in pure Python, the PyPy JIT has better knowledge of what's going on and has better chances to optimize your code
26:41
up to the best performance it can. So if you want to interface with external C code, the best thing to do is use CFFI, which works both on CPython and PyPy and is fast and optimized, et cetera. You should avoid C extensions which use the CPython C API. We have this compatibility layer,
27:01
but it's really an emulation of reference counting and other things, and it's very slow on PyPy. Yes, it works; it's useful if you want to try your software, et cetera. But if you're using a C extension built on the CPython C API
27:20
in a part of the code which is important from a performance point of view, it's going to kill all your performance. And then you should try to avoid things which confuse the PyPy JIT in the way I was explaining earlier. For example, we assume that a class is a constant, and that classes are kind of fixed from some point on,
27:41
so you should avoid creating classes at runtime. If you have a function, and inside the function you create a class and then return an object instantiated from this class: yes, this is valid Python, it works on PyPy, but the JIT will specialize on this new class again and again without ever reaching a fixed point.
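To illustrate the pattern being warned about, here is a small sketch (the names are invented for illustration, not code from the talk):

```python
# Anti-pattern: a brand-new class object is created on every call, so
# the JIT has to specialize on a class it will never see again.
def make_point_bad(x, y):
    class Point:  # fresh class per call
        pass
    p = Point()
    p.x, p.y = x, y
    return p

# Fix: define the class once, at import time, and only instantiate it
# inside the function.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def make_point_good(x, y):
    return Point(x, y)
```

Both versions are valid Python and return equivalent objects; the difference is only that the first one hands the JIT a different class object on every call.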
28:04
Not that it's not allowed: you can create new classes, for example, at import time, because you want to do some meta-programming, et cetera; that is perfectly valid. But if you create one million classes, well, the JIT will create one million pieces of assembler, basically. And, for example, this is an example
28:22
of what you should not do to optimize your code. If you try this monster, applying operator.attrgetter('x') and mapping it over the elements of the list, et cetera, it's much, much easier to just write the list comprehension,
28:41
and this is the kind of advice you will find on the web. If you measure it on CPython, it's exactly the same speed, so the advice isn't true even on CPython; and on PyPy, the first one is just a bit slower. So please just write your nice, understandable Python code, and the JIT will remove the overhead for you.
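The comparison being described looks roughly like this (a sketch with made-up data, not the slide's exact code):

```python
import operator

class Item:
    def __init__(self, x):
        self.x = x

items = [Item(i) for i in range(10)]

# The "optimized" version sometimes recommended on the web:
xs_map = list(map(operator.attrgetter('x'), items))

# The plain list comprehension: the same speed on CPython, and the
# form the PyPy JIT handles best.
xs_comp = [item.x for item in items]
```

Both produce the same list; the readable version loses nothing on CPython and wins on PyPy.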
29:03
If you want to know more about PyPy, we are around for the whole week, so just ask us. Tomorrow there is an open space in the A4 room at 18:00, which will probably be a Q&A session,
29:21
so just come and ask. And yes, before the end... yes, right: Maciej wants to show you a better demo of vmprof. Yes, so since we have some time, I did, well, we did an experiment yesterday with a Django example.
29:42
It's a small Django example, and it's obviously rigged to show slow parts. So this is the Django example, and index does some spurious pickle calls,
30:01
and I guess this is the thing I wrote and wanted to see: can I find this stuff using vmprof and make sure things work nicely? So I'm gonna run it first on CPython, and I want to upload the profile to the local host.
30:22
and I'm gonna run manage.py runserver and now because it's Django, it starts multiple processes. If you just run it like this, you end up profiling the watchdog process which does nothing.
30:40
Thank you very much. Now reload. And then I'll disable threading, just in case. Okay, I'm running the server on port 8001. I'll look at it. Port 8001.
31:02
This is okay. Good, fine. So I'm gonna run a simple ab, and I'm not going to listen to you saying that ab is a terrible tool, because I just want to send some requests.
31:21
So it's crunching away. Well, it's CPython, so I can probably just stop it; 4,000 requests is fine because there is no JIT warmup time or anything like that. And then I upload the profiling data. Pretty sure it's trying to load fonts or something.
31:43
Okay, so we have Django doing its job, which is a whole stack of a billion functions: handle, run, call, call, get_response, index. Okay, so I click on index
32:01
and index actually spends only 47% doing the actual job. And as you can see, the pickle dumps call here itself spends like 66% of the total thing. So if you kill that spurious pickle that I put there, it should be like 50% faster.
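The rigged view presumably looked something like this (a reconstruction for illustration; the actual demo code was not shown in detail):

```python
import pickle

def spurious_pickle(n=1000):
    # Pointless repeated pickling: exactly the kind of hotspot that
    # shows up as a wide box in a vmprof flame graph.
    data = {"numbers": list(range(100))}
    for _ in range(n):
        blob = pickle.dumps(data)
    return blob

def index(request=None):
    # The view does a little real work plus the spurious pickling.
    spurious_pickle()
    return "OK"
```

Deleting the spurious_pickle call is the "kill that" fix the speaker is describing.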
32:21
Okay, fine. And now I'm going to run the same thing on PyPy. It starts slow, then warms up, and the requests get faster.
32:41
This is how a JIT works: it takes time to warm up. It really depends on your workload, but usually something that runs for five seconds is fine. So this was like 600 requests a second,
33:04
so about 3,000 requests here was fine. We did measure, yes: this is the longest request, it takes 46 milliseconds, and here the 50th percentile takes below one. So you see, the warm-up is really slow.
33:21
So it's not that bad, but it's still relatively slow. The overhead of the profiling here, we checked the other day: it was like 660 requests a second without the profiler on. So that's, what, below 10% anyway.
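As a quick sanity check on those round numbers from the demo:

```python
# Requests per second measured in the demo (approximate figures).
with_profiler = 600.0
without_profiler = 660.0

# Relative throughput lost to profiling.
overhead = 1.0 - with_profiler / without_profiler
print(f"profiling overhead: {overhead:.1%}")
```

That comes out just above 9%, consistent with the "below 10%" figure; the "like 8%" quoted next is the same ballpark with rounder inputs.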
33:42
It's like 8%. Hmm, this is really not working. If I look here, okay, I can't scroll to the right because it doesn't render.
34:03
Maybe not, I don't know. So what I was trying to show is that here we have the normal Django stack, which is five billion levels deep, and here it goes run, call, call, response, index,
34:23
somewhere, somewhere down there, but index itself spends far less time than on CPython. If you click on index, it's like 32% spent doing index, and there's this one guy,
34:43
which is make_style, and, what's the name of this guy? It's color_style. So that made me wonder, because my little program that returns okay does not do much coloring, as far as I can tell. So we looked at what's going on there,
35:04
in the Django source, in django.core.management.color, and one of them, the guy's name is color_style.
35:21
This guy, make_style. So make_style does something, and as you can see, it defines a class within the function body for no actual reason whatsoever, as far as I can tell, and this thing alone makes this simple benchmark about 30% slower. If you remove this and do that,
35:45
we made Django 30% faster on this absolutely idiotic benchmark. I agree, but still. Pull request pending.
36:08
This browser thing is not really working. Anyway, questions, I suppose.
36:23
Hi, thank you for the presentation. It was very interesting and fun. I have two questions, both about the profiler. The first one: do you plan to support Windows? Because as far as I understood, you use signals, and there are no signals on Windows.
36:41
So yes, we plan to support Windows. We didn't make a precise plan just yet for how to do it, but probably what you can do is run a separate C-level thread and just sample the stack from there. You have to be very, very careful, but maybe it's possible. I don't know; we support Linux right now,
37:02
64-bit Linux. We have plans for more support: OS X is in progress, and for Windows we'll look at how to support it, but I don't know yet. If there's high demand, we'll support it.
37:20
For PyPy itself: PyPy works on Linux, OS X and Windows, on 32-bit and 64-bit Intel, except Windows, where it doesn't work in 64-bit. You can run the 32-bit one.
37:42
You mean for the profiler, or for PyPy itself? So, FreeBSD actually works. It's not officially supported, but it works; people ported it. Generally speaking, the effort is the same as for CPython: most of the stuff you are porting
38:04
is the CPython standard library and all the calls there. So the effort is very, very similar. We use a few extra calls to, say, compile assembler or allocate memory, but as long as you're not targeting an awkward architecture like MIPS or something, it's relatively easy.
38:21
Yeah, the second question, about the profiler. I didn't quite understand: is it possible to do cross-profiling between C and Python, and if so, how do you do that? So what happens is that you capture the entire C stack, and that C stack
38:41
includes special entries for the Python functions. I can show you later, but the idea is that you have all the C calls and then you just throw them away. So keeping the extra C calls present there is very easy. The only problem is that you need your DWARF data to be present; you need to be able to look up the symbols,
39:01
but other than that, the support is already there. Thank you. Hi, is it fully compatible with Python 2.7? Can I just replace CPython with PyPy? So for the most part, yes. The only difference is you might want to look into C extensions.
39:23
Not all C extensions work, but if your code is pure Python, then it generally works. Is there a way to check? For example, I have a Django project; can I see if all requirements are compatible? Well, you create a virtualenv and try to install it. Typically what you need to do for stuff like that is replace, say, the MySQL bindings with MySQL CFFI bindings.
39:44
For most stuff that's popular, there are usually equivalent libraries that do the same thing but use CFFI instead of being a C extension. So I can't replace my CPython directly; I have to do some additional work.
40:00
Depends. If it's a Django project, then you usually need to slightly change your requirements. Okay, thank you. So I have a question. If you want to write a function and you would have multiple ways to implement it and you don't know which one would be better
40:22
and when you compare the times, the times kind of differ from run to run, because the garbage collector is always present. You cannot disable it, right, in PyPy? Yes. Yes, but then you just do more statistics.
40:41
Okay, the average over time. Just running it more times, okay. Yeah. Okay. But the garbage collector in PyPy is already incremental. So if you exclude the JIT warm-up time, which is slow at the beginning, then the GC will be spread almost evenly across your run. Okay, but that is assuming that you do a warm-up,
41:01
but then you will get some spikes in the meantime. Not anymore, we fixed that. Sorry? Not anymore. Oh. We have an incremental garbage collector, so it does a little bit of work at a time. Okay, small spikes. I mean, not that big. Good luck trying to measure them. Okay, thank you. You're running very quickly into the resolution of your clocks.
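The advice in this exchange, repeat the measurement and aggregate rather than trusting one timing, can be sketched with the standard library (the function under test here is a stand-in):

```python
import statistics
import timeit

def candidate():
    # Stand-in for one of the implementations you want to compare.
    return sum(i * i for i in range(1000))

# Repeat the measurement so GC work and clock resolution average out.
samples = timeit.repeat(candidate, number=100, repeat=5)
best = min(samples)                   # least-noisy estimate
typical = statistics.median(samples)  # robust against rare spikes
```

On PyPy you would also discard the first repetitions, since they include JIT warm-up.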
41:24
Below a millisecond, which is where most garbage collection spikes are, the resolution is really bad. Okay, does JITting work in a separate thread? Is it one thread, or how does it work? Or does it just stop and compile?
41:41
There is a talk by Armin about how to remove the global interpreter lock in Python, but right now PyPy, as you download it, will do everything in one thread. Part of the deal is that the JIT is a tracing JIT, so it does some of its work by running your stuff bit by bit.
42:01
So you can't do that in a separate thread, because you're really running the stuff bit by bit. Optimizing and emitting assembler, you could in theory do in a background thread, but we never implemented that. Okay, so the JIT just traces the function, compiles it, and then runs the compiled version? Yes, for the most part. Slightly more complicated than that, but yes.
42:28
More questions? Anyone here? No?
42:53
If I have a project with a Cython extension, would it work with PyPy, or do I need to change it?
43:01
So if you have a new enough Cython, it usually works, but it's slow. The extension will be slow because it goes through the CPython C API. There is some effort to, A, make it faster, and B, maybe make Cython compile stuff to CFFI bindings instead of compiling it to C.
43:22
But that work did not materialize just yet. Yes? I want to add something, because something that I did a couple of days ago was to try to speed up a piece of code on both CPython and PyPy, and what I did is use the pure Python mode of Cython.
43:41
So I have the type declarations in a separate file. On CPython, I compile it with Cython, and it's fast. On PyPy, I just ignore the Cython part, and it's pure Python, and it works fast. So I think this is the best way to use Cython. Of course, it doesn't work if you want to interface with a C library.
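The setup being described looks roughly like this: one plain .py module that PyPy runs as-is, plus an optional companion .pxd file with the type declarations that Cython reads when compiling for CPython. File and function names here are illustrative, and the .pxd fragment is a sketch of Cython's pure Python mode, not the speaker's actual files:

```python
# fastmath.py: plain Python that PyPy's JIT handles directly.
def dot(xs, ys):
    total = 0.0
    for a, b in zip(xs, ys):
        total += a * b
    return total

# fastmath.pxd (companion file, read only when Cython compiles the
# module on CPython; PyPy never sees it), roughly:
#
#   cimport cython
#
#   @cython.locals(total=cython.double, a=cython.double, b=cython.double)
#   cpdef dot(xs, ys)
```

The same source then runs fast on both interpreters: compiled with type information on CPython, JIT-compiled as ordinary Python on PyPy.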
44:01
Then you have to use CFFI, but yes. Did you ever measure the amount of time you spend in PyPy itself? I mean, for example, you have a process,
44:20
and this process spends some time doing the work which needs to be done for the Python code, and you have your own JIT and all the other stuff in PyPy. Did you ever measure these amounts? Yeah, I think you are speaking about the time we spend in the JIT compiler, for example. Yes.
44:41
Yes. Well, how can I go back to vmprof? The browser. Yes, here. Well, in this case it doesn't work, because it shows 100% interpreting, but normally vmprof shows you how much time
45:02
you spent warming up, which means JIT compiling, how much time you spent in the garbage collector, how much time you spent in the interpreter, and the green box is JITed. So I think this is your question?
45:23
Yes, this is wrong. I don't know. I don't know why it didn't detect the JITed code.
45:42
I'm not sure I understand the question. What do you mean, time spent in PyPy? All this time is spent in PyPy. In PyPy, you do your own work: you need, for example, JIT compilation or garbage collection, and then there is user code,
46:04
which is run. Yes, so you mean how much time you spend in the runtime and how much time you spend in user code. Yes, we could measure that. Examples of runtime work are dict lookups. Dict lookups are not JIT compiled, but it's usually pretty useless for users to know that,
46:21
unless you really want to know how much time you spend in big-number calculations or stuff like that. It's usually not very interesting to know how much time you spent in these C functions versus in user code. Do you care whether your code calls a dict lookup and then spends some time in a little helper or not?
46:43
That doesn't matter. Yes, but okay. So you mean the Python overhead compared to what?
47:02
To the same program written in Java? Yes, or maybe in C++. Well, then the answer is that there's no really good answer to that question, because you wouldn't write it the same way. You would use different libraries, you'd do different things. In some places you would not use a list,
47:21
you'd use a dictionary, stuff like this. If you write things exactly the same way, the speed comparison between Python and Java doesn't have a good answer as a question; it really depends on the program a lot. Our JIT compiler is quite good: the best-case scenario is as fast as Java, roughly. But you don't always hit the best-case scenario.
47:41
If you write Python code that looks like this, where the style class was defined inside a function, for example, you wouldn't write that code in Java, right? Because you can't do it. Well, you can write it in Java, but it would not be the same inside.
48:00
After compilation, of course. Yes, so Python lets you write... If you take Java code and write exactly the same code in Python, then PyPy will be roughly at the same speed, with not much difference. If you write code that does tuple of s.upper
48:23
for s in, I don't know, the keyword arguments of the function, .keys(), which you can't really write in Java very easily, then it probably will be slower than the equivalent Java, where you actually have to iterate by hand over them. So it really depends what sort of style you follow.
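The throwaway expression mentioned, spelled out as a made-up illustration:

```python
def upper_keys(**kwargs):
    # One line of idiomatic Python with no equally convenient Java
    # counterpart: iterate over the keyword-argument names.
    return tuple(s.upper() for s in kwargs.keys())
```

Expressiveness like this is exactly what costs more to optimize than a hand-written loop over a fixed array would in Java.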
48:43
Python lets you do things that are expensive and not hard to write, and we are trying to optimize them, but we don't always do as good a job. If you were writing in C, you would never allocate memory all the time yourself, because it's a real pain to deal with: you have to remember where to free it,
49:00
how to exit the function; but in Python it's very easy to allocate memory all the time. I've seen this routinely in comparisons of Python and C: in PyPy they used a list and append, and in C they used a pre-allocated buffer of 1000, because allocating a list that you have to resize is just too hard. But this is apples and oranges, again.
49:23
Well, I hope that answers your question. Yeah, thanks. Any other questions? Thank you for the talk. My question is, will type hints help PyPy?
49:41
No, the short answer is no. The thing is, if you do JIT compilation, you know all the types anyway, because the types are what's actually there, so you know all that already. The longer answer is that type hints are not precise enough for the sort of stuff that PyPy does.
50:01
PyPy would not only specialize on the class name; it would specialize on what shape the class has, for example, what sort of attributes are present on this precise object. And this is something that you can't express with type hints. So type hints are actually not helping at all.
50:20
You can't do the same things as we do in the JIT. We can do whatever type hints allow you to do, and we do that, and we do more. So, I have a question about vmprof. It's very interesting. Are you using it to determine where PyPy is relatively not that much better than CPython?
50:41
You did a lot of switching back and forth between the two cases on your web page. Do you have an integrated view that tells you, in this PyStone benchmark, for example, PyPy is particularly good in Proc1 but relatively bad in Proc3? Do you have that? Do you intend to have that? Because I presume it would help you find test cases
51:02
where PyPy can be improved for a given program. That's a very good point. I didn't think about it before, but it's probably something very, very useful, like to be able to compare not even two interpreters, but also two different libraries, for example, or two different setups for a function,
51:21
and say, okay, if I do this, what actually happens to the profile? That sounds like a good idea. I think I checked about six months ago,
51:42
and PyPy is incompatible with gevent. Has that changed, and is it gonna change? Yeah, I think the new release of gevent will support PyPy. I mean, the trunk version already does.
52:04
Hello, just a quick question about vmprof. Is it possible to customize the sampling rate? So yes, there is an option to customize the sampling rate. The problem is that Linux signals
52:22
won't allow you more than the system clock, which is around 300 hertz. I don't remember exactly, but it's around that number, and you can't go faster without changing the strategy completely to something like threads, which we might need to do for Windows anyway. Does it increase the overhead linearly?
52:43
Yes, obviously it would if you sampled more often. At 300 hertz you have about three milliseconds of time in between, and in this time you're doing your job and also the sampling. Yeah, okay, thanks.
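For reference, the interval implied by that system-clock ceiling:

```python
# Interval between samples at the quoted Linux system-clock rate.
rate_hz = 300.0
interval_ms = 1000.0 / rate_hz  # about 3.3 ms between samples
```

So at the maximum rate, each sample has a few milliseconds of normal work around it, which keeps the overhead modest.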
53:04
Hello, and again, thanks for the talk. I have a question about PyPy. So you showed the example that making a class inside a function is bad. It's obviously pretty bad in CPython as well; it causes some overhead. But my question is, let's say there's some testing library
53:21
that mocks classes. So how do you deal with monkey-patching and stuff, maybe in short? It depends on the setup a lot, but if you just mock the stuff... well, I mean, we are talking about stuff that's called millions of times. If you mock it for each call of the test function,
53:42
how many test functions do you have? Like 500? That's definitely not a problem at all. I mean, those test functions won't be JITted anyway. No, I mean, what you were saying is that your assumption is that the class is a constant. So obviously when I monkey-patch, I change this constant at runtime. Yes, but do you do it a million times, or do you do it a couple of times?
54:01
I'm going to do it several times, here and there. No, that's completely fine. So it's a kind of soft assumption. Yeah, it only matters if you really have this sort of stuff, like in this example, where make_style is called for every request and I'm doing 10,000 of those. Thanks, that clarifies it for me.
54:23
One more note: it's not that the code is incorrect. The Python semantics are always preserved. If you use this kind of too-dynamic style, maybe you get bad performance, but it still works.
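A small sketch of the point in this exchange (invented names, not code from the talk): patching a class a handful of times, the way a mocking library does per test, is harmless; only patching in a hot loop hurts.

```python
class Greeter:
    def greet(self):
        return "hello"

def fake_greet(self):
    return "mocked"

# Swapping a method a few times is fine; the JIT's class-is-constant
# assumption only becomes costly if this happens millions of times.
original = Greeter.greet
Greeter.greet = fake_greet
assert Greeter().greet() == "mocked"
Greeter.greet = original
assert Greeter().greet() == "hello"
```

The semantics are preserved either way; the JIT just invalidates and recompiles the affected code when the assumption is broken.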
54:42
Hi, I was just wondering, is there a linter or something that I can run to tell me about that, that the class here is a bad idea? No. It would be nice to have, but no.
55:04
Sorry? Yes, yes, yes, please do.
55:22
This file is from Django, right? Django, yes. This is from Django. Django has code in it that says, if we're running on Pocket PC, don't do something. People try to run Django on their Pocket PC on Python?
55:47
That file deals with, like, reading the terminal color palette or something. I'm not sure Django should do that at all, so maybe it shouldn't do it on Pocket PC, I don't know. In my opinion, it shouldn't be doing that at all.
56:05
Are you planning to add support for Python 3.4 and 3.5 in PyPy anytime soon? I didn't understand any of the question. Are you planning to have PyPy3 support Python versions 3.4 and 3.5?
56:21
Yes, eventually. The problem is that there are not many PyPy developers working on it, so the development is slow, but yes. More questions?
56:43
No, so thank you very much.