Production-time Profiling for Python
Formal Metadata
Title: Production-time Profiling for Python
Title of Series: FOSDEM 2020 (420 / 490)
Number of Parts: 490
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/47285 (DOI)
Transcript: English (auto-generated)
00:11
All right, next talk. Please welcome Julien, who's gonna tell us about profiling in Python. Thank you.
00:42
Okay, so, sorry. So, today I'm going to talk about production profiling for Python. If you don't know me, I'm Julien. I've worked at Datadog for almost a year now, and I've been doing free software stuff for like 20 years. This is the boring statement: anything that I say today
01:01
might change in the future, but I have to say that to you. So, first I'm going to talk about what profiling is and why you need it. Then I'll talk about how we built at Datadog a profiler for production systems, which we use internally and which we're going to provide to our customers in the future.
01:21
So, the first thing to understand is what profiling is and why you need it. If you've never used profiling, the answer is simple: it tries to solve questions like this. If I were to ask you, in a program, which function or which line of code is going to be the most costly,
01:40
the one where your program is going to spend the most time running that function or that line of code, it would probably be pretty hard for you to guess or to give me an accurate answer. It's really hard to know, when you run programs in production, which endpoint is going to spend a lot of time on the CPU, or which line is going to allocate the most memory
02:03
in your programs. You might guess, but the more complex your application is, the more complicated it's going to be to answer that question. So, the goal of profiling is to actually get inside your program, extract all of this data, and be able to say, okay, this is the stack trace that I see most often,
02:21
this is a stack trace — when I talk about a stack trace, it's all of the functions that are being called one after the other — and being able to say, okay, this is what my program is doing and this is where I need to optimize it to make it faster. Because this is the end goal: to be able to see what your program is doing and to make better decisions about
02:40
where to spend your time to optimize it. This is typically the kind of output that you will see if you do profiling on a Python program, or any kind of program for that matter, where you see stack traces and where the width of the rectangle is the time spent on CPU, or the memory allocated, or anything like that. There are two types of profiling.
03:02
The first one is deterministic profiling: if you've ever used cProfile, the Python profiler, which has been built into Python for a long time, this is what it does. And there's another type of profiling, which is what we're going to talk about today, which is statistical profiling, where you do things in a different way,
03:21
which has its own perks. So deterministic profiling, if you've never used cProfile or don't know how it works, is pretty easy. It actually hooks itself inside Python. It says to CPython: each time you're going to call and execute a function, I want you to tell me, and I will note down the time; and when you finish executing the function,
03:40
you just call me again and I will note down the time, and then I will compute the difference between the two times and I will know that this function has run for that many seconds. So that's pretty cool. It's pretty easy to understand, but it has a few issues. One is that you only know the wall time. That means you know how many seconds your function
04:02
has taken, but you don't know what it did. Like, was it waiting for IO? Was it doing CPU stuff? Was it allocating memory? You have no clue. You just know that it spent two seconds executing the function. The way it works is that it hooks itself on functions, which is the granularity that you will get from the output.
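For illustration, here is a minimal pure-Python sketch of that hooking approach, built on sys.setprofile; it is not how cProfile is actually implemented (cProfile does this in C), just the idea of noting the time on every call and return.

```python
import sys
import time
from collections import defaultdict

# Minimal sketch of deterministic profiling: hook every function call and
# return, note the time on each event, and accumulate the difference per
# function. cProfile does the same kind of hooking in C; this is only
# meant to illustrate the idea.
totals = defaultdict(float)   # (filename, function name) -> seconds
starts = {}                   # frame id -> start time

def hook(frame, event, arg):
    if event == "call":
        starts[id(frame)] = time.monotonic()
    elif event == "return":
        start = starts.pop(id(frame), None)
        if start is not None:
            key = (frame.f_code.co_filename, frame.f_code.co_name)
            totals[key] += time.monotonic() - start

sys.setprofile(hook)
# ... run the code you want to profile ...
sys.setprofile(None)

for (filename, name), seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{seconds:8.4f}s  {name}  ({filename})")
```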
04:21
So if your program is just one giant function, you will have nothing interesting in terms of information, and if your program is made of lots of fast or very small functions calling each other, it's going to have so much overhead that your program is going to be 10 or 20 times slower when it runs. So it's pretty hard to use cProfile in a production system for that reason,
04:42
because the more functions you have in your program, the slower it's going to be. cProfile in particular has one problem, which in our case was an issue: it exports its data in a custom file format, something like pstats if I remember correctly, which is not something that is really easy to use
05:01
in any other programming language. So for all of those reasons, it's impossible to use something like deterministic profiling and cProfile in a production system. You can't afford to have your web server run 20 times slower just because you're doing profiling. The other way of doing profiling is different: you are going to run your program normally,
05:21
you're not going to change anything, but every now and then, like every 10 milliseconds or so, you're going to interrupt your program and look at what it's doing right now. If you do that often enough, you're going to have, statistically, a good representation of what your program is usually doing. So it's not an absolute truth,
05:40
contrary to something like cProfile, which really meters every call. In that case, statistically, over a few hours or a few days of your program running, you're going to have a number of samples that shows you how many times you see each stack trace, and you're going to get the kind of profile that I showed you earlier, which is roughly statistically accurate.
06:04
The upside of doing that is that you can actually measure a lot of different things like the wall time, but also the CPU time, which tells you if you spend time just doing IO or actually using the CPU and doing things like computing results. It has a very low overhead. The way it works is that it leaves your program
06:21
doing what it does usually, and does not try to wrap the functions or interrupt them or whatever. It just now and then executes something in the background, saying, okay, what is it doing right now? And that's it. You are able, thanks to Python, to at least go down to the line of code that is being executed, so you're not just talking about functions,
06:41
but you're able to see which line of code in particular is being executed. In some languages, not Python unfortunately, you can go down to the opcode, the precise operation in the line; that's not possible in Python yet, unless you tweak your code a lot. And you can get other things, I'll talk about that later,
07:00
like seeing the exceptions raised in your program while they're being handled, so you're able to see them and to see the control flow of your program. So what I'm going to talk about next is what we did at Datadog implementing this. Before doing that, I spent a lot of time checking the state of the art of the dozens of Python profilers out there.
07:21
None of them does that. A lot of them actually focus on their output, doing fancy flame charts or things like that, which is not something I had to solve. I was only interested in getting the data and exporting it into a proper data format. So that's what I'm going to talk about now. The first thing I had to solve was: how do you get the CPU and wall time information?
07:44
The way I solved that is with a thread: you run a parallel thread in your Python program, but it sleeps most of the time, and, let's say every 10 milliseconds, you wake up and use sys._current_frames(), which is like a private function
08:00
that has been there forever and that everybody uses, so it's not really private. What this function returns is a dictionary where the key is a thread and the value is a stack trace, basically. So if you do that every 10 milliseconds, regularly, each time you're able to see which stack is being executed on each thread,
08:21
and you're able to construct a model where you know in real time what your program is doing every 10 milliseconds. When you do that, you can also use — and this part is Linux-specific — a function called pthread_getcpuclockid(), which basically returns you a special clock
08:41
in Linux which measures the CPU time that each thread is consuming. So every 10 milliseconds, you're able to get a stack trace and actually associate that stack trace with the number of nanoseconds or milliseconds of CPU that has been used by the thread. So at the end, you get a pretty good representation of each of your threads: what they do and how much CPU they used.
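As a rough sketch of the sampling loop described here (using the documented-but-private sys._current_frames(), and time.pthread_getcpuclockid(), which newer Python versions expose on Unix; the 10 ms interval and the aggregation below are illustrative, not the actual Datadog implementation):

```python
import collections
import sys
import threading
import time

INTERVAL = 0.01  # sample every 10 ms

wall_samples = collections.Counter()  # stack trace -> times seen
cpu_ns = collections.Counter()        # stack trace -> CPU nanoseconds
_last_cpu = {}                        # thread id -> previous CPU reading

def _stack(frame):
    # Walk the frame chain and keep (function name, line number) pairs.
    out = []
    while frame is not None:
        out.append((frame.f_code.co_name, frame.f_lineno))
        frame = frame.f_back
    return tuple(out)

def sampler():
    me = threading.get_ident()
    while True:
        for thread_id, frame in sys._current_frames().items():
            if thread_id == me:
                continue  # don't profile the sampling thread itself
            stack = _stack(frame)
            wall_samples[stack] += 1
            # Per-thread CPU clock (Unix only). The talk does this bit in C;
            # time.pthread_getcpuclockid() is available since Python 3.7.
            try:
                clock = time.pthread_getcpuclockid(thread_id)
                now = time.clock_gettime_ns(clock)
            except OSError:
                continue  # the thread may have exited in the meantime
            cpu_ns[stack] += now - _last_cpu.get(thread_id, now)
            _last_cpu[thread_id] = now
        time.sleep(INTERVAL)

threading.Thread(target=sampler, daemon=True).start()
# ... run your workload; wall_samples and cpu_ns accumulate the profile ...
```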
09:02
You can also get the wall time, because you can use a regular clock. So you get this kind of output: this is an example of a Python program being profiled for one minute with four threads, which is why you see four minutes in the root span at the top.
09:22
There are four minutes of wall time in total because you've got four threads running for one minute, and you have the detail of which function has been called in which thread. You have the same data for the CPU time, which is way less: only three or five seconds of CPU time being used. And the output looks a bit the same, you see the same stack traces,
09:40
but you see that a lot of the functions are wait functions that are just doing nothing. So if you were using something like cProfile, you would see a lot of time being spent in the wait function, which is not very interesting because it's just a wait function. Whereas if you take a look at the CPU time, using the data you got from pthread_getcpuclockid(), you actually get the number of nanoseconds
10:02
being used on the CPU, and you can see here, for example, that the wait functions don't show up in the chart below. So as I was saying, it's a much better way to do profiling in a production system because it has a very low overhead. For example, on my laptop,
10:20
if I profile a Python program with 10 threads, each one executing stack traces 30 functions deep, I can poll and generate 100 samples per second for only 1% of CPU. So that's already a pretty good thing to get,
10:41
and it can actually get a bit cheaper because of the way it works: you sleep every 10 milliseconds, but you can increase that amount of sleeping. So if you want to have less impact on your performance, you can dynamically adapt your sleeping time. Like I was saying, you're also able to get — and you need to do a bit of C for that, because it's not accessible from Python —
11:02
the exception information while profiling. So when you use a block like try/except, and the exception propagates back through a few functions before being handled, you can see that; in Python, it's the equivalent of sys.exc_info(). You can catch that and see the control flow of the program.
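For the current thread, the Python-level equivalent looks like this; grabbing the currently-handled exception of other threads, as the profiler does, is what needs the bit of C mentioned above.

```python
import sys

# sys.exc_info() returns the exception currently being handled in this
# thread, together with its traceback, which gives you the stack it
# propagated through. A sampling profiler wants this for every thread,
# which is why a bit of C is needed there.
try:
    {}["missing"]
except KeyError:
    exc_type, exc_value, tb = sys.exc_info()
    print(exc_type.__name__, exc_value, tb.tb_frame.f_code.co_name)
```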
11:22
And you can limit the resources: as I was saying, you can basically sleep more if you need to. Right now, the only downside is that we don't do C profiling. So if you have a Cython module or a C module, you are not able to see any information below the Python stack. That's something I'd like to solve in the next release.
11:43
We also do memory profiling. Memory profiling is way easier, because I'd say most of the work has already been done in Python. There's a Python module since 3.4 called tracemalloc. What tracemalloc does is replace
12:00
the malloc function used by Python to allocate the memory for all the objects that you use in Python with its own wrapper, which allocates the memory but also keeps a trace of the allocation. And tracemalloc is able to export the information about which line of code allocated memory, how many megabytes,
12:23
and how many times it has been called. So it's pretty easy to get this kind of information without doing too much work. The problem with tracemalloc is that it wraps and counts every object allocation that you do, which makes your program very slow again. So that's not something you can do on a production system: if you have a program
12:43
that does a lot of allocations and it gets four or 10 times slower just because you are tracing the allocations, it's a problem. So the trick here is to also do sampling and statistical aggregation of the results. The algorithm is pretty simple: you sleep most of the time, and once in a while you just wake up,
13:01
and activate or deactivate tracemalloc to sample memory allocations or not. If you do that like 5% of the time, it's usually okay. You only get 5% of the allocations being tracked by tracemalloc, but over time, if you do that for a few minutes or a few hours, you get a pretty good representation of the allocation patterns in your application.
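A minimal sketch of that duty-cycle trick with tracemalloc follows; the capture window, period, and per-line aggregation are illustrative values, not the ones used by the Datadog profiler.

```python
import time
import tracemalloc
from collections import Counter

# Sketch of the duty-cycle trick: keep tracemalloc enabled only a small
# fraction of the time (roughly 5% here) and aggregate per-line allocation
# statistics over many short capture windows.
allocated_bytes = Counter()

def sample_allocations(capture=0.05, period=1.0, rounds=60):
    for _ in range(rounds):
        tracemalloc.start()
        time.sleep(capture)                    # ~5% of each period is traced
        snapshot = tracemalloc.take_snapshot()
        tracemalloc.stop()
        for stat in snapshot.statistics("lineno"):
            frame = stat.traceback[0]
            allocated_bytes[(frame.filename, frame.lineno)] += stat.size
        time.sleep(period - capture)

# Run alongside your workload, e.g. in a background thread, then inspect:
# sample_allocations()
for (filename, lineno), size in allocated_bytes.most_common(10):
    print(f"{size / 1024:10.1f} KiB  {filename}:{lineno}")
```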
13:22
So it gives you two results, as I was saying: the number of allocations that each of your functions does, and their size. The two charts look a bit the same, but the width is going to change based on whether the allocations are for bigger objects or not. This is the kind of result that you might get at the end, with stack traces
13:41
and the number of megabytes that you allocated. So it's pretty easy, again, to debug your program and see what's costly in terms of memory. Note that this only counts the allocations that you do, not the memory you actually use: it's only the number of times you allocate memory. It doesn't track when you free it or whether you still use the object.
14:03
So right now, the limitation of tracemalloc, like I was saying, is the overhead: it's really, really slow. I'm not sure right now whether you can improve it and make it faster. There's no thread information either; the way it tracks things is pretty simple, and you don't have any context about which thread
14:21
did the allocation, so you can't map that to any thread. And it's also limited in the information it reports. For the CPU and wall time before, it's pretty easy to get the function name because it's reported in the stack. Here, tracemalloc just gives you the line number and the file, so you don't know which function or which class
14:41
you are actually allocating in, or even which type of object you are allocating; it's pretty light in that regard. So that's something I'd love to improve in the future. Another thing where we do a bit of profiling is threading. Usually, when you do multi-threading in Python —
15:01
well, a lot of people don't do multi-threading because of the global lock and how it makes things slow, et cetera. But you still need threads running sometimes. When you write an application with a lot of threads, you need to synchronize on something, and usually you use a lock, which is the most basic object you can get in the threading world in Python.
15:23
And with that lock, when you have a lot of threads trying to acquire it, you might have contention, and it might make your program slow. There's nothing in Python to give you that kind of information: is my thread blocking because it's trying to acquire a lock or not?
15:41
Is it waiting too long? Is it waiting forever? What is my contention issue here? It's pretty hard to know. So the way we solve that is that we basically wrap the threading.Lock instances, so each time you create a new lock, you actually get a lock which is not a real lock. I mean, it's a real lock, but under the hood we intercept it
16:01
to see when your program is going to acquire it and when it's going to release it. That makes it possible to actually get information about how much time the threads spent trying to acquire the lock, and how much time they spent holding it, like the lock being held by one of the threads. So you get a lot of information. For example, you get the total lock acquire wait time.
16:23
This is the number of milliseconds, usually not that long, that your program is going to wait to acquire that lock. With this kind of information, you are able to see that you may need more locks, or to split your program into different parts, to not use only one lock but several locks, et cetera.
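A sketch of the wrapping idea: a lock-like object that records how long callers wait to acquire it and how long it is held. A real profiler would patch threading.Lock itself and aggregate these numbers per thread and per stack trace; this just keeps two running totals.

```python
import threading
import time

class ProfiledLock:
    """Drop-in stand-in for threading.Lock that measures wait and hold time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.wait_time = 0.0   # total seconds spent waiting to acquire
        self.held_time = 0.0   # total seconds the lock was held
        self._acquired_at = None

    def acquire(self, *args, **kwargs):
        start = time.monotonic()
        acquired = self._lock.acquire(*args, **kwargs)
        now = time.monotonic()
        self.wait_time += now - start          # contention shows up here
        if acquired:
            self._acquired_at = now
        return acquired

    def release(self):
        if self._acquired_at is not None:
            self.held_time += time.monotonic() - self._acquired_at
            self._acquired_at = None
        self._lock.release()

    __enter__ = acquire

    def __exit__(self, *exc):
        self.release()
```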
16:41
You can get all of this information to be able to redesign your program, which is not always easy or possible, but sometimes it is, and you get all of that broken down by the different threads that try to acquire the different locks. So once we built all of that, we had to export all the data to use it,
17:01
which is its own challenge. Like I was saying, cProfile has its own custom format; it's not very interesting, not useful for us. There is one format which is pretty widely used, which is the Callgrind format. It comes from the Valgrind program, which does profiling for C.
17:21
It has its own text-based format that you can use to export data. The problem is that it's text, basically, so it's easy to consume, but it's not really efficient in any way. So that was not a good solution for us.
17:40
Like I was saying, a lot of Python profilers directly export HTML, so that never gave us any clue on how to export to something that you can use with any other program. And finally, pprof: if you don't know pprof, it's a lot of things. It's the name of the profiler in the Go programming language,
18:02
it's also the name of the tool they use to display the data, and it's the name of the format, so it's the same word for several different things. The pprof format is pretty cool because that's what you use if you do Go and Go profiling: you can export all of your data in pprof format. It's a format based on protobuf, which makes it pretty easy to understand;
18:23
there's a schema with all the fields and their types, et cetera. It's easy to use, and protobuf is easy to use in Python too. It's very fast to serialize; it would be a shame to be fast gathering all the data about your Python program and then very slow exporting it,
18:40
so that's not a problem here. The way the pprof format works, it's not going to export everything that you gather, like every sample, every timestamp, et cetera. It's going to aggregate all of that: you compute the sum, the average, the minimum, whatever you need, and export all of that with the stack traces in the format at the end. And it's pretty space efficient,
19:01
because it uses a pretty simple model of a string pool for all your strings — you usually have a lot of repetition: the functions you're going to see are always the same, the stack traces are most of the time the same — so all of that is pretty well organized in the schema, and it's compressed by default, so it's pretty fast to generate and pretty small to store. For example, on average, we use 20 kilobytes of storage to store one minute of profiling for one Python process, so it's pretty easy to store, even for a few hours or a few days.
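To make the string-pool idea concrete, here is a toy version of the interning scheme; the real pprof schema is a protobuf with function and location tables on top of the string table.

```python
# Every string is stored once in a table and stack frames only reference
# indices, which is what makes repeated function names and file paths cheap
# to store. (In the pprof protobuf, string_table[0] is the empty string.)
string_table = [""]
string_index = {"": 0}

def intern(s: str) -> int:
    if s not in string_index:
        string_index[s] = len(string_table)
        string_table.append(s)
    return string_index[s]

# A stack trace becomes a small list of integers instead of repeated strings.
stack = [intern(name) for name in ("main", "handle_request", "json.dumps")]
print(stack, string_table)
```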
19:22
pretty small to store. For example, on average, we use 20 kilobytes of storage to store one minute of profiling for one Python process, so pretty easy to store, even for a few hours or a few days. So pprof is also the name of the tool
19:42
that you can start pretty easily if you know a bit of Go, but it's not that hard, so it's able to generate this kind of frame charts, which you see on the top left, where you see all the stack traces, and the width is going to be either the size of the location that you do in memory, the time you spend on CPU,
20:00
et cetera, et cetera, depending on the status that you gathered. You can also use this kind of chart on the right, where each rectangle is going to be one function, and everything is going to be connected by the number of times they call each other, et cetera, so it's pretty hard to read when you're not used to it, but I promise if you spend a few minutes on it,
20:20
in pprof, it's pretty easy to gather and to understand. So it's easy to use. You can open any pprof file with it, being Go or Prifon or anything. So what we did at Datalog is an open source library, which is not yet public, just because for them it's too early, but I'm working on it.
20:41
It's going to be integrated into our tracer library, which is already open source. So the profiling part is going to be integrated into that, and if you want to use it and you're not using Datalog, it's fine because it also supports exporting the data to a pprof file, files, with an S,
21:00
so you'll be able to use it directly and to export all the data in pprof and use the tool I showed just before. So as I said, it's not released yet, so if you want to know when it's released, you can follow me on Twitter or anything, and I'll be able to announce it. It should be in a few weeks, no promise, but I hope so, and that's it.
21:39
We have four minutes for questions.