
Production-time Profiling for Python


Formal Metadata

Title
Production-time Profiling for Python
Number of Parts
490
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Learn how to scrutinize your Python application in order to optimize it and make it run faster. Getting inside knowledge of how your Python application runs is critical to achieving the best performance. Profiling is a means to achieve this: by gathering all the runtime information available about the execution of your program, you might be able to understand how to optimize it. However, profiling running code in production can be a real challenge, as it requires the profiler to be noninvasive and to have low overhead. Therefore, to profile production services, statistical profiling is the preferred analysis method. By regularly checking your program's activity, you'll be able to find production code bottlenecks down to the line of code. Profiling services that are running with a real workload ensures that you are collecting valuable data and that you are not guessing what the performance barrier might be. This talk explains how it's possible to build a statistical profiler that collects information about CPU time usage, memory allocation, and other information, all while respecting the need for low overhead, a usable data export format, and fine granularity. We'll dig into some of the operating system and CPython internals to understand how to build the best profiler possible.
Transcript: English (auto-generated)
All right, next talk. Please welcome Julien, who's gonna tell us about profiling in Python. Thank you.
Okay, so, sorry. So, today I'm gonna talk about production profiling for Python. If you don't know me, I'm Julien. I've worked at Datadog for almost a year now, and I've been doing free software stuff for like 20 years. This is the boring statement: let's say that anything that I say today
might change in the future, but I have to say that to you. So, first I'm going to talk about what profiling is and why you need it. Then I'll talk about how we built the Datadog profiler for production systems that we use internally and that we're going to provide to our customers in the future.
So, the first thing to understand is what profiling is and why you need it. If you've never used profiling, the answer is simple: it tries to solve a few questions. If I were to ask you which function or which line of code in a program is going to be the most costly,
where your program is going to spend the most time running that function or that line of code, it would probably be pretty hard for you to guess or to give me an accurate answer. It's really hard to know, when you run programs in production, which endpoint is going to spend a lot of time on the CPU, or which one line is going to allocate the most memory
in your programs. You might guess, but the more complex your application is, the more complicated it's going to be to answer that question. So, the goal of profiling is to actually get inside your program, extract all of this data, and be able to say, okay, this is the stack trace that I see most often.
When I talk about a stack trace, it's all of the functions that are being called one after the other. Being able to say, okay, this is what my program is doing and this is where I need to optimize it to make it faster, because that's the end goal: to see what your program is doing and to make better decisions about
where to spend your time optimizing it. This is typically the kind of output that you will see if you do profiling on a Python program, or any kind of program for that matter: you see stack traces, and the width of each rectangle is the time spent on CPU, or the memory allocated, or anything like that. There are two types of profiling.
The first one is deterministic profiling: if you've ever used cProfile, the Python profiler that has been built into Python for a long time, this is what it does. And there's another type of profiling, which is what we're going to talk about today, statistical profiling, where you do things in a different way,
which has its own perks. Deterministic profiling, if you've never used cProfile or don't know how it works, is pretty easy: it hooks itself inside Python. It says to CPython, each time you're going to call and execute a function, tell me, and I will note down the time; and when you finish executing the function,
call me again and I will note down the time, compute the difference between the two, and know that this function has run for that many seconds. That's pretty cool and pretty easy to understand, but it has a few issues. The first is that you only know the wall time. You know how many seconds your function
has taken, but you don't know what it did. Was it waiting for IO? Was it doing CPU work? Was it allocating memory? You have no clue; you just know that it spent two seconds executing the function. The way it works is that it hooks itself on functions, which is the granularity that you will get in the output.
So if your program is just one giant function, you will have nothing interesting in terms of information; and if your program is a lot of very small functions calling each other, it's going to have so much overhead that your program is going to run 10 or 20 times slower. So it's pretty hard to use cProfile in a production system for that reason:
the more functions you have in your program, the slower it's going to be. cProfile in particular has one problem that in our case was an issue: it exports its data in a custom file format, something like pstats if I remember correctly, which is not something that is really easy to use
in any other programming language. So for all of those reasons, it's impossible to use something like deterministic profiling and cProfile in a production system. You can't afford to have your web server run 20 times slower just because you're doing profiling.
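For reference, here is a minimal, hedged sketch of what deterministic profiling with the built-in cProfile module looks like; the dumped file is in the pstats format, and the work() function and file name below are just placeholders:

    import cProfile
    import pstats

    def work():
        # Placeholder workload; replace with your own code.
        return sum(i * i for i in range(100_000))

    # Profile the statement and dump the raw data to a file in pstats format.
    cProfile.run("work()", "out.pstats")

    # Read the dump back and show the five most expensive entries by cumulative time.
    pstats.Stats("out.pstats").sort_stats("cumulative").print_stats(5)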
The other way of doing profiling is different: you run your program normally, you don't change anything, but every now and then, like every 10 milliseconds or so, you interrupt your program and look at what it's doing right now. If you do that often enough, you get a statistically good representation of what your program is usually doing. So it's not an absolute truth,
contrary to something like cProfile, which really meters every call. In that case, statistically, over a few hours or a few days of your program running, you collect a number of samples that shows you how many times you see each stack trace, and you get the kind of output that I showed you earlier, which is roughly statistically accurate.
The upside of doing that is that you can actually measure a lot of different things, like the wall time, but also the CPU time, which tells you whether you spent time just doing IO or actually using the CPU and doing things like computing results. It has a very low overhead. The way it works is that it leaves your program
doing what it does usually, and does not try to wrap the functions or interrupt them or whatever. It just now and then executes something in the background, asking, okay, what is it doing right now? And that's it. Thanks to Python, you are able to go down to the line of code that is being executed, so you're not just talking about functions:
you're able to see which line of code in particular is being executed. In some languages, not Python unfortunately, you can go down to the opcode, the precise operation within the line; that's not possible in Python yet, unless you split your code a lot. And you can get other things, I'll talk about that later:
you can report the exceptions being raised in your program while they are being handled, see them, and see the control flow of your program. What I'm going to talk about next is what we did at Datadog implementing this. Before doing that, I spent a lot of time checking the state of the art of the dozens of Python profilers out there.
None of them does all of that. A lot of them actually focus on their output, doing fancy flame charts or things like that, which is not something I had to solve. I was only interested in getting the data and exporting it into a proper data format. So that's what I'm going to talk about now. The first thing I had to solve was: how do you get the CPU and wall time information?
The way I solved that is with a thread: you run a parallel thread in your Python program, but it sleeps most of the time. Every 10 milliseconds, say, you wake up and use sys._current_frames(), which is a private function
that has been there forever and that everybody uses, so it's not really private. What this function returns is a dictionary where the key is a thread and the value is its current stack, basically. So if you do that every 10 milliseconds, regularly, you're able to see each time which stack is being executed on each thread,
and you're able to construct a model where you know, every 10 milliseconds, what your program is doing. When you do that, you can also use, and this part is Linux-specific, a function called pthread_getcpuclockid(), which returns a special clock
in Linux that measures the CPU time that each thread is consuming. So every 10 milliseconds, you get a stack trace and you can associate that stack trace with the number of nanoseconds or milliseconds of CPU that has been used by the thread. At the end, you get a pretty good representation of each of your threads, what they do, and how much CPU they used.
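As a rough illustration, here is a minimal, hedged sketch of such a sampling loop; this is not the Datadog implementation, the names SAMPLES, CPU_NS and sampler are made up for the example, and the per-thread CPU clock calls assume Linux:

    import collections
    import sys
    import threading
    import time
    import traceback

    SAMPLES = collections.Counter()   # stack trace -> number of times it was seen
    CPU_NS = collections.Counter()    # thread ident -> CPU nanoseconds consumed

    def sampler(interval=0.01):
        last_cpu = {}
        while True:
            time.sleep(interval)
            for ident, frame in sys._current_frames().items():
                if ident == threading.get_ident():
                    continue  # skip the sampler thread itself
                stack = tuple("%s:%s:%d" % (f.filename, f.name, f.lineno)
                              for f in traceback.extract_stack(frame))
                SAMPLES[stack] += 1
                # Linux-specific: read the per-thread CPU clock described above.
                clock_id = time.pthread_getcpuclockid(ident)
                now = time.clock_gettime_ns(clock_id)
                CPU_NS[ident] += now - last_cpu.get(ident, now)
                last_cpu[ident] = now

    threading.Thread(target=sampler, daemon=True).start()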
You can also get the wall time, because you can use a regular clock. So you get this kind of output: this is an example of a Python program being profiled for one minute with four threads, which is why, although you may not see it, there are four minutes on the root span at the top.
There are four minutes of wall time in total because you've got four threads running for one minute, and you have the detail of which function has been called in which thread. You have the same data for the CPU time, which is way less: there are only three or five seconds of CPU time being used. The output looks a bit the same, you see the same stack traces,
but you see that a lot of the functions are wait functions that are just doing nothing. So if you were using something like cProfile, you would see a lot of time being spent in the wait functions, which is not very interesting because they are just wait functions. Whereas if you take a look at the CPU time, using the data you got from the pthread CPU clock ID, you actually get the number of nanoseconds
used on the CPU, and you can see here, for example, that the wait functions don't appear in the charts below. So as I was saying, it's a much better way to do profiling in a production system because it has a very low overhead. For example, on my laptop,
if I profile a Python program with 10 threads, each one executing a stack 30 functions deep, I can poll and generate 100 samples per second for only 1% of CPU. So that's already a pretty good thing to get,
and it can actually get a bit cheaper, because the way it works, you sleep every 10 milliseconds, but you can increase that amount of sleeping. So if you want less impact on your performance, you can dynamically adapt your sleeping time. Like I was saying, you're also able to get, though you need to do a bit of C for that because it's not accessible from pure Python,
the exception information for each thread. So when you use a block like try/except, and an exception propagates up a few functions before being caught, you can see that; in Python it's the equivalent of sys.exc_info(). You can capture that and see the control flow of the program.
And you can limit the resources used; as I was saying, you can basically sleep more if you need to. Right now, the only downside is that we don't do C profiling. So if you have a Cython module or a C module, you are not able to see any information below the Python stack. That's something I'd like to solve in the next release.
We also do memory profiling. Memory profiling is way easier because, I'd say, most of the work has already been done in Python. There's a Python module since 3.4 called tracemalloc. What tracemalloc does is replace
the malloc function used by Python to allocate the memory for all the objects that you use in Python with its own wrapper, which allocates the memory but also keeps a trace of the allocation. And tracemalloc is able to export the information about which line of code allocated memory, how many megabytes,
and how many times it has been called. So it's pretty easy to get this kind of information without doing too much work.
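As a hedged illustration, this is roughly what using tracemalloc out of the box looks like; the workload is a placeholder:

    import tracemalloc

    tracemalloc.start()

    # Placeholder workload that allocates some memory.
    data = [bytes(1000) for _ in range(10_000)]

    snapshot = tracemalloc.take_snapshot()
    # Print the five lines of code responsible for the most allocated memory.
    for stat in snapshot.statistics("lineno")[:5]:
        print(stat)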
The primary problem with tracemalloc is that it wraps and counts every object allocation that you do, which makes your program very slow again. That's not something you can do on a production system: if you have a program that does a lot of allocations and it gets four or ten times slower just because you are tracing the allocations, it's a problem. So the trick here is to also do sampling and statistical aggregation of the results. The algorithm is pretty simple: you sleep most of the time, and once in a while you wake up and
activate or deactivate tracemalloc, so that you sample the memory allocations or not. If you do that like 5% of the time, it's usually okay. You only get 5% of the allocations being tracked by tracemalloc, but over time, if you do that for a few minutes or a few hours, you get a pretty good representation of the allocation patterns in your application.
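A minimal, hedged sketch of that duty-cycle idea might look like this; the 5% ratio, the period, and the sample_allocations name are just example values, and a real profiler would aggregate the snapshots into its own store rather than printing them:

    import threading
    import time
    import tracemalloc

    def sample_allocations(period=2.0, duty=0.05):
        # Trace allocations only `duty` (here 5%) of the time to keep overhead low.
        while True:
            time.sleep(period * (1 - duty))   # tracing disabled most of the time
            tracemalloc.start()
            time.sleep(period * duty)         # trace during a short window
            snapshot = tracemalloc.take_snapshot()
            tracemalloc.stop()
            # Aggregate the snapshot into your own statistics instead of printing.
            for stat in snapshot.statistics("traceback")[:3]:
                print(stat)

    threading.Thread(target=sample_allocations, daemon=True).start()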
So it gives you two results, as I was saying: the number of allocations that each of your functions does, and the size. The two charts look much the same, but the width changes depending on whether the allocation allocates bigger objects or not. This is the kind of result that you might get at the end, with stack traces
and the number of megabytes that you allocated. So, pretty easy again to debug your program and see what's costly in terms of memory. Note that this only counts the allocations that you do, not the memory actually in use: it's the number of times you allocate memory, and it doesn't track when you free the memory or whether you stop using the object.
Right now, the limitations of tracemalloc, like I was saying, are the overhead, it's really, really slow, and I'm not sure right now if or how it can be made faster. There's no thread information: the way it tracks things is pretty simple, and you don't have any context about which thread
did the allocation, so you can't map that to any thread. And it's also limited in the information it reports. For the CPU and wall time before, it's pretty easy to get the function name because it's reported in the stack. In this case, tracemalloc just gives you the line number and the file, so you don't know which function or which class
you are actually allocating in, or even which type of thing you are allocating; it's pretty light in that regard. So that's something I'd love to improve in the future.
Another thing we do a bit of profiling on is threading. Usually, with multi-threading in Python,
a lot of people avoid it because of the global interpreter lock and how it makes things slow, et cetera. But you still sometimes need threads running. When you write an application with a lot of threads, you need to synchronize on something, and usually you use a lock, which is the most basic object you can get in the threading world in Python.
And when you have a lot of threads trying to acquire that lock, you might have contention, and it might make your program slow. There's nothing in Python to give you that kind of information: is my thread blocking because it's trying to acquire a lock or not?
Is it waiting too long? Is it waiting forever? What is my contention issue here? It's pretty hard to know. The way we solve that is we basically wrap the threading.Lock instances, so each time you create a new lock, you actually get a lock which is not a real lock. I mean, it is a real lock, but under the hood we intercept it
to see when your program acquires it and when it releases it. That makes it possible to get information about how much time threads spent trying to acquire the lock, or how much time they spent holding the lock. So you get a lot of information; for example, you get the total lock acquire wait time.
This is the number of milliseconds, usually not that long, that your program waits to acquire that lock. With this kind of information, you may be able to see that you need more locks, or to split your program into different parts so that you don't use only one lock but several locks, et cetera.
You can get all of this information to be able to redesign your program, which is not always easy or possible, but sometimes it is, and you get all of that per thread, for the different threads that try to acquire the different locks.
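As a hedged illustration of the wrapping idea: the real profiler patches lock creation itself and records which threads waited, while this toy TimedLock class only measures acquire wait time:

    import threading
    import time

    class TimedLock:
        # Wraps a real threading.Lock and records how long callers waited to acquire it.
        def __init__(self):
            self._lock = threading.Lock()
            self.total_wait_ns = 0

        def acquire(self, *args, **kwargs):
            start = time.monotonic_ns()
            acquired = self._lock.acquire(*args, **kwargs)
            self.total_wait_ns += time.monotonic_ns() - start
            return acquired

        def release(self):
            self._lock.release()

        def __enter__(self):
            self.acquire()
            return self

        def __exit__(self, *exc):
            self.release()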
Once we had built all of that, we had to export all the data in order to use it, which is its own challenge. Like I was saying, cProfile has its own custom format, which is not very interesting and not useful for us. There is one format that is fairly widely used, the Callgrind format. It comes from the Valgrind program, which does profiling for C,
and it has its own text-based format that you can use to export data. The problem is that it's text, basically, so it's easy to consume but not really efficient in any way, and so that was not a good solution for us.
Like I was saying, a lot of Python profilers just directly export HTML, which never gave us any clue on how to export to something usable by other programs. And finally there's pprof. If you don't know pprof, it's a lot of things: the name of the profiler in the Go programming language,
also the name of the tool they use to display the data, and also the name of the format, so it's the same word for several different things. The pprof format is pretty cool: it's what you use if you do Go profiling, you can export all of your data in pprof format. It's a format based on protobuf, which makes it pretty easy to understand:
there's a schema with all the fields, the key material, et cetera. It's easy to use, and protobuf is easy to use in Python too. It's very fast to serialize; it would be a shame to be fast at gathering all the data about your Python program and then very slow at exporting it,
so that's not a problem. The way the pprof format works, it's not going to export everything that you gather, like every sample, every timestamp, et cetera. It's going to aggregate all of that: you compute the sum, the average, the minimum, whatever you need, and export that with the stack traces in the format at the end. And it's pretty space efficient:
it uses a pretty simple model of a string pool, because you usually have a lot of repetition, like the function names you see are always the same and the stack traces are most of the time the same, so all of that is pretty well organized in the schema, and it's gzipped by default, so it's pretty fast to generate and
pretty small to store. For example, on average, we use 20 kilobytes of storage to store one minute of profiling for one Python process, so it's pretty easy to store, even for a few hours or a few days.
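To illustrate the string-pool idea with a toy sketch: this is not the actual pprof schema, just the deduplication principle, with a made-up StringTable class (in the real format, index 0 is reserved for the empty string):

    class StringTable:
        # Stores each distinct string once; everything else refers to it by index.
        def __init__(self):
            self.strings = [""]          # index 0 is reserved for the empty string
            self.index = {"": 0}

        def intern(self, s):
            if s not in self.index:
                self.index[s] = len(self.strings)
                self.strings.append(s)
            return self.index[s]

    table = StringTable()
    frames = [table.intern(name) for name in ("main", "handle_request", "main")]
    # frames == [1, 2, 1]; "main" is stored only once however often it appears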
pprof is also the name of the tool, which you can install pretty easily if you know a bit of Go, but it's not that hard. It's able to generate the kind of flame charts you see on the top left, where you see all the stack traces, and the width is either the size of the allocations you made in memory, the time you spent on CPU,
et cetera, depending on the statistics that you gathered. You can also use the kind of chart on the right, where each rectangle is one function and everything is connected by the number of times the functions call each other, et cetera. It's pretty hard to read when you're not used to it, but I promise that if you spend a few minutes on it
in pprof, it's pretty easy to grasp and to understand. So it's easy to use, and you can open any pprof file with it, whether it comes from Go or Python or anything else.
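(As a hedged usage note: with a Go toolchain installed, a command along the lines of "go tool pprof -http=:8080 profile.pprof" should open the interactive web UI on a generated profile; the exact flags depend on your pprof version.)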
So what we did at Datadog is an open source library, which is not yet public just because it's too early, but I'm working on it. It's going to be integrated into our tracer library, which is already open source. The profiling part is going to be integrated into that, and if you want to use it and you're not using Datadog, that's fine, because it also supports exporting the data to pprof files, with an s,
so you'll be able to use it directly, export all the data in pprof, and use the tool I showed just before. As I said, it's not released yet, so if you want to know when it's released, you can follow me on Twitter or anywhere else and I'll announce it. It should be in a few weeks, no promises, but I hope so, and that's it.
We have four minutes for questions.