Is it me, or the GIL?
Formal Metadata
Title: Is it me, or the GIL?
Number of Parts: 118
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/44758 (DOI)
Transcript: English (auto-generated)
00:04
The background is that, as I was already introduced, I'm actually doing quality assurance at SAP for one of SAP's major products, SAP HANA, which is an in-memory database. It's basically powering the various enterprise
00:22
applications you can find in the portfolio of SAP. What we are actually doing is testing each commit coming into the source code. At our scale, that means we are testing around 800 commits every day. Therefore, we are operating a small infrastructure
00:40
of physical hardware with around 1,600 machines. And as we are testing an in-memory database, we need a huge amount of memory. Overall, we are currently using around 610 terabytes of memory across our landscape. So the problem is, if you have such an infrastructure,
01:02
if you have such load characteristics, you need optimized services, optimized tools to handle that incoming load. And that's actually the main part of what my colleagues and I are doing: we are developing such tools, such services, which are optimized for our workload. And one of these tools is our own task execution framework,
01:24
which we put on top of Apache Mesos. Apache Mesos is something similar to Kubernetes, but a bit more low-level. What it provides us is a kind of interface to the resources of machines. At the bottom, you can see that we have multiple data
01:42
centers with our own physical hardware, but also various cloud providers with cloud instances. Every time an instance has some resources available, it will send them via Apache Mesos to our own task scheduler. And our task scheduler then has to decide what kind of task it should now schedule
02:02
on these available resources. Now, the problem is that if you add more and more machines, this also gets a bit more complicated. Especially in our case, our task scheduler is receiving more and more incoming offers
02:21
with more and more events about changing task states: maybe a task finishes, so you now have the opportunity to launch a new task, or a task fails, so you have to reschedule it, and you have to handle all these events. And the problem is that at some point in time we hit a bottleneck: our scheduling system
02:40
was simply not able to handle all the incoming events and use all the resources in an efficient way. And you can imagine that with such an amount of hardware, you would actually like to use it as efficiently as possible. You can also see that around our task scheduler there are various other services and databases.
03:01
So the task scheduler is basically just interacting with all these various services and doing a lot of IO operations. Therefore, the initial design of the task scheduler is basically one big Python application with various threads that handle the incoming data, process the incoming data,
03:22
and find the best way to schedule a certain task on a certain available resource. We also have other threads in the process itself which are required for our observability stack. For example, we have a thread that is responsible for transmitting all the exception data
03:42
to Sentry, and distributed tracing is in place for our applications, so we also have a thread responsible for sending traces over to a Jaeger instance. Now we have the problem that we no longer have the best performance. Resource utilization goes down.
04:01
The nice thing about our observability stack is that we can actually inspect each thread and each part of the system and find out where the bottleneck is. And we applied the same method to find out why we cannot utilize all the available resources. Therefore, we have to take a look inside the resource offer handling thread.
04:23
And this is a schematic visualization of the distributed tracing system which we are currently using. We can see the required time for certain operations. For example, we see how long we need for selecting an offer for a task, or how long we need to actually prepare the task
04:42
so that it can be scheduled. And the first strange thing is that for the same function we actually see different runtimes. For example, for selecting an offer we see a variety of runtimes, from 700 milliseconds down to 30 milliseconds,
05:01
which is actually quite strange. The next interesting thing is that we also see increased latency in a way that is not expected. What you see in the first highlighted box is the span
05:20
which is captured on the scheduler side. So we're capturing that we need around 200 milliseconds for this API call. The service which we are actually calling is also transmitting its data into the same system. Therefore we know that the service itself only took around 30 milliseconds to process the API request.
05:42
So that's also a bit strange, and we would like to investigate that a bit. The next thing which is also strange in this capture is that there are gaps between operations. And if you take a look into the code, there are actually no gaps.
06:00
Inside of the prepare-task operation there are two sub-operations, these two API calls. There's nothing in between. So why is there a latency of multiple milliseconds in between? And then we started to assume: okay, we are using threads, it must be the global interpreter lock.
06:21
We are hitting the global interpreter lock. There's contention, and that's now our bottleneck. But I mean, that's no problem. You just open a browser, do some research, and then you will find a lot of ways to mitigate global interpreter lock contention.
06:41
So we could just start and replace all the multi-threading with multi-processing or asyncio stuff. Or we could pinpoint certain functions which are CPU-bound and migrate them to Cython, which can actually release the GIL in certain sections. Or let's just rewrite everything in a faster language.
07:03
But if that were the solution, I would probably not be talking now about the global interpreter lock. In general, we have to say that such rewrites or major refactorings are super expensive. I mean, we're talking about a production system that's powering a huge amount of workload.
07:20
And we would actually like to invest more time in new features, making it even more efficient. Handling such performance problems is also important, but we have to use the right tools to actually solve them. Therefore, we took one step back and decided:
07:40
okay, let's first analyze the problem. Find out: is it actually GIL contention? And if we then detect that, we can go and decide even better what kind of mitigation is the best one for our application. So let's take a look at the GIL. I mean, it's probably easy. You just import a module,
08:02
for example sys for the Python interpreter, and then you ask the Python interpreter how it's going with the GIL. Sadly, that is not true. There is no simple function which you can use to get some GIL statistics, which means that we have to think about something else.
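Just to illustrate the gap: the kind of call you might wish for simply does not exist, and the closest thing the sys module actually exposes is only loosely related. The gil_stats name below is purely hypothetical, invented here to make the point:

```python
import sys

# Hypothetical: there is NO such attribute in CPython, so this raises AttributeError.
try:
    sys.gil_stats()  # imagined API for per-thread GIL wait/hold statistics
except AttributeError:
    print("CPython exposes no built-in GIL statistics")

# The closest real knob is the switch interval, i.e. how long a thread may keep
# the GIL before it is asked to give it up (0.005 seconds by default).
print(sys.getswitchinterval())
```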
08:21
So I thought a bit about what I would actually like to know about the GIL. The first thing I would like to know are the basic, fundamental metrics of a lock. In the end, the global interpreter lock is a lock. So I would like to know how long a thread actually waits for the lock
08:42
and how long a certain thread is actually holding that lock. If I know those metrics (and we all know that if you just have some numbers, they are often not so useful), I would also like to have some additional context: which thread it is, which Python function
09:00
is actually suffering from that contention. Is it maybe possible to get the trace ID or request ID in there? Because then I can correlate that with other systems. And I would actually like to use this in our production environment, because I cannot reproduce the problem on my local machine; my local machine is not connected to our cluster with multiple hundreds of machines.
09:24
So that's a problem. And in the best case, it would be integrated with our existing observability stack, so that my colleagues and I don't have to learn another tool, how to use it, how to interact with it, and so on. So with that list, I went through the internet
09:42
and looked at what tools are available. How can I analyze the code? I mean, I'm probably not the first person thinking about that problem. So there is some related work. There is, for example, a super interesting talk by David Beazley from PyCon 2010.
10:05
But it's actually quite up to date, because it already talks about the new GIL implementation which we have since Python 3.2. And within his talk, he explains how he measured the GIL contention and stored the data so that he actually
10:23
was able to generate such a nice graph. The problem is that the instrumentation only works with Python 2.6, and there are some other problems which make it not so usable in our production environment. The main problem: you have to shut down the interpreter at the end to dump all the data out
10:41
and then you have to generate the visualization out of it. I actually don't like shutting down our scheduler, because every minute it is not running, we are basically losing resources. The next thing which came to my mind is the thread concurrency visualization of PyCharm,
11:01
which is actually a very nice tool to visualize lock contention in your application. The problem is that PyCharm doesn't take the GIL into account. In this example, I just ran an application which is heavily GIL-bound because it's basically always burning CPU cycles,
11:22
but you don't see anything. There's a newer tool called gil_load which is quite interesting, because it's a profiler which you can easily integrate into your existing Python application, and during the application runtime
11:41
it will print out a load number. It's comparable with the load of a Linux system, for example. The problem is that at the end you have only a number. You don't know what the root cause of the problem is. So if I run that on an application which is heavily bound to the GIL
12:00
and which always runs into the same contention, I just get the information: yes, you have a problem, but no other information on how to solve it. A very promising approach is py-spy, which is a new profiler for Python written in Rust, which is very fast,
12:21
and you can easily attach it to a running application and get this nice overview of which functions take what amount of time and so on. It also includes the GIL utilization, which is quite nice, but again, you don't have a breakdown to find out what the actual problem is.
12:45
The truth is, there is no magic GIL contention analytics tool, and that means we probably have to do it ourselves. Okay, let's create a tool which actually reveals the GIL and hopefully is able to fulfill all my wishes
13:04
for such a tool. For that, we can use an existing framework as a basis: SystemTap, which is available on Linux machines, and which actually allows you to analyze applications by attaching certain event handlers to applications.
13:27
Sorry, it actually allows you to attach handlers to certain events which are emitted by applications or by the Linux kernel itself. And then you can do certain calculations within these event handlers
13:41
and print out the measured times, for example. The nice thing is that CPython 3.6 actually introduced support for SystemTap and DTrace, and there are already some markers, some event emitters, which you can use
14:00
to analyze your Python application with SystemTap. For example, we have function__entry and function__return, which will be invoked every time the interpreter enters a new Python function or returns from a Python function. I can highly recommend the documentation on this, because it's very thorough and really helps to understand the concepts of SystemTap
14:21
and how you can use it. I also saw in the schedule of this conference that there's another talk about low-level profiling which will also cover SystemTap. The problem with this approach is that most prebuilt Linux packages actually don't ship a Python interpreter compiled with DTrace support.
14:40
So the markers are not there, and you cannot use them. Also, there are no markers for the GIL-related areas, where you would actually like to know when a thread acquired the GIL or when a thread is dropping the global interpreter lock. But especially the last part was actually not so complicated to implement.
15:03
So this is just one part of the patch which introduces some markers regarding the GIL. Every time a thread drops the GIL, it will now emit an event about that. And every time a thread tries to claim the GIL, it will also emit an event.
15:23
As you can see, we can also add arbitrary attributes to these markers, which will then be accessible in SystemTap. In this case, I'm passing the thread ID so that I have an idea which thread is actually performing this action.
15:43
Now you may ask why I don't use the thread names. I mean, every application should have nice thread names so that you can actually recognize the threads. The problem is that at the moment there is sadly no C API to get the thread names without actually holding the GIL, which is a bit complicated
16:02
if you would like to measure the time until you actually have the GIL, but you need the GIL to get the thread names. Complicated. But I think that's also something we can solve at some point in time.
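Until such a C API exists, one workaround is to dump a mapping from OS-level thread IDs to Python thread names once from inside the application, and use it later to translate the IDs in the SystemTap output. A small sketch, assuming Python 3.8+ for native_id:

```python
import threading

def dump_thread_names():
    """Print OS thread ID -> Python thread name, to correlate with SystemTap output."""
    for t in threading.enumerate():
        # native_id is the kernel-level thread ID, the same one SystemTap sees (Python 3.8+).
        print(f"{t.native_id}\t{t.name}")

dump_thread_names()
```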
16:20
Okay, now we have these markers, and our Python interpreter emits these events whenever those points are passed. Let's measure the time at those points. We attach probes, so-called event handlers, to these events. For example, we instruct SystemTap to look inside the shared library of the Python interpreter, where the markers are located.
16:42
And if we now go into the GIL claim part, so a thread would like to acquire the GIL, we take the time of day in nanoseconds and store that in a SystemTap hash map. As the key, we use the thread identifier. If we then acquire the GIL,
17:01
we can calculate how long it took until the thread acquired the GIL and store that in another SystemTap data type, an aggregate, which allows us to get some statistics out of SystemTap, for example distributions in the form of a histogram
17:20
or an average and so on. And the same thing we also do for the GIL drop, where we can then calculate how long we actually held the GIL. Now we are calculating these numbers; what we actually would like is to print them out in some kind of report.
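The actual measurement lives in a SystemTap script, but the bookkeeping it does is simple enough to model in plain Python for illustration; the handler names below are illustrative, not the real marker names:

```python
import time
from collections import defaultdict

claim_started = {}              # thread id -> timestamp when it started waiting for the GIL
acquired_at = {}                # thread id -> timestamp when it obtained the GIL
wait_ns = defaultdict(list)     # thread id -> observed wait durations
hold_ns = defaultdict(list)     # thread id -> observed hold durations

def on_gil_claim(tid):          # thread starts trying to take the GIL
    claim_started[tid] = time.monotonic_ns()

def on_gil_acquired(tid):       # thread actually got the GIL
    now = time.monotonic_ns()
    if tid in claim_started:
        wait_ns[tid].append(now - claim_started.pop(tid))
    acquired_at[tid] = now

def on_gil_drop(tid):           # thread released the GIL
    if tid in acquired_at:
        hold_ns[tid].append(time.monotonic_ns() - acquired_at.pop(tid))
```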
17:42
To print such a report, we can use the handlers which are invoked at the startup and termination of the SystemTap tracing session. So when the tracing session stops, we can print out some debugging information, terminate the tracing, and print out a summary
18:02
over all the measured threads with the respective timings. Let's do some examples. In the first example, I have a Python process with two IO-bound threads. How do I actually make an IO-bound thread? By simulating the IO with time.sleep,
18:22
because it's basically the same behavior as, for example, a read request on a socket. Before the thread goes to sleep for a certain number of seconds, it will release the GIL. After the sleep is complete,
18:40
it will try to reacquire the GIL. The same thing also happens on a socket: you would like to read something from a socket, you define the number of bytes you would like to read, and as long as there are not enough bytes, or actually no data available at all, it will block and wait. But before that, Python will release the GIL for you.
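A minimal reconstruction of that first demo (not the exact script from the talk) could look like this:

```python
import threading
import time

def io_bound():
    # time.sleep stands in for blocking I/O: CPython releases the GIL before the
    # sleep and has to re-acquire it afterwards, just like a blocking socket read.
    deadline = time.monotonic() + 15          # run for about 15 seconds, as in the talk
    while time.monotonic() < deadline:
        time.sleep(0.01)

threads = [threading.Thread(target=io_bound, name=f"io-{i}") for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```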
19:02
If we now measure that, we can see that we actually don't have any problem with that application. You can see that the main thread only had to wait a bit more than one millisecond on the global interpreter lock, which is quite nice, especially if you know that this application runs for 15 seconds in that simulation.
19:24
If you're looking at the hold time, we also see that the main thread is the most prominent thread, holding the GIL most of the time, which also makes sense. It has to import the libraries, the threading library and so on, so most of its time probably goes into initialization.
19:43
The IO threads themselves are super lightweight, so they don't have to wait that long on the GIL. And when they have the GIL, they don't consume that much time in the Python interpreter either. Overall, we can say we hold the GIL only zero to two percent of the full runtime
20:04
and we also had only a wait time of less than 0.01% of the full runtime. So basically no GIL contention at all. And in that situation, the GIL is not a problem.
20:21
We can change that easily if we introduce a CPU-bound thread. How do we simulate a CPU-bound thread? It's actually quite simple. We just need an endless loop that's doing nothing. So this while loop with a pass inside will do the trick for us, and we now have a CPU-bound thread.
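The added thread really is as small as it sounds; a sketch:

```python
import threading

def cpu_bound():
    # A pure-Python busy loop: it never blocks, so it only gives up the GIL when
    # the interpreter forces it to after the switch interval.
    while True:
        pass

threading.Thread(target=cpu_bound, name="cpu", daemon=True).start()
```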
20:43
If we now take a look at the timings, we can actually see GIL contention. We see that the main thread has the same behavior as before, the same hold time, but an increased wait time. Okay, that's probably already a problem. The more important problem is that our IO threads
21:03
now have a much higher wait time to actually get the GIL. So we are now waiting more than 700 milliseconds just to get the GIL, while we are actually only waiting around 100 milliseconds in our application itself.
21:22
The hold time is still the same. And the CPU thread, which is CPU-bound, is consuming all the available CPU time and therefore holding the GIL for nearly the full runtime. So overall, we now see that the GIL was basically
21:40
always held by at least one thread, and we also had some wait time in there. Even more interesting is the latency for the IO threads. If you take a look at the histogram of the IO threads, you will notice after some time
22:03
that the latency is quite stable, between four and eight milliseconds. It's quite interesting that it's so stable, but it also shows us one main disadvantage of this GIL contention. The GIL contention already affects
22:21
the overall performance of the application by introducing five milliseconds of additional latency to any attempt to acquire the GIL after blocking IO.
22:41
Which is actually quite a problem in some cases, because normally you run multiple IO operations during, for example, an HTTP request. Why is it so stable? Because of the internal switch interval, which is an implementation detail of the GIL, but it's actually quite interesting to see it in action.
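You can inspect the switch interval from Python and even experiment with it; a small sketch (timings are approximate and machine-dependent):

```python
import sys
import threading
import time

def spin():                      # CPU-bound thread that never voluntarily releases the GIL
    while True:
        pass

def wakeup_overhead(n=200):
    """Average extra time needed to get the GIL back after a 1 ms sleep."""
    extra = 0.0
    for _ in range(n):
        t0 = time.monotonic()
        time.sleep(0.001)        # the GIL is released during the sleep
        extra += (time.monotonic() - t0) - 0.001
    return extra / n

threading.Thread(target=spin, daemon=True).start()
print(sys.getswitchinterval())   # 0.005 seconds by default
print(wakeup_overhead())         # roughly the switch interval while the spinner holds the GIL
sys.setswitchinterval(0.0005)    # a smaller interval trades throughput for lower latency
print(wakeup_overhead())
```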
23:04
Every time a thread would like to acquire the GIL, it will check if someone is holding the GIL. If someone is holding the GIL, it will wait on a condition and sleep for up to five milliseconds. If the GIL is still taken after that,
23:21
it will send out a request to drop the GIL, so that the other thread which is still holding the GIL should please release it and the new thread can take over the global interpreter lock. That means this alone can already add the additional latency
23:42
if you only have Python instructions; then you will basically always get this stable latency of around five milliseconds. The problem is that a thread holding the GIL can also hold it even longer than these five milliseconds, because some bytecode operations take longer,
24:00
or you're calling out into an external C function which doesn't know anything about the GIL and also doesn't release it. Now with that, we have a tool set which we can use to analyze our application, our production application. So here's the plan for how we do that.
24:20
We deploy a new container with our custom, instrumented CPython version, including the SystemTap markers. On our cluster, we go to the machine which is running the scheduler, we attach to the process, and we get some nice insights about the GIL contention.
24:42
In reality, it was a bit different. It was quite easy to deploy the container with the custom CPython version, no big deal. But in the end, we had to install SystemTap on the host, because it's actually not that easy to get SystemTap running inside of a container. And that's related to the architecture of SystemTap.
25:03
SystemTap actually transforms your script, which I showed a few slides ago, into a real kernel module and loads it at runtime. It is then that kernel module which runs to measure all the timings, and in the end the data is printed out.
25:24
That's possible. I also did that in our production environment. It was interesting, but I don't recommend it. Especially if you're talking with a security person, they will probably not be happy if you start adding custom kernel modules
25:42
in your production environment and running processes as root; your full isolation is basically gone. Nevertheless, I measured some nice results. Over a two-second observation of our process, I found out that we actually hold the GIL around 88% of the full runtime,
26:03
sorry, of the measured timeframe. In the same timeframe, there were so many threads waiting for the GIL that we had an overall wait time on the GIL of nearly 300%, which actually proves: yes,
26:20
our application really suffers from GIL contention. It's not great, but it actually raised even more questions. The main question which first came up is: are there threads holding the GIL longer than five milliseconds? Or do all threads actually give up the GIL quite fast,
26:44
and we are just hitting the limit of switching around between the various threads? That would be one possibility, but let's take a look whether we actually see that in our infrastructure. If we see it, I would like to know which function is taking so much time
27:00
and which function is not releasing the GIL. Also, one question which came up was: is there maybe some kind of clustering in these measurements? Is it maybe possible to identify clusters for certain operations, certain patterns in these measurements that are very intensive
27:22
in terms of global interpreter lock contention? Overall, the biggest problem was that with 31 threads, it was actually quite hard to read this text report. And that's also a problem with so many threads:
27:41
such tools are often not optimized for that number of threads, and I would not recommend it. So one thing which came to my mind is that timelines are much easier to understand. I mean, if you take a look at distributed tracing systems,
28:00
they're actually quite good at visualizing how long a certain operation takes. It's much easier to recognize how long an operation takes based on the size of each individual span's representation. Okay, could we please do the same thing for the GIL contention, and maybe find out what's going on in our application?
28:21
Therefore, the idea is: let's still use SystemTap, because at least it proved that we can collect the data. Collect the data with SystemTap and print out a text file, which we can then load into a Jupyter notebook and do various analytics on,
28:43
create some charts, some nice visualizations, all the things we can do with our data science tools. And I think that's one of the nicest things about Python nowadays: we all have these nice visualization libraries at hand to make data easy to grasp.
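For example, assuming the SystemTap output has been parsed into per-thread intervals with columns like thread, state, start_s, end_s (these names and the colors are assumptions, not what the talk used), a timeline of this kind can be sketched with pandas and Bokeh:

```python
import pandas as pd
from bokeh.plotting import figure, show

# Intervals parsed from the SystemTap text output; the CSV layout is assumed.
df = pd.read_csv("gil_intervals.csv")     # columns: thread, state ("hold"/"wait"), start_s, end_s
threads = sorted(df["thread"].astype(str).unique())
colors = {"hold": "darkblue", "wait": "red"}

p = figure(y_range=threads, height=600, width=900, x_axis_label="time [s]",
           title="GIL hold (blue) and wait (red) per thread")
for state, group in df.groupby("state"):
    p.hbar(y=group["thread"].astype(str), left=group["start_s"], right=group["end_s"],
           height=0.8, color=colors[state])
show(p)
```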
29:04
What I did is visualize that with Bokeh. This is now the timeline visualization of our 31 threads over two minutes. What you can see is that in the areas
29:21
where we have dark blue boxes, we have GIL usage showing that a thread is holding the GIL for longer than 50 milliseconds. Please remember: switch interval, five milliseconds.
29:42
In the red boxes, the red areas, we have threads which are waiting for the GIL for longer than 50 milliseconds. Think about optimizing web applications: you normally try to be as fast as possible, and any web developer who says,
30:01
yeah, 300 milliseconds are totally fine for a web page. I'm not sure about that. Okay, let's zoom in a bit and we can take a closer look. So these are now 20 seconds, five seconds. Now we came to the interesting part of one second.
30:22
And what we saw is that there is actually a clustering of big blue boxes. For example, this part is taking more than 500 milliseconds. For 500 milliseconds, this thread is doing all the work. No other thread can do anything.
30:40
And the first question was: okay, what is this thread? It's actually a thread which is collecting metrics of our scheduler and sending these metrics to a central system, so that we know how many tasks are running, how full our queues currently are, and so on. And after adding an additional visualization,
31:02
I found out that this thread is actually consuming 75% of the full GIL hold time, which is super expensive if this thread is only taking some metrics out of the application.
31:22
Okay, so now we know what the problem is. And the nice thing is, now that we know the problem, we can actually fix it. So what we did is we started replacing the C extension which we were using inside this thread to calculate all these various metrics.
31:41
We actually found out that the C extension never released the GIL, because internally it heavily uses Python objects. But normally, when you read "C extension", and at least that was the intention behind choosing that C extension, you think: yeah,
32:00
it probably releases the GIL quite early. That's not always true. Also, probably the most simple fix was to just change the interval of our metrics collection. So we changed the interval from 10 seconds to 120 seconds.
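Conceptually, that fix is just a longer sleep in the metrics thread; a trivial sketch (the real collector is more involved, and collect_metrics here is a placeholder):

```python
import threading
import time

METRICS_INTERVAL = 120          # was 10: collecting less often means holding the GIL far less

def metrics_loop(collect_metrics):
    while True:
        collect_metrics()                   # gather and ship the scheduler metrics
        time.sleep(METRICS_INTERVAL)        # the GIL is released while sleeping

threading.Thread(target=metrics_loop, args=(lambda: None,), daemon=True).start()
```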
32:21
So we are now collecting less often, but I mean, if it solves the GIL contention, that's not a big deal. And that's actually what happened. Before, we saw a huge usage of CPU time where the GIL was required for processing,
32:43
and also a huge amount of time where threads were waiting for the global interpreter lock. After applying these simple fixes, these simple changes, it was actually possible to basically cut that time in half.
33:02
So we now only hold the GIL around 43% of the time, and we only wait 80% on the GIL. The wait time is still not great, but nevertheless, with these simple fixes we were able to speed up our task execution scheduler so much
33:20
that we can now actually use all the available resources. And it basically saved us a huge amount of refactoring of our application, of replacing multi-threading with multi-processing or asyncio, where we basically had to ask: is it maybe easier to rewrite the full application from scratch? And I mean, that is probably
33:42
the most important takeaway from performance-related work: please first try to find out whether you actually have that problem, whether it makes sense to invest in that area. And you will probably find an even better solution if you have more insights into it.
34:01
And especially now, that will help us to find out what the next best evolutionary step for our application is. Does it make sense to introduce multi-processing, or does it make sense, for example, to rewrite some functions which are a bit more CPU-bound than others and move them into Cython, for example,
34:22
where we can release the GIL? Also, this whole topic revealed multiple additional ideas for how we could actually improve the Python ecosystem in that area. So these are the next things I would actually like to do,
34:41
but as you know, time is always a limit. I would like to bring this tool set into a more reusable state so that it's easier to use for people who were not involved in developing these scripts. But I also think it would be possible to extend CPython in some areas
35:02
to make it even easier to build such tool sets. For example, a C API for thread names is one thing which would be super helpful. Every time I read a SystemTap report, I had to map the thread identifiers to human-understandable names,
35:22
otherwise you have no chance with 31 threads. Also, one question I had in mind is: would it be possible to actually move this metrics collection directly into CPython, like a built-in profiler? It doesn't have to be active all the time,
35:40
but maybe we can enable it on demand if we actually need it, collect the statistics, and get them out. If we had such an API, it would probably also be easier to integrate that into existing observability tools, for example distributed tracing, which then shows: yes, this operation
36:00
actually took 700 milliseconds, but overall it only consumed 30 milliseconds of the GIL. And at that point, you would already know: okay, I probably have a problem with the GIL. If you are interested in this topic, and I hope there are some people interested because the room is quite full, then please come find me.
36:22
I would actually like to talk about that. I would like to hear some feedback, things that could be improved or things which would be interesting for you, because with that feedback, I think we could actually start some things in that area to make performance measurements in Python much better.
36:41
And I mean, yes, we have the GIL, and we will probably never really remove the GIL from CPython, because it is not so easy; otherwise people would have already done that. But if we learn to live with the GIL, then we can also improve overall Python application performance,
37:03
and we don't have to rewrite our applications in, I don't know, C++ or C or assembler, if you really like to do that. Okay, then thank you very much. And as we have five minutes, I think we can do one or two questions. Thank you. Thank you.