Adopting continuous profiling: Understand how your code utilizes CPU/memory
Formal Metadata

Title: Adopting continuous profiling: Understand how your code utilizes CPU/memory
Author: Christian Simon
Number of Parts: 542
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/61931 (DOI)
FOSDEM 2023, 404 / 542
Transcript: English (auto-generated)
00:09
Hi everyone, my name is Christian Simon, and I'm going to be talking about continuous profiling. We have heard a lot about observability today already, and
00:21
I want to introduce maybe an additional signal there. Quickly about me: I work at Grafana Labs as a software engineer, on our databases for observability. I worked on Cortex, then I
00:41
worked on Loki, and most recently I switched to the Phlare team, so I'm 50% of the Phlare team. We work on a continuous-profiling database, but there's not going to be a particular focus on Phlare today. Basically, what I want to do today is introduce how profiling is measured
01:02
and what you can achieve with it; maybe at the next FOSDEM, once I have learned more, I can go into more detail and look at very specific languages. So, when we talk about observability, what are our common goals?
01:22
Obviously we want to ensure that the user journeys of our users are successful, and that we can maybe even find problems proactively, before a user notices them. Basically, when those problems do happen, we want to react as quickly as possible, so that we disrupt fewer of those user journeys.
01:46
Observability provides us an objective way of getting insights into the state of our system in production, and even after a certain event has happened. Say we found the right fix,
02:00
rebooted, and everything is up again: it might help us understand what exactly happened when we want to figure out the root cause. One of the easiest and probably oldest observability signals is logs. I guess it starts with a print
02:23
"Hello" somewhere in your code. I probably don't need a show of hands for who's using logs; everyone somehow uses them, or is asleep and doesn't raise a hand. Your application doesn't need any specific SDK, it can probably just log using the standard library of your language. One of the challenges is that
02:47
most of the time the format is rather varied. Even just in terms of timestamps, it can be really hard to get a common understanding of your log lines. And
03:02
when you then want to aggregate them, it is quite costly: they are bytes, and you need to convert them to ints and floats and so on. It can also happen that you have so many logs that you can't find the ones you are actually interested in. A grep for "error", for example, could be
03:22
helpful, but there might just be too many errors, and you lose the important ones. If you want to learn more about logs, my colleagues Owen and Kavi are going to speak about Loki, so definitely stay in the room.
03:43
Now I'm going to move on to the next signal: metrics. I'm also assuming pretty much everyone has come across them and is using them. With metrics you avoid the problem I mentioned before: you have the actual integers exposed, and you know which integers you care about.
04:02
To get a metric, most of the time you have to define a list of metrics you care about, and then you can collect them. So you might have an outage and find you didn't have the metric you care about, which means you need to constantly improve the exposure of your
04:21
metrics. Prometheus is obviously the main tool in that space. Very often we talk about web services when we think about these applications, and the RED method (get the rate of your requests, the error rate of your requests, and the latency,
04:42
or duration, of your requests) already covers quite a lot of cases. And since metrics are just integers or floats, you can aggregate them quite efficiently across a multitude of pods, or a really huge set of services.
05:04
Then, if you get into the kind of microservices architecture that has evolved over the last couple of years, you will find yourself with a really complex composition of services involved in answering requests. You might even struggle to understand what's slowing you down or where an error is coming from: why do I have this timeout here?
05:27
Distributed tracing can help you a lot with understanding what your services are doing. It might also be that a service is doing way too much, calculating things over and over again, so tracing is super helpful for seeing
05:44
the flow of the data through your system. The challenge is that you might have a lot of requests, and while it is somewhat cheap to collect a trace, you might not cover all of the requests. For example, in our production system,
06:03
and someone should correct me if I'm wrong, when we receive logs and metric data in Grafana Cloud we only cover 1% of those requests with traces, while we cover 100% of our queries.
06:21
You need to make a selective decision about where tracing is worth the investment. Log ingestion traffic always looks the same, it comes in every second and so on, so we see more value in having traces for all of the queries, where complex caching and all sorts of systems interact. That allows us to look a bit deeper and even find
06:46
that one service that might be the bottleneck. So let's look at a bit of a real problem. I'm running an online shop selling socks, and a user is complaining about getting a timeout when trying to check out. That's obviously not great, because I'm not selling the socks.
07:06
But at least the user got some trace ID to include when complaining to our customer service. Starting from there, I might figure out: hmm, it's the location service that actually caused the timeout in the end.
07:22
Then, looking at the metrics of the location service, I might find that actually 5% of the requests are timing out, so maybe 5% of my users are not able to buy their socks. So what are the next steps? Scaling up is always good; maybe the service is just overloaded.
07:44
The person who wrote it left years ago, so we have no idea what it does, so we just scale it up. Obviously that comes with a cost, but the first thing should always be to stop the bleed and make sure there are no more timeouts, so scaling up is definitely the right option here.
08:01
But if you keep doing that for years, you might suddenly find yourself with a lot of extra cost attached to that location service. And that is where I think we need another signal, and that signal should be profiling. I guess most people have come across profiling:
08:22
it basically measures your code, for example how long it executes, or how many bytes it allocates in memory. It helps you understand your own program even better, or someone else's program, as in the location service case.
08:44
Eventually that can translate into cost savings if you find out where the problem lies. You can maybe fix it, or at least get some ideas, and you can look at the fix afterwards and see whether it actually got better or worse. It basically
09:02
gives you a better understanding of how your code behaves. The next question is: what is actually measured in a profile? I created a small example program; I hope everyone can see it. It's a program with a main function that calls out to other functions:
09:24
there is a doALot function and a doLittle function, and both of them call prepare. In the comments, some work is going on; that work could be allocating memory or using the CPU. Let's say we use CPU.
09:44
When the function starts, we first do something within main; let's say we spend three CPU cycles, which is not a lot, but it gets recorded: we took three CPU cycles in main. We then go into the prepare method through doALot,
10:03
where we spend another five CPU cycles, and those stack traces get recorded in the profile. Going
10:21
through the program, we end up with that kind of measurement of stack traces. With ten lines of code you can maybe spot where the problem is (the 20 cycles in doALot), but this definitely breaks down when you're talking about a lot of services, or a large code base that happens to be hot and actively used. So there are a couple of ways of visualizing profiles.
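The slide with the example program is not reproduced in the transcript, so here is a minimal Go sketch of what it could look like. The function names (main, doALot, doLittle, prepare) follow the talk; the loop bodies are hypothetical stand-ins for the work, with one loop iteration standing in for roughly one of the "CPU cycles" in the talk's numbers.

```go
package main

import "fmt"

// prepare simulates the shared setup work (five "cycles" in the talk's example).
func prepare() int {
	sum := 0
	for i := 0; i < 5; i++ { // stand-in for five cycles of work
		sum += i
	}
	return sum
}

// doLittle simulates a cheap code path: five cycles of its own work plus prepare.
func doLittle() int {
	return prepare() + 5
}

// doALot simulates the hot path: twenty cycles of its own work plus prepare.
func doALot() int {
	v := prepare()
	for i := 0; i < 20; i++ { // stand-in for twenty cycles of work
		v += i
	}
	return v
}

func main() {
	// main also does a little work of its own (three cycles in the talk).
	total := doALot() + doLittle() + 3
	fmt.Println(total)
}
```

Profiling such a program attributes samples to stacks like main > doALot > prepare, which is exactly the data that the top table and flame graph views then aggregate.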
10:42
I think one of the first things you would find is a top table. In such a table you can order by different values. This is an example from pprof, which is the go-to tool in Go,
11:00
and you can clearly see that doALot is the method that comes out on top. There are different ways to look at the value. You have the flat count, which covers the function itself only, so you can see the 20 CPU cycles we had before. But we also have the cumulative
11:22
measurement, which also includes prepare, which gets called from doALot. We can already see that we spent 52% of our program in doALot, so maybe we can stop looking at the tail and just look at doALot: if we fix it or get rid of it, we need half as much CPU. That is represented by the sum column.
11:46
The sum always changes depending on how you order the table. In this particular example we reach 100% already at row number four, because we only have four functions.
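To make flat versus cumulative concrete, here is a small, hypothetical Go sketch (not pprof's actual implementation) that computes both columns from recorded stack samples, using the talk's numbers:

```go
package main

import "fmt"

// Sample is one recorded stack trace with the CPU cycles attributed to it.
// The leaf (the function that was running) is the last element of Stack.
type Sample struct {
	Stack  []string
	Cycles int
}

// Aggregate computes the "flat" value (cycles spent in the function itself)
// and the "cum" value (cycles spent in the function or anything it called),
// the two columns a pprof-style top table shows.
func Aggregate(samples []Sample) (flat, cum map[string]int) {
	flat = map[string]int{}
	cum = map[string]int{}
	for _, s := range samples {
		flat[s.Stack[len(s.Stack)-1]] += s.Cycles // only the leaf gets flat time
		seen := map[string]bool{}
		for _, fn := range s.Stack {
			if !seen[fn] { // count recursive frames once per sample
				cum[fn] += s.Cycles
				seen[fn] = true
			}
		}
	}
	return flat, cum
}

func main() {
	// The talk's example: 3 cycles in main itself, 20 in doALot,
	// 5 in doLittle, and 5 in each prepare call.
	samples := []Sample{
		{Stack: []string{"main"}, Cycles: 3},
		{Stack: []string{"main", "doALot"}, Cycles: 20},
		{Stack: []string{"main", "doALot", "prepare"}, Cycles: 5},
		{Stack: []string{"main", "doLittle"}, Cycles: 5},
		{Stack: []string{"main", "doLittle", "prepare"}, Cycles: 5},
	}
	flat, cum := Aggregate(samples)
	fmt.Println(flat["doALot"], cum["doALot"], cum["main"]) // 20 25 38
}
```

Note that the cumulative value of doALot counts prepare's time too, which is why ordering by cumulative time points at doALot as the place to start looking.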
12:00
To get a more visual sense of what is going on, there are the so-called flame graphs. I think the most confusing thing about them, for me, was the coloring. Obviously red is always bad, so should we look at main? No, we shouldn't.
12:21
The coloring, I think, is random or uses some kind of hashing, and is basically only meant to make the graph look like a flame. The red here doesn't mean anything; if you are colorblind, that's perfect for flame graphs. What we actually want to look at is
12:41
the leaf end: that is where the program spends its time. You can see the three CPU cycles of main itself here; there is nothing below them. main covers 100% through the methods it calls, but at this point there is nothing beyond it. In the same way, in doLittle we can see the five cycles, and
13:04
in doALot we can see the 20, quite wide, and then the prepare calls with five each. If you look across a really huge program, you can spot quite quickly what is going on. Frames like root and main
13:24
you can basically ignore, but the graph helps you locate which component of your program you want to look at. Maybe you are not good at naming and you call everything prepare and util; it would still tell you roughly where the code gets called and how the program gets there.
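A flame graph is the same sample data merged into a tree: each box is a function, and its width is the total value of all stacks passing through it. A minimal, hypothetical Go sketch of that aggregation (not any real renderer's code):

```go
package main

import "fmt"

// Node is one box in a flame graph: a function name, the cycles attributed to
// all stacks passing through it, and the functions it called.
type Node struct {
	Name     string
	Value    int
	Children map[string]*Node
}

func newNode(name string) *Node {
	return &Node{Name: name, Children: map[string]*Node{}}
}

// Add merges one stack trace into the tree rooted at n, adding the cycles to
// every frame on the path. This is the aggregation a flame graph draws.
func (n *Node) Add(stack []string, cycles int) {
	n.Value += cycles
	if len(stack) == 0 {
		return
	}
	child, ok := n.Children[stack[0]]
	if !ok {
		child = newNode(stack[0])
		n.Children[stack[0]] = child
	}
	child.Add(stack[1:], cycles)
}

func main() {
	root := newNode("root")
	root.Add([]string{"main"}, 3)
	root.Add([]string{"main", "doALot"}, 20)
	root.Add([]string{"main", "doALot", "prepare"}, 5)
	root.Add([]string{"main", "doLittle"}, 5)
	root.Add([]string{"main", "doLittle", "prepare"}, 5)
	// root and main both cover all 38 recorded cycles, which is why those two
	// boxes span the full width and can be ignored when reading the graph.
	fmt.Println(root.Value, root.Children["main"].Children["doALot"].Value) // 38 25
}
```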
13:46
So, how do we get that profile? That can involve quite a lot of challenges. I would say there are roughly two ways. Either your ecosystem supports profiles fairly natively, and then you
14:04
instrument the profiling yourself: you add a library or an SDK, and the runtime within your environment exposes the information. That is not available for all languages,
14:22
though there is a lot of work going on to make it more and more available. The other approach is through an agent, and eBPF has been quite hyped there. I'm not very familiar with eBPF myself; I have used the agents, but haven't written any of the code.
14:41
Basically, an eBPF agent takes an outside view: you don't need to change the running binary at all, you just look at the information you get from the Linux kernel. You hook in, often enough while the CPU runs, to find out
15:00
what is currently running. There are differences between languages here. In a compiled language you have a lot more information: the memory addresses stay the same, and you can use the information in the symbol table to figure out where in your program you are and what is currently running.
15:24
In interpreted languages like Ruby or Python, this might be a bit harder, and that information might not be accessible to the kernel without further work. Also, when you compile, you might strip those symbol tables, so
15:40
you really need to prepare your application a bit for this. Now I'm going to look at the prime example I'm most familiar with: I have been mostly a Go developer over the last couple of years, and
16:02
Go has quite a mature set of tools in this area. The standard library allows you to expose that information, and it supports both CPU and memory profiles. Especially in garbage-collected languages memory is quite a thing (in non-garbage-collected languages too), and
16:24
understanding memory usage is really important. I have a quick example of a program: you basically just expose an HTTP port where you can download a profile whenever you want, and you point the
16:40
pprof tool at it. In the first example line, you would get a two-second-long CPU profile: it watches the CPU for two seconds and records whatever is running and for how long, and then you get the file, and
17:00
pprof allows you to visualize it, for example through that top table. One thing I forgot to mention: later you can go to that URL yourself and look at the profile I used as an example. In the same way, you can fetch the memory allocations, and
17:22
pprof also allows you to launch an HTTP server, to be a bit more interactive and select certain code paths. There is quite a lot about profiling in the Go docs on go.dev, so I will leave it at that.
17:40
If you are a Go developer, use that and play around yourself. But now I want to speak about why profiling might actually be quite difficult. In my example we had three CPU cycles, and if you think about it, that is not very much. Just recording what the program was doing in those three CPU cycles probably takes, I have no idea,
18:06
thousands of CPU cycles, so you really want to be careful about what you record. If you really recorded all of it, your program would probably have a massive overhead: it would slow down under profiling and behave completely differently.
18:24
You would also have a lot more data to store and analyze, and if you think about microservices with a replica count of 500, you would get quite a lot of data that is not actually useful to you. Do you really care about three CPU cycles? Probably not. Because of that,
18:47
to allow continuous profiling, that is, profiling in production across a wide set of deployments, Google were, I think, the first ones to do it, and they started sampling those profiles.
19:02
Instead of looking at really every piece of code that runs, Go, for example, looks 100 times a second at what is currently running on the CPU and records that. A cheap integer add will probably never be caught on the CPU unless you run it all the time, so you still get a really accurate
19:24
representation of what is really taking your CPU time. With the way this works, you also need to be aware that the program might not be on the CPU at all, because it might be waiting for IO. So when you collect a profile and it does
19:41
not show that many seconds, you really need to think: is this what I want to optimize, or am I not seeing what I actually want to see? For that statistical approach, I don't have any
20:00
sources to cite, but I think generally you can say it adds about a two to three percent overhead on top of your program's execution, which is a lot more reasonable than the full approach of recording everything. So what would you generally do? Obviously, you first need to ship your application somewhere and run it.
20:22
Then you can look at the profiles and think about them. Maybe you are the owner of that code, maybe you have a bit more understanding, and those profiles can give you a bit more of an idea of what you are actually doing there, or how the system is reacting to your code.
20:43
For that green box, multiple solutions exist. I'm obviously a bit biased, and I also have to say our project is fairly young and evolving. For example, there is CNCF Pixie, which is eBPF-based;
21:00
there is Polar Signals' Parca (its people are in the room); there is Pyroscope; and there is our solution. I think they are all great: you can start using any of them and explore, maybe with just your benchmarks for a start, and as you get more familiar you will
21:21
discover more and more of the value there. I'm still going to use Phlare now for my quick demo. Let me just see.
21:40
Most of you are probably familiar with Grafana and Explore. Why is it so huge? So, this is the entry point you are going to see in Explore. You have the typical time range selection.
22:04
Let's say we want to see the last 15 minutes. Here we can see the profile types collected; this is just a Docker Compose setup running locally on my laptop, hopefully still running since I started the talk. For example, we can look at the CPU here, that is, the
22:22
nanoseconds spent on the CPU, and you can see the flame graph from earlier (maybe there is some bug, I don't know, it looks a bit bigger than it usually should be). We can see the top table, and we can see the aggregation across all the services. I'm running five pods or so, in different languages,
22:43
so you can see, for example, that this here is a Python main module doing some prime numbers. What I first want to do here is break down by label, and that is really the only querying functionality that we have. Here we look at the different instances, and
23:05
we see the CPU time spent per instance. There is a Rust pod and there is Phlare, and they are both blue, so I don't know which is which, but I guess Phlare is doing more, so that might be the Phlare one.
23:21
For my example now, I want to look at a small program that I wrote to show the time aspect. Here we can see the timeline: a profile gets collected, I think, every 15 seconds, and each one is basically a dot. The flame graph and the top table below
23:44
just aggregate over the selection, so there is no time component in them; that is also important to understand. While we were looking at memory, I'm now going to switch to the allocated space.
24:02
Here we have the label selection you might be familiar with, and for this random pod you can see the allocations, that is, the amount of memory being allocated. It is around six megabytes, but then
24:22
roughly every five minutes you can see a peak. If you look at the flame graph, there is already a big red box (the colors don't matter), and you can see that this piece of code is doing the majority of the allocations.
24:46
You could even zoom in here if you really want to figure it out; it gets bigger and you can see more of what is going on. So now,
25:03
if we actually want to look at the code for this: maybe in Phlare version 0.6 we could even see the exact line of code we should look at. Right now you can't, but basically
25:20
it would show us the allocations in line 21, and I guess most of you can see what this program is doing: every five minutes it has a peak of allocations. You only see that because you have the time component; you can select the peaks and then see the flame graph aggregation. Cool.
25:40
Yeah, that was almost my talk. I have one more slide that I quickly want to show. In the latest Go version, 1.20, there are profile-guided optimizations, and I think that might become a really big topic. What PGO does is look at your profile, which can come from production, from benchmarks, from wherever, and
26:01
try to make better decisions at compile time about what to do with your code. I think the only thing it does right now is make inlining decisions: if it sees that a method is called a lot and is in the hot path, it will decide to inline that method; if the method is a bit colder, it will not.
26:24
You can be a lot more accurate as a compiler if you have that kind of data and know whether a method is in the hot path or not. Okay, that was it, thanks a lot. Hey, thanks a lot, that was awesome.
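The PGO workflow from this closing slide boils down to feeding a CPU profile back into the build. The commands below are a sketch against Go 1.20's `-pgo` flag; the profile URL assumes a net/http/pprof endpoint like the one shown earlier, and the file and binary names are made up:

```shell
# 1. Collect a representative CPU profile (from production, staging, or a benchmark).
curl -o cpu.pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# 2. Rebuild with the profile; the compiler uses it for its inlining decisions.
go build -pgo=cpu.pprof -o myservice .

# Alternatively, commit the profile as default.pgo in the main package
# directory and build with:
go build -pgo=auto -o myservice .
```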
26:42
Questions? Thank you. Thank you for the talk. I'm just wondering: how would profiling work with very multi-threaded code? Is there the ability to drill down to that level? Yeah, so
27:01
in terms of multi-threading: we only had the main method in that example, so you see root and then main at 100%; if it's multi-threaded, you would have more stacks side by side. It is really only the stack trace that gets recorded, so you would not see the connections, where a thread was spawned off, and things like that.
27:23
You would just get the stack trace, the call stack. Have you looked into any other profiling formats beyond pprof ingestion? I know OpenTelemetry has been doing some work on introducing a profiling format that people can standardize on, but I don't know if you've looked at that at all.
27:44
Sorry, can you repeat that? I struggled to hear. (The question was whether we have looked at any other profiling ingestion formats besides pprof.) Right now we use pprof, personally, with a bit of Phlare-specific handling.
28:05
I think there are a lot of improvements to be had over that format, and as far as I know there is some active work around OpenTelemetry to arrive at a better format, in the sense of not sending symbols over and over again and reducing ingest.
28:22
But I don't know the details accurately, so that's the short answer. Thank you for the talk. My question: looking at the Phlare architecture, it is currently a pull model, so the Phlare agent is scraping the profiling data from the applications it is configured to scrape. My question is:
28:41
is there an eventual plan to also add a push gateway, or a similar API, for applications where that might be more suitable? Yeah, definitely. I can also see the push use case, for example if you want to push your micro-benchmarks from CI/CD. The API in theory allows it, but the tooling is missing.
29:01
I definitely think push is a valid use case as well; in terms of scalability I think pull will be better, but yeah, I agree. Thanks for the talk, I have a small question. Did you try to use this
29:23
tooling at the end of a CI/CD pipeline, for continuous optimization? No, we are not using it for that yet. I think it is definitely a super useful thing, because you want to see how a pull request behaves, maybe how your application allocates more or less in
29:43
different parts, and whether the trade-offs are right. I think it definitely can and should be used for that, but there is no tooling right now. Mm-hmm. Yeah, no, I fully agree as well.
30:05
Hello, thank you. If I understand correctly, profiles are something like traces combined with OS metrics, right? So at a concrete, specific time you can see how much CPU was used and so on?
30:21
Yeah, although profiling looks a bit more at the actual line of code. I haven't used tracing where it automatically finds the function; maybe that also tells you the line of code. But yes, profiling definitely adds metrics without you doing much,
30:40
as long as it can read the symbol tables and the function names. So I just had a simple idea: you already have node exporter, which exposes OS metrics at all times, and you have traces, for example, so
31:02
couldn't you just have some kind of integration in Grafana, or somewhere else, that combines traces with metrics? Yes; I think people who have worked longer on continuous-profiling software tried to reuse Prometheus for this, and where you end up is very high cardinality:
31:20
too many lines of code, and that is where it stops. In theory, most PromQL constructs and functions are something we would need to implement on top of this in a similar way, because you do just get metrics out of it. Basically the problem was too many lines of code, changing too much over time,
31:43
so you just get too much series churn. So, thanks a lot.