
Adopting continuous profiling: Understand how your code utilizes CPU/memory

Formal Metadata

Title
Adopting continuous profiling: Understand how your code utilizes CPU/memory
Subtitle
Introduction to continuous profiling and how it can help you write more efficient code
Author
Christian Simon
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
With the popularity of observability tooling to analyze logs, metrics and traces, it has become easier than ever to find the bottleneck in your software stack. Once you have identified the particular system introducing a user-facing performance degradation, as its developer you need to understand which part (ideally down to the function and line of code) is slowing it down. With that insight you are able to effectively optimize your application. In this talk I will show how profiles are collected, how they can be aggregated and visualized, and how those insights can be used to optimize your code. While there is a focus on the Go ecosystem, most of the content of the talk should be transferable to other languages.
Transcript: English (auto-generated)
Yeah, hi everyone. My name is Christian Simon, and I'm going to be talking about continuous profiling. We've heard a lot about observability today already, and I want to introduce maybe an additional signal there.

Quickly about me: I'm a software engineer at Grafana Labs and have worked on our databases for observability. I worked on Cortex, then on Loki, and most recently I switched to the Phlare team — I'm 50% of the Phlare team — where we work on a continuous-profiling database. There's not going to be a particular focus on Phlare today, though. Basically, what I want to do today is introduce how profiling is measured and what you can achieve with it, and then maybe next time, as I learn more, I can go into more detail and look at very specific languages.

So, when we talk about observability, what are our common goals?
Obviously we want to ensure that the user journeys of our users are successful, and that we can maybe even find problems proactively, before a user notices them. Basically, we want to react as quickly as possible when those problems happen, so we disrupt fewer of those user journeys. Observability gives us an objective way of getting some insight into the state of our system in production. And even after a certain event has happened — say we rebooted and everything is up again — it can help us understand what exactly happened when we want to figure out the root cause of something.

One of the easiest and probably oldest observability signals is logs.
I guess it starts with a kind of print("Hello") somewhere in your code, and I probably don't need a show of hands for who's using it — everyone somehow uses logs, or is asleep and doesn't raise a hand. Your application doesn't need any specific SDK; it can probably just log using the standard library of your language. One of the challenges is that most of the time the format is rather varied — even just in terms of timestamps it can be really hard to get a common understanding of your log lines. Also, when you then want to aggregate them, it's quite costly: it's bytes, and you need to convert them to ints and floats and so on. Another thing that can happen is that you have so many logs that you can't find the ones you're actually interested in. Grepping for "error", for example, could be helpful, but there might be just too many errors and you lose the important ones. If you want to learn more about logs, my colleagues Owen and Kavi are going to speak about Loki, so definitely stay in the room.
I'm going to move on to the next signal: metrics. I'm also assuming pretty much everyone has come across them and is using them. In this case you avoid the problem I mentioned before: you have the actual integers exposed, and you know which of those numbers you care about. To get a metric, most of the time you have to define a list of metrics you care about and then collect them — so you might have an outage and not have the metric you care about, which means you need to constantly improve the exposure of your metrics. Prometheus is the main tool in that space. Very often we're talking about web services when we think about those applications, so the RED method — get the rate of your requests, the error rate of your requests, and the duration (latency) of your requests — can already cover quite a lot of cases. And since it's just integers or floats, you can aggregate them quite efficiently across a multitude of pods or a really huge setup of services.
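As a rough PromQL sketch of the RED method for an instrumented HTTP service (the metric and label names here are assumptions, not something from the talk):

    # Rate: requests per second, per service
    sum by (service) (rate(http_requests_total[5m]))

    # Errors: share of requests that failed
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (service) (rate(http_requests_total[5m]))

    # Duration: 99th-percentile latency from a histogram
    histogram_quantile(0.99,
      sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))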
Then, if you get more into the kind of microservices architecture that has evolved over the last couple of years, you will find yourself with a really complex composition of services involved in answering requests, and you might struggle to understand what's slowing you down or where an error is coming from — why do I have this timeout here? Distributed tracing can help you a lot with understanding what your services are doing. It might also be that a service is doing way too much and calculating things over and over again, so tracing is super helpful for seeing the flow of data through your system. The challenge is that you might have a lot of requests, and while tracing is somewhat cheap, you might not cover all of them. For example, in our production systems — someone correct me if I'm wrong — when we receive logs and metric data in Grafana Cloud, we only trace about 1% of those requests, while we trace 100% of our queries. You have to make a selective decision about where it's worth investing: log and metric ingestion looks pretty much the same every second, while for queries there's complex caching and all sorts of systems interacting, so we see more value in tracing all of those. That allows us to look a bit deeper and find the one service that may be the bottleneck.
So let's look at a bit of a real problem. I have an online shop selling socks, and a user is complaining about getting a timeout when they want to check out. That's obviously not great, because I'm not selling the socks — but at least the user got some trace ID that they can pass on to our customer service. Starting from there, I might figure out: hmm, it's the location service that actually caused the timeout in the end. Then, looking at the metrics of the location service, I might find that actually 5% of requests are timing out, so maybe 5% of my users are not able to buy their socks. So what are the next steps? Scaling up is always good — maybe the service is just overloaded, the person who wrote it left years ago, and we have no idea what it does, so we just scale it up. Obviously that comes with a cost, but the first thing is always to stop the bleeding and make sure there are no more timeouts, so scaling up is definitely the right option here. But if you keep doing that over the years, you might suddenly find yourself with a lot of extra cost attached to that location service. And that's where I think we need another signal, and that signal should be profiling.
I guess most people have come across profiling at some point. It basically measures your code — how long it executes, for example, or how many bytes it allocates in memory — and it helps you understand your own program better, or someone else's program in the location-service case. Eventually that can translate into cost savings: if you find out where the problem lies you can maybe fix it, or at least get some ideas, and you can also look at the fix afterwards and see whether it actually got better or worse. It basically gives you a better understanding of how your code behaves.
So the next question is: what is actually measured in a profile? I created a bit of an example program — I hope everyone can see it. It's basically a program with a main function that calls out to other functions: there's a doALot and a doLittle function, and both of them call prepare. In the comments there's some work going on, and that work could be allocating memory, using the CPU, something like that. Let's say it uses CPU. When the function starts, we first do something within main — say we spend three CPU cycles, which is not a lot, but that gets recorded: we spent three CPU cycles in main. We then go into the prepare method through doALot and spend another five CPU cycles there, and those stack traces are what gets recorded in the profile.
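A minimal Go sketch of that example program might look like this (the names doALot, doLittle and prepare are my reconstruction of the slide, not the original source):

    package main

    func main() {
        // some work directly in main, say ~3 CPU "units"
        doALot()
        doLittle()
    }

    func doALot() {
        prepare()
        // the hot part: most of the CPU time is spent here
    }

    func doLittle() {
        prepare()
        // a smaller amount of work
    }

    func prepare() {
        // shared setup work called from both paths
    }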
Ten lines of code you can maybe spot where the problem is like that's the 20 and do a lot it definitely like kind of breaks down when you're speaking about like a lot of services or like a lot of code base that this happened or happens to be hot and Actively used and so like there are a couple of ways of visualizing them
I think one of the first things you would find is a top table. In that table you can order by different values — this is an example from pprof, which is kind of the go-to tool — and you can see clearly that doALot is the method that comes out on top. There are different ways to look at the value: you have the flat count, which is the function itself only, so you can see the 20 CPU cycles we had before; and you have the cumulative measurement, which also includes prepare, which gets called from doALot. We can already see that we spent 52% of our program in doALot, so maybe we can stop looking at the tail and just look at doALot — because if we fix doALot or get rid of it, we need half of the CPU less. The sum column is the running total as you go down the table, so it changes depending on how you order the table; in this example we reach 100% already at row number four, because we only have four functions.
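For that toy program, the pprof top table would look roughly like this (simplified output, with illustrative numbers in "cycle" units rather than real measurements):

    $ go tool pprof -top cpu.pprof
          flat  flat%   sum%        cum   cum%
            20 52.63% 52.63%         25 65.79%  main.doALot
            10 26.32% 78.95%         10 26.32%  main.prepare
             5 13.16% 92.11%         10 26.32%  main.doLittle
             3  7.89%   100%         38   100%  main.main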
To get a bit more of a visual sense of what's going on, there are the so-called flame graphs. The most confusing thing about them for me was the coloring — red is always bad, right, so should we look at main? No: the coloring is random, or uses some kind of hashing, and is basically only meant to make it look like a flame. The red doesn't mean anything, so if you're colorblind, flame graphs are perfect. What we actually want to look at is the leaf end — that's where the program spends its time. You can see the three CPU cycles in main here: there's nothing below that part, so it's spent in main itself, while the rest of main's 100% goes through the methods it calls. In the same way, under doLittle we can see the five, under doALot the 20 — quite wide — and then the prepares with five each. If you look at a really huge program this way, you can spot what's going on quite quickly. Rows like root and main you can basically ignore, but the graph helps you locate which component of your program you want to look at — maybe you're not good at naming and you call everything prepare and util, and it would still tell you roughly where that gets called and how the program gets there.
So how do we get that profile? That can come with quite a lot of challenges. I would say there are roughly two ways. Either your ecosystem supports profiles fairly natively, and you instrument the application — you add maybe a library or an SDK — and the runtime within your environment exposes the information. That's not available for all languages, but I guess there's a lot of work going on to make it more and more available. The other approach is more agent-based, and eBPF has been quite hyped there. I'm not very familiar with eBPF myself — I have used the agents, but haven't written any code — but basically it takes an outside view: you don't need to change the running binary, you just look at the information you get from the Linux kernel, hooking in often enough while the CPU runs to find out what is currently running. Languages differ here: in a compiled language you have a lot more information — the memory addresses stay the same, and you can use the symbol table to figure out where your program is and what is currently running. In interpreted languages like Ruby or Python this might be a bit harder, and that information might not be accessible to the kernel without further work. Also, when you compile, you might drop those symbol tables.
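In Go, for example, stripping debug information at build time is one common way to lose that data (a hedged example; how much this matters depends on the profiler you use):

    # -s omits the symbol table, -w omits the DWARF debug info; an external
    # profiler that relies on them will struggle to symbolize stack traces
    go build -ldflags="-s -w" -o app .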
So you really need to prepare your application a bit for that. Now I'm going to look at the example I'm most familiar with: I've mostly been a Go developer over the last couple of years, and Go has quite a mature set of tools in that area. The standard library allows you to expose that information; it supports CPU and memory profiles, among others. Especially in garbage-collected languages memory is quite a thing — in non-garbage-collected languages too, but there it's really important to understand the usage. I have a quick example of a program: you basically just expose an HTTP port where you can download a profile whenever you want, and you have the pprof tool that you can point at it. With the first example command you would get a two-second CPU profile — it looks at the CPU for two seconds and records whatever is running and for how long — and then you get the file, and pprof allows you to visualize it through that top table, for example. What I forgot to mention: you can maybe go to that URL later and look at the example profile I had there. In the same way you can get the memory allocations, and pprof also allows you to launch an HTTP server to be a bit more interactive and select certain code paths. There is quite a lot about profiling in the Go docs on go.dev, so I'll definitely leave it there.
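A minimal sketch of that setup, based on the standard net/http/pprof package (the port and the surrounding service code are assumptions):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    )

    func main() {
        // expose the profiling endpoints on a separate port
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()
        // ... the actual service would run here ...
        select {}
    }

And the corresponding pprof invocations:

    # 2-second CPU profile, shown as a top table
    go tool pprof -top 'http://localhost:6060/debug/pprof/profile?seconds=2'

    # heap / allocation profile
    go tool pprof 'http://localhost:6060/debug/pprof/heap'

    # interactive web UI (top table, flame graph, source view)
    go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=2'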
So if you're a Go developer, use that and play around yourself. But now I want to talk about why profiling might actually be quite difficult. In the example I had three CPU cycles, and if you think about it, that's not very much — just recording what the program was doing in those three CPU cycles probably takes, I have no idea, thousands of cycles. So you really want to be careful about what you record. If you really recorded all of it, your program would have a massive overhead, would slow down and behave completely differently, and you would also have a lot more data to store and analyze. And if you think about microservices with a replica count of 500, you might get quite a lot of data that is actually not that useful to you — do you really care about three CPU cycles? Probably not. Because of that, to allow continuous profiling — doing this in production across a wide set of deployments — I think Google were the first ones to do it, and they started to sample those profiles. Instead of looking at literally every piece of code that runs, Go for example looks 100 times a second at what is currently running on the CPU and records that. A tiny helper that just adds integers will probably not show up if it doesn't run all the time, so you get a really accurate representation of what is actually taking your CPU time. You also need to be aware that the program might not be on the CPU because it's waiting for IO, so when you collect a profile that only covers a few seconds, you really need to think about whether this is what you want to optimize, or whether you're maybe not seeing what you actually want to see.
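For reference, this is roughly what an explicit, sampled CPU profile looks like with Go's runtime/pprof — the runtime samples the running stacks at about 100 Hz by default (the file name is an assumption):

    package main

    import (
        "log"
        "os"
        "runtime/pprof"
    )

    func main() {
        f, err := os.Create("cpu.pprof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // start sampling: the runtime records the current stacks ~100 times per second
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()

        // ... run the workload to be profiled ...
    }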
With that kind of statistical approach — I don't have any sources to cite, but I think it's generally said to add around a two to three percent overhead on top of your program execution, which is a lot more reasonable than the full approach of recording everything. So what do you generally do? First you need to ship your application somewhere and run it, then you can look at the profiles and think about them. Maybe you're the owner of that code and have a bit more understanding, and those profiles can give you a better idea of what you're actually doing there, or how the system is reacting to your code.
For that green box on the slide, multiple solutions exist. I'm obviously a bit biased, but I also have to say our project is fairly young and evolving. For example, there's Pixie (CNCF, eBPF-based), there's Polar Signals' Parca — people from that project are in the room — there's Pyroscope, and there's our solution. I think they're all great: you can start using any of them and exploring, maybe just with your benchmarks for a start, and as you get more familiar with it you might discover more and more of the value there. I'm still going to use Phlare now for my quick demo. Let me just see.
Most of you are probably familiar with Grafana and Explore — why is it so huge? — and that's basically the entry point you're going to see. In Explore you have the typical time range selection; let's say we want to see the last 15 minutes. Here we can see the profile types collected — this is just a Docker Compose setup running locally on my laptop, hopefully still running since I started the talk. For example, we can look at the CPU here, the nanoseconds spent on the CPU, and you can see the flame graph from earlier — maybe some bug, it looks a bit bigger than it usually should — and the top table. This is the aggregation over all the services; I'm running five pods or so in different languages. For example, this one here is a Python main module doing some prime numbers. What I first want to do is break it down by label, and that's really the only kind of functionality we have in terms of querying. Here we look at the different instances and see the CPU time spent per instance — there's a Rust pod and they're both blue, so I don't know which is which, but I guess Phlare is doing more, so that's probably the Phlare one. For my example I now want to look at a small program that I wrote to show this aspect. Here we can see the timeline: a profile gets collected, I think, every 15 seconds, and each one is basically a dot; the flame graph and the top table below
just aggregate all of that, so there's no time component in them — that's also important to understand. While we were looking at memory, I'm now going to switch to the allocated space. Here we have some label selection that you might be familiar with, and for this particular pod you can see the allocations, the amount of memory that gets allocated. It's around six megabytes, but roughly every five minutes you can see a peak. If you look at the flame graph, there's already a big red box — remember the colors don't matter — and you can see this piece of code is doing the majority of the allocations. You could even zoom in here if you really want to figure it out; then it gets even bigger and you can see more of what's going on. Now, if we actually want to look at the code for this — maybe when Phlare is at version 0.6 we could even see the line of code we should look at; right now you can't — basically it would show us allocations in line 21, and I guess most of you can see what this kind of program is doing: every five minutes it has some peak of allocations. You only see that because you have the time component; you can select those peaks and then see the flame graph aggregation for them. Cool.
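A minimal sketch of a program behaving like that demo service — the five-minute burst is taken from the description above, everything else (sizes, names) is an assumption:

    package main

    import "time"

    // keep references around briefly so the allocations show up in the profile
    var sink [][]byte

    func main() {
        tick := time.NewTicker(5 * time.Minute)
        for {
            select {
            case <-tick.C:
                burst() // shows up as a periodic peak of allocated space
            default:
                steady() // small, steady baseline allocations
                time.Sleep(time.Second)
            }
        }
    }

    func steady() {
        sink = append(sink[:0], make([]byte, 1<<20)) // ~1 MiB
    }

    func burst() {
        for i := 0; i < 64; i++ {
            sink = append(sink, make([]byte, 1<<20)) // ~64 MiB in one burst
        }
        sink = sink[:0] // drop references so the GC can reclaim the memory
    }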
Yeah, that was almost my talk — I have one more slide I quickly want to mention. In the latest Go version, 1.20, there is profile-guided optimization, and I think that might be a really big topic. What it does is look at your profile — which can come from production, from benchmarks, from wherever — and try to make better decisions at compile time about what to do with your code. I think the only thing it does right now is make inlining decisions: if it sees a method is called a lot and is in the hot path, it will decide to inline that method, and if it's a bit colder, it won't.
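A rough sketch of how that is enabled in Go 1.20 (the file and package paths here are assumptions; -pgo=auto picks up a default.pgo next to the main package):

    # put a representative CPU profile (from production or a benchmark)
    # next to the main package as default.pgo ...
    cp cpu.pprof ./cmd/app/default.pgo

    # ... and build with profile-guided optimization enabled
    go build -pgo=auto ./cmd/app

    # or point the build at a profile explicitly
    go build -pgo=./cpu.pprof ./cmd/app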
The compiler can be a lot more accurate if it has that kind of data and knows whether a method is in the hot path or not. Okay, that was it — thanks a lot.
That was awesome. Questions?

Thank you for the talk. I'm just wondering how profiling works with very multi-threaded code — is there the ability to drill down to that level?

In terms of multi-threading: in that example we obviously only have the main method, so you see root and then main at 100%. If it's multi-threaded you would have more entries, but it's really only the stack trace that gets recorded — you would not see the connections, where a thread spawns off and things like that. You would just get the call stack.

Have you looked into any other profiling formats besides pprof ingestion? I know OpenTelemetry has been doing some work on introducing a profiling format that people can standardize on — I don't know if you've looked at that at all.

Sorry, can you repeat that? I struggled to hear. — I was wondering if you've looked at any other profiling ingestion formats other than pprof. — Right now we use pprof, with a bit of a Phlare twist on it. I think there's a lot of improvement to be had over that format, and as far as I know there's some active work around OpenTelemetry to come to a better format, in the sense of not sending symbols over and over again and reducing the ingest. But I don't know the accurate and short answer.
But not know the accurate and short answer So thank you for a talk. And my question is that looking at the flare architecture It's currently a pool model So the flare agent is scraping the profiling data from the applications that they configure it to scrape my question is
Is there an eventual plan to also add maybe a push gateway or similar API for applications where this might be suitable? Yeah, like definitely like I think I also can see kind of the push use case for maybe if you want to get Your micro benchmarks from CI CD in Like the API in theory allows it but tooling is missing
But I definitely think it's a valid like push use case as well. I think in terms of scalability I think pool will be better but Yeah, I agree Thanks for the talk. I have small question Did you try to implement this
Tooling in the end of the CI CD and co continuous optimization No, like so we're not using it yet for that. I think it's it's definitely a super useful thing because like Yeah, like you you want to see maybe how a pull request behaves like maybe how your application allocates more or less in
Different parts and and if the trade-offs are right there But yeah, I think it definitely can and should be used for that. But no no tooling right now Mm-hmm. Mm-hmm. Yeah. No, I fully agree as well. Yeah
Hello, thank you. So, if I understand correctly, profiles are something like traces combined with OS metrics, right? At a concrete, specific point in time you can see how much CPU was used and so on?

Yeah — I guess it looks a bit more at the actual line of code. I haven't used tracing that automatically finds the function; maybe that also tells you the line of code. But profiling definitely adds metrics to your code without you doing much, as long as the tooling can read the symbol tables and the function names.

So I just had a dumb question, or a dumb idea: couldn't you just combine things you already have? You have node exporter, which exposes OS metrics at all times, and you have traces — couldn't you have some kind of integration in Grafana, or somewhere else, that just combines traces with metrics?

Yes — I think people who have worked on continuous-profiling software for longer did try to reuse Prometheus for this, and where you end up is just very high cardinality: too many lines of code, and that's where it stops. In theory most PromQL constructs and functions are something we'd need to implement on top of profiles in a similar way, because you essentially get metrics out of them. But the basic problem was too many lines of code changing too much over time — you just get too much series churn that way.

So, thanks a lot.