
Stack walking/unwinding without frame pointers


Formal Metadata

Title
Stack walking/unwinding without frame pointers
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Sampling CPU profilers periodically fetch the stacks of the profiled processes that are running on-CPU at a given time. Walking the stacks of native processes is possible with a little work when frame pointers (FPs) are present. But most binaries in the real world are not compiled with FPs, so profilers that have to walk stacks without frame pointers face a far more complicated job. In this talk, we discuss how stacks can be walked using the DWARF CFI (mainly .eh_frame). We also discuss how eBPF helps us do that, and how extending the current stack-walking facilities can be useful for interpreted languages, such as Ruby, as well as runtimes with JITs, like the JVM.
Transcript: English (auto-generated)
Okay. Welcome, everyone. Today we are going to talk about walking native stacks in BPF without frame pointers. My name is Vaishali; I work at Polar Signals as a software engineer. I mostly work on profilers and eBPF-related things. Before that, I worked on different kernel subsystems as part of my job. My name is Javier, and I've been working at Polar Signals for a year, mostly on native stack unwinding, the work we're going to introduce today. Before that, I was working on web reliability, tooling for developers, and performance at Facebook.
Yeah. So before we dive into the topic, I wanted to go through the agenda. First, we want to talk about why there is a need for a DWARF-based stack walker in eBPF, because that's the most-asked question. Then we will go into the design of our stack walker, talk about how we went from the prototype to making it production-ready, and finish with a bunch of learnings so far and some future plans. As we said, we work on a production profiler. Sampling profilers collect stack addresses at particular intervals and attach values to them. Note that profilers generally need the user- and application-level stacks as well as the kernel stacks, and collecting them involves iterating over all the stack frames and gathering the return addresses. Historically, there has been a dedicated register for that, called the frame pointer, on both x86 and x86-64.
But in recent times, because of compiler optimizations, frame pointers have been disabled by default in most runtimes and distributions. That makes things really hard: instead of a couple of memory accesses per frame, which is quite fast, the stack walker has to do much more work. Note that stack walking is also common practice in debuggers, as you all know. So what's the current state of the world? Well, it's not a problem for the hyperscalers, because they already compile all of their applications with frame pointers enabled in production. This is important because when things break and you want to inspect them, it's really helpful to have the whole stack available when it's needed. There was also a recent discussion on the Fedora mailing list; feel free to go through it. The TL;DR of that discussion is that frame pointers are going to be enabled starting with, I think, Fedora 38, which is about to be released in late April.
macOS binaries have always had frame pointers enabled. There's also amazing work going on by Oracle engineers on a simpler unwind format to use instead of DWARF, and we hope that work goes through and helps the ecosystem. So that's the current status, but we want this right now, and we want support across all the distros as well as all the runtimes, where today it's scattered here and there. For example, the Go runtime has enabled frame pointers since Go 1.7 on x86-64, and since 1.12 on ARM64. So now, some of you might be wondering: if not frame pointers, what do we have? Take Rust, for example, where frame pointers are disabled by default,
but when a panic happens, you still get a full backtrace. So how does that work? Well, compilers always have this information. We generally need to know the value of the stack pointer in the previous frame, and without a frame pointer it can be at any offset. With that, we can always find the return addresses and continue unwinding the stack. This information is generally encoded in the .eh_frame section, or in the .debug_frame section of DWARF. There's also another approach: unwind tables can be synthesized from the object file, which is what the ORC format does; it has been used in the kernel for a while now. We will talk about .eh_frame in detail in a minute,
but first, let's see whether anybody else is already using .eh_frame, and of course someone is. The profiler we have developed is not the first to use it. perf, the popular profiler from the Linux kernel, has had DWARF support since the perf_event_open syscall was added, which was in 3.4, and it can collect the registers of the profiled processes, as well as a copy of the stack, for every sample. While this approach has been proven to work, there are a bunch of drawbacks to it. For example, perf collects whole stacks and copies them into user space. The second thing is that the implications of one process holding another process's data in user space, at scale, can be quite complicated,
and it's also a lot of data, especially for CPU-intensive applications. So why BPF? Stack walking in BPF makes a lot of sense for our profiler, because we don't have to copy the whole stack. A lot of the information stays in the kernel, which provides higher safety guarantees, especially for a stack-walking mechanism. Once it's implemented, we can leverage the perf subsystem to sample CPU cycles, instructions, L3 cache misses, and so on. It can also help us develop other tools, like allocation tracers and runtime-specific profilers, for example for the JVM or Ruby. Some of you who are familiar with BPF may know that there is already the bpf_get_stackid helper, so why implement something different? Well, as it turns out, the problem with that helper is that it also requires frame pointers: it uses them to walk the stack. And for historical reasons, a fully featured DWARF unwinder is unlikely to ever be part of the Linux kernel.
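For illustration, here is roughly what using that built-in helper looks like; a minimal sketch with libbpf-style declarations, where the map name and sizes are our assumptions:

```c
#include <linux/bpf.h>
#include <linux/bpf_perf_event.h>
#include <bpf/bpf_helpers.h>

/* The built-in stack-trace map that bpf_get_stackid() fills in. */
struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, 1024);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, 127 * sizeof(__u64)); /* PERF_MAX_STACK_DEPTH slots */
} stack_traces SEC(".maps");

SEC("perf_event")
int do_sample(struct bpf_perf_event_data *ctx)
{
    /* Walks the user stack by chasing frame pointers. With binaries
     * built with -fomit-frame-pointer this yields truncated stacks. */
    long stack_id = bpf_get_stackid(ctx, &stack_traces, BPF_F_USER_STACK);
    if (stack_id < 0)
        return 0;  /* the frame-pointer walk failed */
    /* ... count the sample under stack_id ... */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```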
So, before we dive into how we use .eh_frame with BPF, let's look at what .eh_frame actually has to offer. The .eh_frame section contains one or more pieces of call frame information (CFI). The main goal of the call frame information is to answer how to restore every register of the previous frame at any point of the code's execution. Directly storing a table containing every program counter and, for each register, its location, such as whether it has been pushed to the stack or not, et cetera, would generate huge unwind tables. For that reason, DWARF CFI is actually quite compact and very space-efficient. The unwind tables encoded in the CFI format come in the form of opcodes, and those opcodes need to be evaluated. In our stack-walking case, once they have been evaluated, we generate a table that contains, for each instruction boundary, how to restore the value of each register.
There are two main layers to it. The first helps with repetitive patterns that compress well and allows a more compact representation of the data; in some cases there are specialized opcodes that consume, say, one to four bytes rather than four bytes every time. The second layer is a special opcode that contains another set of opcodes encoding arbitrary expressions, and those need to be evaluated, which is a lot of work. The main difference between the two is that for the first we just need a couple of values, but for the second we actually need to evaluate arbitrary, close-to-Turing-complete expressions. For that reason we would need a full-blown VM implemented in BPF itself, which is not very practical. For those who don't know how BPF applications generally work, this is what the flow looks like from a very high-level point of view.
In user space you have the driver program, written in Go in our stack; we are using Go bindings for libbpf there. We create the maps and attach the BPF program to a CPU-cycles perf event. The driver reads, parses, and evaluates the .eh_frame section of the process and of all its dynamic libraries. And in the BPF program, using the PID, we fetch the table, and then an unwind algorithm processes the DWARF-derived information.
We will go in depth on each component, but first let's see what the algorithm looks like. It's really simple: we read three important registers. The first one is the instruction pointer, RIP; the next one is the stack pointer, RSP; and the third one is, of course, the frame pointer, RBP. Then, while the frame count is less than or equal to the maximum stack depth we have defined, we find the unwind table row for the current program counter, add the instruction pointer to the stack trace, calculate the previous frame's stack pointer, update the registers with the values calculated for the previous frame, and continue with the next frame.
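Put as code, the loop might look like this minimal C sketch of the BPF side. It reuses the illustrative unwind_row from earlier; find_unwind_row and add_frame are assumed helpers, and error handling is elided:

```c
#define MAX_STACK_DEPTH 127

static __always_inline void walk_stack(struct bpf_perf_event_data *ctx)
{
    __u64 ip = ctx->regs.ip;  /* RIP: instruction pointer */
    __u64 sp = ctx->regs.sp;  /* RSP: stack pointer */
    __u64 bp = ctx->regs.bp;  /* RBP: frame pointer, possibly repurposed */

    for (int frame = 0; frame < MAX_STACK_DEPTH; frame++) {
        struct unwind_row *row = find_unwind_row(ip);  /* table lookup */
        if (!row)
            break;
        add_frame(ip);  /* record this frame's address */

        /* CFA: where the previous frame's stack pointer points. */
        __u64 cfa = (row->cfa_reg == CFA_REG_RBP ? bp : sp) + row->cfa_offset;

        /* On x86-64 the return address sits right below the CFA. */
        __u64 ra = 0;
        bpf_probe_read_user(&ra, sizeof(ra), (void *)(cfa - 8));
        bpf_probe_read_user(&bp, sizeof(bp), (void *)(cfa + row->rbp_offset));

        ip = ra;   /* continue unwinding in the caller */
        sp = cfa;
    }
}
```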
So, in a nutshell, that's the algorithm in BPF. Now, the important part is how we store that unwind information and what we have done to make it scalable. Javier will talk about that.

Cool. So now we have some unwind information, in a format we'll describe later, and we need to put it somewhere, right? One possibility would be to store this unwind info in the memory space of the applications we are trying to profile. We could do that, for example, with a combination of ptrace, mmap, and mlock: use ptrace to hijack the process's execution and make it allocate a new memory segment, and then lock it, because in BPF we need to make sure that the pages we access are never going to be swapped out. But this has a big problem: it would alter the execution flow of the program, and that is something we never want to do.
One of the reasons is that this memory would live in the process, which means accounting for it would be odd: developers would find a new memory segment that appeared out of the blue, and their metrics would change behind their backs, which is not great. But it's also because the lifecycle of managing this memory chunk is very complicated. For example, what do you do if our profiler dies before the processes it is introspecting? How do you clean it up? It involves a lot of problems, and solving them would require crazy engineering, which we were not planning to tackle, because it would overcomplicate the project a lot. The other problem is that sharing memory would be way harder, and accounting for our own overhead would also be very hard. If you think about it, libc is probably linked into most of the applications on your machine, so why keep the same information for every single process, right? So how do we actually store this information?
We use BPF maps, which are data structures that, as Vaishali said, can be written and read both from kernel and from user space. In particular we use hash maps, which in BPF are basically a mapping of bytes to bytes where you can put arbitrary information. This memory is always locked: BPF has a flag that lets you avoid locking the memory, but for the type of BPF program we use, locking is mandatory. Otherwise, as we said before, these pages could be swapped out, and in the type of BPF programs we have, page faults are forbidden.
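Declaring such a map could look like the following sketch; the names and the shard count are assumptions, while the roughly 250,000-rows-per-value limit is the one measured in the talk:

```c
#define MAX_UNWIND_ROWS 250000  /* measured per-value limit */
#define MAX_SHARDS      64      /* assumption: sized to available memory */

struct unwind_shard {
    __u64 len;                               /* rows currently in use */
    struct unwind_row rows[MAX_UNWIND_ROWS];
};

/* Written by the user-space driver, read by the in-kernel unwinder.
 * Values are pre-allocated, so they stay locked in memory. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_SHARDS);
    __type(key, __u64);                      /* shard id */
    __type(value, struct unwind_shard);
} unwind_tables SEC(".maps");
```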
In the end, we can reuse these mappings, because they live in a global BPF map that we control: we can store, say, libc's table in one particular area, and then add metadata for every single process that has that mapping, saying where it is. So here is a logical representation of unwind information for different executable segments: imagine libc, MySQL, zlib, systemd, and then an unused chunk. This logical view assumes one contiguous chunk of memory, but in reality we cannot allocate an arbitrary contiguous chunk of memory in BPF; we cannot ask for 200 megabytes, and we cannot do a malloc, right? So we've been doing some experiments, and on the systems we have tested and the kernels we want to support, we can add up to 250,000 unwind entries to each value of a BPF map.
Because we want to be able to fit larger unwind tables, we use the same solution you would use in any other data-intensive application: partitioning, or sharding. We split the unwind table into different shards and, depending on the memory your machine has, we allocate more or fewer shards ahead of time. That results in a CPU-to-memory trade-off, because when the shards get full we regenerate them; but we'll talk about that later. For example, take systemd's unwind table and suppose it's a bit bigger than 250,000 entries, so it doesn't fit in a single shard. Because it doesn't, we have to chunk it up: we store the first chunk in the first shard, and the little bit that's left in the second shard. Because we've added all these layers of indirection, we have some bookkeeping to do, and this metadata is also stored in BPF maps. So we have a process that has many mappings, each mapping can have one or more chunks, and each chunk maps to a particular shard. Because a shard may hold anywhere from one unwind entry up to 250,000, we keep some extra metadata that tells us exactly where the information lives; an illustrative layout follows.
Thanks to this way of organizing the data, we're able, as we said before, to share these executable mappings, and thanks to that we skip table generation for most of the executables on your box. As a result, we only spend 0.9% of CPU cycles on the processing Vaishali was talking about before, which is not the most complex process in the universe, but it does consume CPU cycles, because we need to read some information from your ELF file, find the section, read it, parse it, and interpret the DWARF information. Now we do it far fewer times: on your machine, we only do it once per unique executable section. So what happens if we run out of space? What we have implemented is basically a bump allocator. We keep appending information to the shards, which logically you can see as one big chunk of memory. Once it's full, at some point we decide to wipe the whole thing and start again.
But we do it in such a way that we give every single process a fair chance of being profiled. So let's take a look at how we do this. We start with the PID of some process on your machine that happens to be running on-CPU at that point. The first thing we do is check whether it has unwind information. If it does, we need to find the mapping covering the current instruction pointer for that frame. Then we need to find the chunk; we can do that with a linear search, because we know the minimum and maximum program counter covered by each chunk. Once we have the chunk, we have the shard information, and once we have the shard information, we have the unwind information. The problem is that the unwind information, as we said before, is basically an array, and we need to search it, and it can hold up to 250,000 items. If you have done anything in BPF, you know that you basically have to build whatever you need yourself: there is no binary search executed in kernel space for you, so you need to implement it yourself, which is not a big deal in general. Implementing binary search is not rocket science; the hard part, most of the time, is getting all the off-by-ones right. The problem here is that in kernel space we have a lot of limitations, which we'll talk about later, along with how we overcame them, because this produces quite a bit of code that has to be split logically. So not only are the data structures sharded, the code is sharded too. Anyway, we binary-search this with up to, well, seven iterations, let's say eight if you're feeling pessimistic. A sketch follows.
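A verifier-friendly bounded binary search might look like this sketch; the explicit bound and bounds checks are what keep the verifier happy, and on older kernels the real code additionally splits the search across tail calls:

```c
#define MAX_SEARCH_STEPS 18  /* log2(250,000) is about 18; the talk splits
                              * the work across tail-called programs */

/* Find the last row whose pc is <= the target pc. Returns the row
 * index, or -1 if pc precedes the first row. */
static __always_inline long find_row_index(struct unwind_shard *shard, __u64 pc)
{
    __u64 lo = 0, hi = shard->len;
    long found = -1;

    for (int i = 0; i < MAX_SEARCH_STEPS; i++) {  /* bounded for the verifier */
        if (lo >= hi)
            break;
        __u64 mid = lo + (hi - lo) / 2;
        if (mid >= MAX_UNWIND_ROWS)               /* explicit bounds check */
            break;
        if (shard->rows[mid].pc <= pc) {
            found = (long)mid;                    /* candidate; search right */
            lo = mid + 1;
        } else {
            hi = mid;                             /* search left */
        }
    }
    return found;
}
```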
Then we get the unwind action: which operation do we need to apply to the current registers to recover the previous ones? We add that frame to the stack trace and continue with the next frame, as Vaishali explained before. We have the luxury of knowing whether the stack is correct, because when we have no unwind information and RBP is zero, the x86-64 ABI specifies that we've reached the bottom of the stack. So if the stack is correct, we hash the addresses, add the hash to a map, and bump a counter. This is reasonably cheap, and I will show you some data on it later.
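That aggregation step could look roughly like the following sketch; the map names, sizes, and the hash_addresses helper are assumptions:

```c
/* Aggregation maps: unique stacks keyed by hash, and per-stack counts. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 16384);
    __type(key, __u64);                    /* stack hash */
    __type(value, __u64[MAX_STACK_DEPTH]); /* the addresses themselves */
} stack_traces_map SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 16384);
    __type(key, __u64);
    __type(value, __u64);                  /* sample count */
} counts SEC(".maps");

static __always_inline void aggregate(struct unwind_state *state)
{
    __u64 stack_hash = hash_addresses(state->addrs, state->len); /* e.g. FNV-1a */
    bpf_map_update_elem(&stack_traces_map, &stack_hash, state->addrs, BPF_ANY);

    __u64 *count = bpf_map_lookup_elem(&counts, &stack_hash);
    if (count) {
        __sync_fetch_and_add(count, 1);  /* existing stack: bump the counter */
    } else {
        __u64 one = 1;
        bpf_map_update_elem(&counts, &stack_hash, &one, BPF_ANY);
    }
}
```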
Then every couple of seconds, I think it's every 10 seconds or so, we collect all this information from user space and generate the actual profiles that we send to some server. As I said before, BPF has some interesting challenges for us. I think it's the closest I've been to coding like it's the 90s or 80s, because we have very little stack space: 512 bytes, if I'm not mistaken. To overcome that, we use BPF maps as a sort of heap, as sketched below.
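The usual pattern for this, and what we mean by a heap, is a single-entry per-CPU array acting as scratch memory; the struct contents here are assumptions:

```c
/* Unwinder state that is far too big for the 512-byte BPF stack. */
struct unwind_state {
    __u64 ip, sp, bp;
    __u64 len;
    __u64 addrs[MAX_STACK_DEPTH];
};

/* One entry per CPU: effectively a pre-allocated scratch buffer. */
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct unwind_state);
} heap SEC(".maps");

static __always_inline struct unwind_state *get_state(void)
{
    __u32 zero = 0;
    /* The verifier still requires a NULL check on the result. */
    return bpf_map_lookup_elem(&heap, &zero);
}
```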
Then there's the problem I mentioned before about memory locking: that memory can never be swapped out, and it lives in kernel space. So we want to make sure we allocate the minimum space needed, and we need to do it properly, because every single environment has a different cgroup configuration, and as some talks explained yesterday, it's quite tricky to know how much memory your machine actually has available. For the program size, there are two main issues. One is that there's a limit on the number of instructions you can load into the kernel. The other is that the BPF verifier, the machinery that makes sure your program is safe, that it will finish, that you're not dereferencing any null pointers, and that in general you're not going to crash the kernel, has a limit on the number of iterations it performs internally. This is a problem for us, because a full binary search already hits these limits. So we need to use techniques like BPF tail calls, which are similar to Lisp tail calls. We also use bounded loops. And if you're lucky (we are not), you can use the new bpf_loop helper, basically a function that invokes a callback multiple times, creating a sort of loop in BPF; but we cannot use it, because we want to support older kernels, and it's a pretty new feature.
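A tail call chains one BPF program into another without returning, which is how work gets split past the verifier's complexity budget; a minimal sketch:

```c
/* Program array; user space places this program's fd at index 0. */
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 1);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} programs SEC(".maps");

SEC("perf_event")
int walk_user_stack(struct bpf_perf_event_data *ctx)
{
    /* ... unwind a bounded number of frames, persist progress in the
     * per-CPU heap map ... then re-enter this program to continue.
     * bpf_tail_call() does not return on success, and the kernel caps
     * the chain length, which bounds the total work. */
    bpf_tail_call(ctx, &programs, 0);
    return 0;  /* reached only if the tail call fails */
}
```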
So let's switch to something else. We have written our user-space application in Go, and since we are a profiler, we want the overhead on your machine to be as low as possible. Unfortunately, many Go APIs aren't designed with performance in mind. I am new to Go; I didn't know it was like this, and every single time I profiled our profiler and found these things, I thought: how? How is this possible? Go has DWARF and ELF parsing libraries in the standard library, which is great, but they are not designed for performance-sensitive environments, let's say. There are also two functions, binary.Read and binary.Write, that we use all the time, because we need to convert bytes back and forth, and they allocate in the fast path. But anyway, we profile our profiler all the time, we have found lots of opportunities that we keep fixing, and of course there's more work to do.
One of the areas where we try to be pretty comprehensive is testing. We have thorough unit testing for most of the core functions to make sure we don't regress. But the thing that, in my opinion, has helped us the most is snapshot testing. If you're not familiar with this technique, it's very simple: you generate some textual representation of your data structures and write it to disk, or keep it in memory, it doesn't matter; then, after making changes to your code, you generate it again and compare the two. This is how it looks in our case. We have a Git sub-repository called testdata, with the textual representation of the unwind tables. You don't have to understand it all, but the idea is that this covers a full function, whose program counter starts at the one over there and ends at the one over there. Then we have the information for every single program counter, and it tells you what to do at each one. The first row says CFA type two, which I know means RBP: take the current RBP, add eight, and that gives you the previous frame's stack pointer. Anyway, the interesting thing is that this is very easy to implement, as you can see from our very advanced setup in our Makefile: we build our binary, dump the tables to disk, and ask Git to show us the changes; if anything has changed, we fail. Thanks to this, we have found a lot of bugs,
and it has allowed us to iterate with confidence. One of the important things in this project has been de-risking it, because it's quite complex. When I started working on this, I had no idea about DWARF unwinding; I had no idea about unwinding without frame pointers at all. So we had to make sure all these avenues were properly covered: that the DWARF parser was properly implemented, that all the interactions with BPF were covered, and that the BPF unwinder itself worked well. For this, we always tried to have a plan B at every stage of the project, and we tried to go in depth as well as in breadth. Anyway, I apparently have five minutes left. So: we have a lot of automated testing, and one of the things we did was add kernel tests, which is super important, especially for BPF programs, because the BPF subsystem changes a lot over time, and there are features we need to make sure we don't use, because otherwise the unwinder wouldn't work on other kernels. So we have a kernel testing system that basically runs our application on multiple kernels and reports the result. One more thing I want to talk about: production, as usual, brings a lot of interesting challenges. By deploying our profiler to production, we found a lot of things we didn't know about, and we were able to find some of them thanks to continuous profiling, running our own profiler on our profiler.
As you know, different hardware and different configurations are the biggest sources of performance differences, as well as of incidents, in production. So I want to show you two things we found recently. One of them: we were spending almost 30% of CPU time opening files in our production environment, something that never showed up on my NVMe drive. The reason, it turns out, is that cloud disks are very slow. We have now fixed this. Another very interesting thing we fixed the other day happened when we rolled out our profiler to production and it started crashing. If you are interested, we will upload the slides, so feel free to check the pull request, because everything is open source. Basically, what happened was that, for reasons, Go has a signal-based profiler, and we have it enabled for even more reasons, and it was only enabled in production. So SIGPROF was interrupting our program's execution while we were trying to load the BPF program. The BPF program takes a little while to load, because the verifier has to run a bunch of algorithms to statically ensure that everything is safe, and it was getting interrupted all the time. libbpf, the library we use to load the BPF program, retried this up to five times until it basically said: I tried, this didn't work, sorry. And obviously, we need the BPF program to be loaded for anything to work.
There are many other considerations in this project, like short-lived processes, which we haven't optimized for but are still pretty decent at: if your program runs for one second, we're probably going to catch it. But if this is something you care about, feel free to message us; it's something we would optimize. And then, yeah, this is our current format. I probably have one minute left or something like that. You don't have to understand it all, but the point is that we represent every single row with two 64-bit words, and we are working on making it a bit smaller. This is how our size compares to DWARF: we are bigger, because DWARF is optimized for disk space, while we are optimized for raw speed. For example, our whole table for one shard pretty much fits in L2 cache. Do I have any more time? Probably not, right? Two minutes? Oh, okay, sorry, maybe I sped up too much. So: we need to support parsing
every single DWARF CFI opcode, because otherwise we wouldn't be able to make progress. But we cannot unwind from every single program counter, which sucks, though this is not a problem in practice. The most typical way to recover the previous frame's stack pointer (which is called the CFA in DWARF, but that doesn't matter) is that you are given a register to which you apply some offset, and that yields the previous frame's stack pointer. We support that, but the problem is that it could be any arbitrary register, and right now we only support RBP or RSP offsets, which cover 99% of the cases. So this is something that we're going to work on soon.
The other problem, as Vaishali said before, is that DWARF expressions define a VM that you need to implement and that can encode more or less any expression. [Audience: it's not Turing-complete.] Not exactly Turing-complete, but it's almost there; that's why, in the Infinity project, they had to add a new opcode. You need to implement a VM that basically has a set of virtual registers, plus a stack machine. Well, we can talk about the details later: the first level is a stack machine, 100%, but the second level, I can show you our code; it's messed up. Like, it's messed up.
But anyway, the thing is that we are very lucky here, and you can read more about this in this PR. There are two DWARF expressions that account for 50% of all the expressions we have seen in most distributions. They are the expressions the dynamic linker needs, basically: the expressions for procedure linkage tables, or PLTs. The other good news, as I said before, is that offsets from registers other than RBP and RSP rarely occur, and all the other possibilities I haven't talked about almost never occur; we've seen them very, very few times. [Audience: which architectures are you supporting?] Oh, good question. You must be playing with ARM64, because the GCC ARM64 backend generates CFA expressions; that's what I was talking about. Right now we only support x86-64, but I'm also going to talk about this later. Sorry, sorry. Anyway. Okay, done? Okay, well. Wait, we have the minutes buffer for the next talk, right? Five minutes. Five minutes. Okay. I have two more slides.
Well, anyway, we tried to make our BPF program as fast as possible. This was running on my machine with a bunch of applications that have 90 frames or more, and even the maximum time it takes is 0.5 milliseconds, which is not terrible on my CPU, which is from late 2017. This is in big part because we have optimized everything for memory: everything is aligned properly, and we try to fit as many things as possible in the CPU cache. What about high-level languages? There is a project that I happen to work on called rbperf. This is something we're going to be adding
in the future. Basically, for dynamic languages, you need to know the ABI of every interpreter version, and the stack walkers are also implemented in BPF. But instead of collecting return addresses, because the return addresses there are not meaningful to you, you directly extract the function names and other information from the process heap. Our project is called Parca. There are a couple of things we're going to be doing, like mixed-mode unwinding, which as far as we know no one else does in profiling (debuggers do it, for sure). The idea is that different sections of the stack will be unwound using different techniques. So, for example, if you have a JIT, like Node.js, that emits frame pointers, you unwind that part with frame pointers; but once you reach the actual code of your interpreter, which is compiled and has unwind information, we use the DWARF unwind information. ARM64 support is coming later this year. This feature is currently disabled by default, but it is stable enough that we're going to enable it in a month. And then we're going to add other runtimes, such as the JVM or Ruby. And then just to say that we are open source: the user-space code is Apache 2.0, and the BPF code is GPL. And yeah, all the links are here, and thank you so much.