Configuration-Driven Event Tracing with TraceLeft and eBPF
Formal Metadata
Title: Configuration-Driven Event Tracing with TraceLeft and eBPF
Series: All Systems Go! 2018, talk 48 of 50
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/43090
Language: English
Transcript: English (auto-generated)
00:06
Today it's me, Suchakra, and this is Alban, and we're gonna present TraceLeft, a configuration-driven eBPF tracing framework that was done by ShiftLeft together with Kinvolk,
00:22
so I'm from ShiftLeft and this is Alban from Kinvolk, and we did this together, and we'll see what all this is about. So, I'm Suchakra, I'm a staff scientist at ShiftLeft, this is some information about me,
00:41
you can follow me at @tuxology if you want. I did my PhD in the DORSAL lab at Polytechnique Montréal, I love tracing and I love performance analysis, and... Hello, I'm Alban, I'll just say I love Kubernetes, I love Linux development, I'm the CTO at Kinvolk. And I will say just a couple of words about Kinvolk.
01:07
So I guess, since you are at All Systems Go!, maybe almost all of you know Kinvolk, but we are a software development team working on Linux and Kubernetes, and we love this kind of thing.
01:24
And something about ShiftLeft: we are a continuous-security company for cloud-native applications. We try to provide static analysis and carry it forward all the way to runtime, so we basically kind of predict what your applications do. So what's the agenda for today?
01:42
So we are gonna talk about TraceLeft: some background about it, and some background about tracing actually, just to give you an idea of what we are dealing with here. Then the architecture of TraceLeft; there is trace configuration, because this is configuration-driven, so you can actually do configuration: how the configurations are represented,
02:01
how the events are taken out of the configuration, and, because it's based on BPF, some background about what eBPF is; we already have Alexei somewhere here, so I'm already scared about this. And then use cases where we are using it, and then, most importantly, some challenges that we faced, which Alban is gonna discuss,
02:22
and the future work that could be done in this regard. So I'll start off with the first half of the presentation, and Alban takes over eventually. To give you a little bit of background about tracing: how many of you have used tracing or any kind of performance analysis framework in real life?
02:41
Okay, a lot of people, that's super awesome. So you may be using it in different kinds of ways; one of the most common ways to use tracing is throughout your stack, actually. I've tried to differentiate this using this small diagram. You must have been hearing about OpenTracing,
03:01
Jaeger, all these new frameworks; these all fall under the distributed tracing category. This is all actually a gradient, I would say, it's not a set of distinct categories, it's mostly a gradient, but you can see some differences there. For example, in distributed tracing, you would get information about what's flowing from one service to another microservice
03:21
to another microservice. You may get information about individual functions that are inside each of the microservices and how they're communicating, which falls into the category of application tracing. The moment you go a little further down, you can even know what was happening inside the application from the infrastructure level. So this is where you can see
03:41
what was happening inside the OS when a given function was being executed in user space. So this is kind of what tracing is. It's very different from other ways of doing performance analysis in the sense that it gives you the exact, true flow of an application, and that's why we actually call it tracing.
04:00
It operates on very high-frequency events that are generated, such as syscalls, interrupts, and scheduling events, which come from the operating system, so we need it to be super high-performance as well. The basis of tracing is instrumentation. I'll explain it a little bit later,
04:22
and it's used for performance analysis as well as security. So on the same basis of instrumenting a specific function, you can either use it for performance analysis or for security. A very fine example which I keep giving people of what tracing is: think of your program as this bike which is running,
04:42
and you spray some paint on the tires of the bike, which is how you instrument your application. And each individual point where you just sprayed is basically a tracepoint. So as you start running your bike, you are generating events, and you get an actual trace on the road.
05:03
Based on that, you can find out where it was recorded; for example, these traces will give you the exact time at which an event happened. So you can think of it visually like that. Tracing can be static or dynamic. A lot of static tracing infrastructure
05:21
is already there in the kernel, or, if you're writing your own user space applications, you can instrument them yourself. For example, kernel tracepoints with perf, ftrace, eBPF: they all support static tracing now. If you're writing your own applications, you can have compile-time instrumentation embedded. I don't know how many of you have used certain flags in GCC, like the cyg_profile hooks (-finstrument-functions)
05:42
or -pg, profile-generating instrumentation; you may be able to use that. There are other tools like LTTng which provide this, and USDT. So by default you have a lot of tracepoints already there in the JVM, in the Python interpreter, as well as the Ruby interpreter. Dynamic tracing is more awesome than static tracing,
06:02
I would say, because your application keeps on running: you can just dynamically insert a probe at any point and start looking at what's coming out of your application. By application, I also mean the kernel. So the kernel also provides dynamic tracing infrastructure in the form of kprobes and kretprobes. If you're in user space, you can write your own infrastructure
06:21
by using dynamic instrumentation tools, such as Intel Pin and similar binary instrumentation frameworks. Uprobes are also there: you can dynamically instrument with the help of the kernel and a user space application, and then get information out of each function's execution. There used to be DTrace; I think there is still DTrace on BSD and Mac,
06:42
but I have not used it very thoroughly. It closely resembles what eBPF provides in tracing these days. So, to move very quickly: code instrumentation. You want to know about this function, so you insert some call, say call_me_maybe(); when the function gets executed, call_me_maybe() gets executed.
07:00
You collect some data, and you fill your data with whatever you want. It can be timestamps if you are looking at performance; and if you don't take timestamps, you just look at individual events, which comes into the domain of auditing and security. In the kernel, as an example of this, the kprobes-based instrumentation
07:22
is provided like this: you have a kernel function which is patched. The first instruction gets patched so that it jumps to another area, which is called a trampoline, where you can save your registers and call the pre-handler,
07:40
which basically collects whatever runs at the pre-handler. There are multiple collectors that can run there; one of them is eBPF, and that's what we are using in our whole infrastructure. Then you restore the registers, the original instruction which got displaced gets executed, and you jump back to normal execution. The actual thing is more complex than that, but to simplify, I have explained it like this.
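For illustration, here is a minimal sketch of the compile-time instrumentation mentioned earlier (GCC's cyg_profile hooks via -finstrument-functions). It is a hypothetical standalone example, not part of TraceLeft; it just shows the "insert a call that runs on every function execution" idea in its simplest form:

```c
/* instrument.c - minimal compile-time instrumentation sketch.
 * Build with:  gcc -finstrument-functions instrument.c -o instrument
 * GCC inserts calls to the two __cyg_profile hooks on every function
 * entry and exit; the hooks themselves must not be instrumented. */
#include <stdio.h>

void __cyg_profile_func_enter(void *fn, void *caller)
        __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *fn, void *caller)
        __attribute__((no_instrument_function));

void __cyg_profile_func_enter(void *fn, void *caller)
{
        fprintf(stderr, "enter %p (called from %p)\n", fn, caller);
}

void __cyg_profile_func_exit(void *fn, void *caller)
{
        fprintf(stderr, "exit  %p\n", fn);
}

static int add(int a, int b)
{
        return a + b;
}

int main(void)
{
        printf("2 + 3 = %d\n", add(2, 3));
        return 0;
}
```

Tools such as LTTng's lttng-ust-cyg-profile helper build on the same hooks; kprobes do the equivalent at run time inside the kernel, as described above.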
08:03
So what is eBPF? Yesterday we got a very good definition; I'll give you one more definition from my perspective. It's stateful, programmable, in-kernel decision-making for networking, tracing, and security. That's how the user space folks understand BPF. Maybe the kernel folks have different opinions about this,
08:22
but I want this interface from the kernel to user space to be so seamless that it becomes the one ring to rule them all for networking, tracing, as well as security. So just a small intro; I'll go through it very quickly, because we have seen it in previous talks.
08:41
Classic BPF has been there from 1993 onwards; it was used for network packet filtering. Sometime later, seccomp BPF programs were also added, so that you could actually do syscall filtering with it. It was a small in-kernel VM, very, very small, with easy-to-use bytecode.
09:02
And then it was eventually extended as eBPF, with more registers and a better verifier. You can attach it to tracepoints, kprobes, uprobes, USDT, whatnot. I'm more interested in tracing, so I'm just focusing again and again on that. You can also use it for network packet filtering
09:21
or many other use cases that have already been discussed. There's a new syscall, so you can control it via the bpf() syscall. There's trace collection with BPF maps, or you can take the data directly to the trace pipe, which already exists in the kernel.
09:41
I mean, this facility is already provided; it was upstreamed in 3.18. There is bytecode compilation, which is also upstream in LLVM, so if you're using Clang/LLVM, you by default have a target for which BPF bytecode can be generated. So a program looks something like this: there's a BPF program, you compile it with Clang/LLVM,
10:02
and then you can directly insert it into the kernel using the bpf() syscall. It gets verified, and then native code gets generated for the architecture on which you're running it. You can design the program in such a manner that it hooks onto kernel functions. The data can be shared between user space and kernel using BPF maps.
10:21
With kprobes, it's exactly the same thing, but it's attached to a kprobe, and you can use BPF maps to read, update and share the data you collect; and the events that come out of each execution of the program can either be given to the trace pipe
10:42
or a perf buffer, and then you can build your infrastructure on top of this. So we use this as the base for TraceLeft. An easier example of how a BPF program looks: it's in restricted C syntax, and this is actually how it looks underneath.
11:01
So every time your kernel function gets hit, this program is gonna get executed. Some of the things it has: you can see some helper functions here, like bpf_get_smp_processor_id(), which gives you which CPU it's running on right now, and the current PID of the current process; you can get these things,
11:21
and then from here, because you have the context, which is actually all the register values at the moment the kernel function was hit, you have all those registers ready, and there are simple helpers to get the arguments out of them if you are following the calling conventions
11:41
on whatever architecture it is. So you get these values, you can then build your own event; an event is stored in an event structure, and then you can output it to a perf buffer. The event definitions use maps, which are the way you can share data between user space and the kernel.
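The slide itself is not part of the transcript, so here is a rough sketch of what such a kprobe program, its event structure and its maps can look like in restricted C. It assumes iovisor/gobpf-style section naming and declares the helpers it uses by hand; the map-definition layout has to match whatever loader is used, and the actual TraceLeft handlers (which are generated) differ in detail:

```c
/* probe.c - hypothetical sketch of a kprobe BPF program (not TraceLeft's code).
 * Compile with:  clang -O2 -target bpf -c probe.c -o probe.o */
#include <linux/types.h>
#include <linux/bpf.h>
#include <linux/ptrace.h>

#define SEC(NAME) __attribute__((section(NAME), used))

/* BPF helpers are called through well-known IDs from <linux/bpf.h>. */
static __u64 (*bpf_ktime_get_ns)(void) =
        (void *) BPF_FUNC_ktime_get_ns;
static __u64 (*bpf_get_current_pid_tgid)(void) =
        (void *) BPF_FUNC_get_current_pid_tgid;
static __u32 (*bpf_get_smp_processor_id)(void) =
        (void *) BPF_FUNC_get_smp_processor_id;
static int (*bpf_perf_event_output)(void *ctx, void *map, __u64 index,
                                    void *data, __u64 size) =
        (void *) BPF_FUNC_perf_event_output;

/* Event structure shared with user space. */
struct event {
        __u64 timestamp;
        __u64 pid;
};

/* Map definition as understood by a gobpf-style ELF loader. */
struct bpf_map_def {
        unsigned int type;
        unsigned int key_size;
        unsigned int value_size;
        unsigned int max_entries;
};

/* Perf event array: one ring buffer per CPU, read by the tracer. */
struct bpf_map_def SEC("maps/events") events = {
        .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
        .key_size = sizeof(int),
        .value_size = sizeof(__u32),
        .max_entries = 128, /* >= number of CPUs */
};

SEC("kprobe/SyS_open")
int kprobe__sys_open(struct pt_regs *ctx)
{
        struct event evt = {
                .timestamp = bpf_ktime_get_ns(),
                .pid = bpf_get_current_pid_tgid() >> 32,
        };

        /* Emit the event on the current CPU's perf ring buffer. */
        __u32 cpu = bpf_get_smp_processor_id();
        bpf_perf_event_output(ctx, &events, cpu, &evt, sizeof(evt));
        return 0;
}

char _license[] SEC("license") = "GPL";
```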
12:02
For example, there is the structure for this specific event, which is also stored via maps, and then you can output them. Which brings us to TraceLeft. It's open source, you can go and look at what TraceLeft is and how it's designed. It's a framework to build syscall, network,
12:21
and file auditing and monitoring tools. It's a work in progress, I should tell you this beforehand; there's a lot of stuff that can be done here, and you're welcome to contribute. It's eBPF- and kprobe-based, and it has been tested to work on kernel 4.4 and up, till 4.16; 4.18 is not working right now, I tried it, but we have some patches we are working on,
12:44
probably tomorrow they'll be fixed. Also, it has a binary called traceleft, which is a reference implementation of the framework itself. The main goal here is that there is a single binary, plus a battery of what you want to trace: for example, which syscalls you want to trace,
13:02
which events you want to trace. You just have a single binary and this whole battery, you put them onto any system for which that single binary and that battery have been built, and it's going to generate events. There is no need for BCC or for any other library to be on the system on which you are running.
13:21
So it's like a pre-generated thing, a very targeted tracing thing, as I try to call it. This was our use case, which we used internally at ShiftLeft: for one specific machine, we just built this battery of what we want to trace, we built this binary, and we just deployed it,
13:41
and it starts generating events, and we just save them. So everything is compiled based on a configuration you give, both for compiling the whole small binary that you have and for the battery of events that you want to save. So why? Because tracing that just works, I mean, obviously.
14:02
This is a high-level view of the architecture. You want to trace an application's flow of certain calls. There is a main BPF program which actually puts kprobes on all the functions that you want to monitor; for example, we have just syscalls here, and then the data is sent to the program's maps.
14:23
I'll go into the details of each of these sections later. TraceLeft controls all of them, and then there are specific event handlers for each of them, so there's just one program already there which actually calls individual eBPF programs for each of the events that we want.
14:42
This is what it looks like a little bit deeper. So there's a specific map, there are kprobes and kretprobes, and based on each individual event, the specific event handler eBPF program is called via tail calls. So there's one base main eBPF program and then it makes tail calls to individual
15:01
small eBPF programs, which each generate a single event for a single probe that you have put either on a syscall or on some other network event. It puts all of them in a perf map, and TraceLeft keeps reading from that perf map.
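A minimal sketch of that tail-call dispatch, reusing the conventions and helper-declaration style of the earlier sketch (the slot numbering and program names here are hypothetical; the real generated programs are more involved):

```c
/* Program array holding one handler program per traced event;
 * user space loads the handlers and fills in the slots. */
struct bpf_map_def SEC("maps/handlers") handlers = {
        .type = BPF_MAP_TYPE_PROG_ARRAY,
        .key_size = sizeof(__u32),
        .value_size = sizeof(__u32),
        .max_entries = 64,
};

static void (*bpf_tail_call)(void *ctx, void *map, __u32 index) =
        (void *) BPF_FUNC_tail_call;

SEC("kprobe/SyS_read")
int kprobe__sys_read(struct pt_regs *ctx)
{
        __u32 slot = 0; /* hypothetical fixed slot for the read handler */

        /* Jump into the per-event handler program; on success this does
         * not return. If no program is loaded at that slot, we simply
         * fall through and do nothing. */
        bpf_tail_call(ctx, &handlers, slot);
        return 0;
}
```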
15:22
There are multiple components to this, and it's not an ideal scenario, because you have a meta-generator which generates Go structures and C structures, since you're generating these events and they have to be stored somewhere. So it looks at the kernel's debug tracing
15:40
events for the syscalls, for the battery of syscall probes, and then it generates those structures directly. And then we have a generator which generates the individual handler programs, which are on the right-hand side there. And then a battery, which is the actually compiled version of all of that. And then there is the actual part.
16:01
You can go to the source code on GitHub and see what each of these components is doing as well. So then you have a probe, which is responsible for registering, and then there is a tracer, which actually starts polling the individual perf maps and giving you the data. And then we have a reference implementation
16:22
of an aggregator as well, which exposes an aggregator API so you can get all the events and aggregate them. A configuration looks something like this. Say you want to trace an open event: it has all these arguments, for example first position, second position, third position, which obviously looks like open.
16:42
It's a per-event configuration: what do you want to collect, which variables do you want to collect at each of the positions? It's just done once; you do it once and update it very rarely for each of the events inside the kernel, you don't have to keep updating it. And then there is this aggregator configuration.
17:01
So you have channels: you can save the data to a log, or you can send it over gRPC. You have events like open; for example, we had open. And then you can set rules (this is still not completely finished), and then you have how you want to aggregate it, so which functions you want to apply when you're aggregating it.
17:20
So these are the two configurations that you provide, and based on them, the events are generated. There's a whole build process which goes on. As I told you, the first stage is the meta-generator, which generates the individual structures, then the source for each individual handler,
17:41
and then the BPF programs, after compiling with Clang. Maybe we can update it later to not call Clang and just use the LLVM API directly. And then it generates your own binary, which is your own implementation. A CLI, for example, looks something like this: this is the reference implementation,
18:01
which actually just tries to trace everything based on the battery. I think Alban can explain a little bit more about this.
18:21
Hello, so I will do a demo. Let's see if it works. I prepared two very short demos, and since I don't remember everything I'm doing, I took some notes. I have two shells, one on the right with this PID, and I will start the traceleft binary.
18:43
And I will tell it to trace, load some specific BPF handlers, and apply this only to this specific PID. So the handlers I load are for the read and write system calls. Let's see what happens when I run that.
19:01
And that should trace the terminal on the right. So if I try to type something, I can see all the read and write system calls. That's the first demo. And while doing that, I only traced this specific PID: I can specify on the command line here
19:21
not to trace all the system calls from the whole system, but only for this one. Let me start the second demo. For this demo, I prepared a script. It's a very simple script that starts a TCP connection.
19:42
So after a couple of seconds, it starts a TCP server and a TCP client and connects them to each other. And I will ask TraceLeft to trace the BusyBox shell running the script. So let me start the script.
20:04
And then I start the tracer. After a few seconds, I should see the TCP connect event with specific information attached to it, like the connection tuple, TCP source, destination, et cetera. And if I stop this, I should see the TCP close event.
20:22
So as you can see here, I see connect and close, but I don't see the TCP accept event. That's because I only traced one specific process, the process of the shell script here, and it didn't trace the other one.
20:41
I have another demo, with the help of Suchakra. Here, I'm logged in on a web server. I just started TraceLeft; instead of specifying read or write, I just pass the whole networking battery of BPF programs that we have, so it does the same thing.
21:04
Yeah, and I'm just opening my own website. And we can see from where it connects, and basically trace the network calls which are going on on this server. So this is just a simple server which is running nginx and my own website.
21:21
That's it. So there was one more, more elaborate demo that we did internally based on TraceLeft, where we had our own monitoring agent for syscalls.
21:41
And this was using the aggregator API that is provided by TraceLeft itself, and it looks something like this. I don't have this demo right now, but you can at least appreciate that you can make something as complex as a nice ncurses UI for syscall monitoring based on TraceLeft.
22:03
So I think Alban continues from here, and he discusses something very important that we learned. This is more important than TraceLeft itself, because it shows the challenges we faced, how we overcame some of them, and what else can be done later on. Yes, so a lot of the challenges we faced
22:22
are because we wanted to support kernel 4.4, and some of the issues we faced have been fixed in later kernel versions, but I will explain a bit of the context. So the first challenge I will explain is matching PIDs to applications. The goal of TraceLeft is to have
22:41
some kind of tracing profile for a specific application. And one application can be one or several processes. Sometimes they can be very short-lived processes: if it is a shell script, it starts and stops a lot of processes. And an application can be maybe a systemd unit running inside a cgroup, started by systemd,
23:02
or it could be a container; in that case, it might live in different Linux namespaces and different cgroups. So when we wanted to implement that, we looked at the different BPF helper functions to see what exists there. I just mention a few of them, the ones which mention PID
23:23
or cgroup or namespace. There is the first one, bpf_get_current_pid_tgid(), which gets the current process ID and task ID, and which has existed since kernel 4.2. So that's perfect, because our restriction
23:40
was that it has to work on kernel 4.4. There are some others, like bpf_get_cgroup_classid(), which are not related to tracing, and some others which I put in red because they didn't fit our criterion that it has to work on kernel 4.4. This list comes from a GitHub webpage
24:01
that's very useful as documentation to list all the BPF helper functions and see what you can do with them. So basically, on kernel 4.4, the only thing I could use was to check, from the BPF program, what the PID is
24:22
of the process being traced at the moment. So, as a consequence, the API of TraceLeft is based on that: there is actually a function as part of the API where you have to give which PID you want to trace, and then the BPF program is going to check that.
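Inside the BPF program, that per-PID check can look roughly like this (a hypothetical sketch in the same style as the earlier ones: user space fills a hash map with the PIDs to trace, and the probe bails out early for everything else):

```c
static void *(*bpf_map_lookup_elem)(void *map, void *key) =
        (void *) BPF_FUNC_map_lookup_elem;

/* PIDs to trace, written from user space through the tracer's API. */
struct bpf_map_def SEC("maps/traced_pids") traced_pids = {
        .type = BPF_MAP_TYPE_HASH,
        .key_size = sizeof(__u32),
        .value_size = sizeof(__u32),
        .max_entries = 1024,
};

SEC("kprobe/SyS_read")
int kprobe__sys_read(struct pt_regs *ctx)
{
        /* Upper 32 bits of the helper's return value are the tgid (the
         * PID as seen from user space), the lower 32 bits the thread ID. */
        __u32 pid = bpf_get_current_pid_tgid() >> 32;

        if (!bpf_map_lookup_elem(&traced_pids, &pid))
                return 0; /* not a process we were asked to trace */

        /* ... collect the event and emit it to the perf map ... */
        return 0;
}
```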
24:41
Of course, when we build the whole framework using TraceLeft, we want to match the application and not only one single PID, so we need to use something else in addition to that. At the time, we implemented that using a Linux facility called the proc connector.
25:02
So the proc connector is part of the Netflix socket family. How many of you know what Netflix is? Sorry, netlink. About everybody, cool. So using netlink and the connector,
25:21
you can receive... it's a publish-subscribe mechanism where you can receive some events, and in this case we can get a notification whenever there is a fork or an exec of a new process. So we can know whenever there is a new process, and maybe that's a process that belongs to the application we want to trace.
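For illustration, a rough user-space sketch of subscribing to the proc connector over netlink; error handling is omitted, the structures come from <linux/connector.h> and <linux/cn_proc.h>, and this is not TraceLeft's code:

```c
/* proc_conn.c - listen for fork/exec notifications via the proc connector. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

int main(void)
{
        int sk = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);

        struct sockaddr_nl addr = {
                .nl_family = AF_NETLINK,
                .nl_groups = CN_IDX_PROC,
                .nl_pid    = getpid(),
        };
        bind(sk, (struct sockaddr *)&addr, sizeof(addr));

        /* Ask the kernel to start multicasting process events. */
        struct {
                struct nlmsghdr nl;
                struct cn_msg cn;
                enum proc_cn_mcast_op op;
        } __attribute__((packed)) req = {0};
        req.nl.nlmsg_len  = sizeof(req);
        req.nl.nlmsg_type = NLMSG_DONE;
        req.nl.nlmsg_pid  = getpid();
        req.cn.id.idx = CN_IDX_PROC;
        req.cn.id.val = CN_VAL_PROC;
        req.cn.len    = sizeof(enum proc_cn_mcast_op);
        req.op        = PROC_CN_MCAST_LISTEN;
        send(sk, &req, sizeof(req), 0);

        /* Each received message carries a struct proc_event. */
        char buf[4096];
        while (recv(sk, buf, sizeof(buf), 0) > 0) {
                struct nlmsghdr *nl = (struct nlmsghdr *)buf;
                struct cn_msg *cn = NLMSG_DATA(nl);
                struct proc_event *ev = (struct proc_event *)cn->data;

                if (ev->what == PROC_EVENT_EXEC)
                        printf("exec: pid %d\n", ev->event_data.exec.process_pid);
                else if (ev->what == PROC_EVENT_FORK)
                        printf("fork: child pid %d\n",
                               ev->event_data.fork.child_pid);
        }
        return 0;
}
```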
25:42
That's something that is quite old, so it works fine on Linux 4.4. This proc connector has quite some strong limitations, so it's not really perfect, but it works okay. It doesn't really work in a container:
26:01
it has to run in the initial user namespace, in the initial PID namespace, et cetera, and it requires network privileges, which is a bit weird when you are tracing processes. Also, it doesn't give all the information: it doesn't give the cgroup or namespace information, so it means that whenever we use this,
26:22
we have to, in addition to that, read /proc to get the additional information we wanted. But then, reading from two different sources of information brings some race conditions, which are a bit difficult to solve and sometimes not directly solvable.
26:40
For example, with short-lived processes, which happen quite often with shell scripts, the tracer process might not have the time to read the information we need from /proc. So I would recommend not to use the proc connector for that; but that's what we did, and it
27:01
came from the limitation we had that it has to work on kernel 4.4. Now there are new BPF helper functions that are more suitable: for example, we can get the cgroup where we are running (bpf_get_current_cgroup_id()), which exists since kernel 4.18.
27:25
And in general, I would recommend using the new facilities, or, if they don't exist, trying to improve the kernel. Another difficulty we had was related to strings in eBPF. I'll take this example. Let's say in your user space application, in your program,
27:44
there is this open system call where you pass the file name as a parameter. And since we added a kprobe with BPF on the open system call, we will at some point have this BPF helper function that gets executed
28:01
and reads the file name, so it will copy the buffer, the string buffer. And then, when the syscall is actually executed in order to implement the open system call, the kernel will copy the same buffer for the file name again. So there are two copies here.
28:20
This brings some things which are not perfectly fine. That's a time-of-check-to-time-of-use issue: if the program is multithreaded and changes the value of the buffer in user space, we might not see what's really going on there. There are other issues with strings.
28:41
Before, since we were running on kernel 4.4, we didn't have this nice helper function which can copy strings (bpf_probe_read_str()). So what we did instead was arbitrarily copy 256 bytes, which quite often is enough for a file name, but sometimes it's not. And we had other issues. Let's say this here is
29:02
the virtual memory of one process. You have some regions which are mapped in memory and some addresses which don't map to any physical memory. And if you give a pointer quite close to the border of a mapped region, maybe you will not be able to read 256 bytes,
29:22
because we would go outside of that region. That could cause a fault, which is fine in BPF because all the BPF helper functions that you use are correctly designed not to crash your kernel, but still, that caused some surprises when developing this.
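A sketch of that fixed-size copy next to the newer string helper. The event layout is hypothetical, and PT_REGS_PARM1() is assumed to come from the usual BPF tracing helper headers (it simply extracts the first function argument from the saved registers); the other declarations follow the earlier sketches:

```c
static int (*bpf_probe_read)(void *dst, __u32 size, const void *src) =
        (void *) BPF_FUNC_probe_read;

struct open_event {
        __u64 timestamp;
        __u64 pid;
        char filename[256];   /* the arbitrary 256-byte copy */
};

SEC("kprobe/SyS_open")
int kprobe__sys_open(struct pt_regs *ctx)
{
        struct open_event evt = {
                .timestamp = bpf_ktime_get_ns(),
                .pid = bpf_get_current_pid_tgid() >> 32,
        };
        const char *filename = (const char *) PT_REGS_PARM1(ctx);

        /* Kernel 4.4: blindly copy 256 bytes of user memory. This truncates
         * long paths and can fail if the string sits near the end of a
         * mapped region, because the copy runs past it; the helper then
         * just returns an error instead of crashing the kernel. */
        bpf_probe_read(&evt.filename, sizeof(evt.filename), filename);

        /* On kernels >= 4.11, bpf_probe_read_str() stops at the NUL byte
         * and returns the copied length, avoiding the over-read:
         *   bpf_probe_read_str(&evt.filename, sizeof(evt.filename), filename);
         */

        /* ... emit evt to the perf map as in the earlier sketch ... */
        return 0;
}
```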
29:43
Another challenge, which is not really solved properly here, is about identifying files. When we read or write on the file system, we use the read and write system calls and we pass file descriptors.
30:00
But it's not so easy to track which file descriptor matches which file name. To do that properly, we would need to track that this file descriptor belongs to this file, and so on. But here in this example, if you use that system call,
30:21
then that makes it a bit more complicated. And actually, processes can get file descriptors from different sources. There are of course the open or openat system calls, but there are a lot more places where you can get new file descriptors; for example, you can receive one over Unix sockets,
30:42
and that's not easily traceable. Another issue: even for the open system call, where we are given a string for the file name, it's not so easy to map that string to the actual file, with the mount ID
31:01
and the inode number and so on. That's because this path is going to be looked up taking into account the mount namespace you are in, the chroot, the root directory you use, or, if it is a relative path, the current working directory, and so on. And in the middle you can have a lot of symlinks, which makes things complicated.
31:22
So we don't have a proper implementation of that at the moment; we have something which works in some cases, but not everything. All of this typically comes from the fact that we put the kprobe on the open system call. That's maybe too high-level, where we only get the file name.
31:40
There are other projects, like the Landlock Linux security module, which try to do that at a lower level in the kernel, where they use Linux security hooks and actually have access to the proper kernel objects, and they can track the inodes, mounts, et cetera. So that would be something to explore,
32:02
to do something similar. Tracking networking was a bit difficult as well. When we have a connect system call, we can see the destination IP in the system call, but we don't have the full connection tuple there.
32:24
And so what we did is we added a few more kprobes on specific functions in the kernel to get the information we need. That source is really similar to, and comes from, another project, from Weave Scope, where we did similar work.
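A sketch of that approach, loosely modelled on such connection-tracking probes (hypothetical names; the actual TraceLeft and Weave Scope code differs): stash the socket pointer when tcp_v4_connect() is entered and read the tuple when the call returns, since only then is it fully filled in. It assumes kernel headers plus the declarations from the earlier sketches:

```c
#include <net/sock.h>

static int (*bpf_map_update_elem)(void *map, void *key, void *value,
                                  __u64 flags) =
        (void *) BPF_FUNC_map_update_elem;
static int (*bpf_map_delete_elem)(void *map, void *key) =
        (void *) BPF_FUNC_map_delete_elem;

/* Socket currently being connected, keyed by pid/tgid of the caller. */
struct bpf_map_def SEC("maps/connecting") connecting = {
        .type = BPF_MAP_TYPE_HASH,
        .key_size = sizeof(__u64),
        .value_size = sizeof(struct sock *),
        .max_entries = 1024,
};

SEC("kprobe/tcp_v4_connect")
int kprobe__tcp_v4_connect(struct pt_regs *ctx)
{
        __u64 id = bpf_get_current_pid_tgid();
        struct sock *sk = (struct sock *) PT_REGS_PARM1(ctx);

        bpf_map_update_elem(&connecting, &id, &sk, BPF_ANY);
        return 0;
}

SEC("kretprobe/tcp_v4_connect")
int kretprobe__tcp_v4_connect(struct pt_regs *ctx)
{
        __u64 id = bpf_get_current_pid_tgid();
        struct sock **skp = bpf_map_lookup_elem(&connecting, &id);
        if (!skp)
                return 0;

        /* struct sock lives in kernel memory; copy out the fields we need.
         * skc_dport is stored in network byte order. */
        __u32 saddr = 0, daddr = 0;
        __u16 dport = 0;
        bpf_probe_read(&saddr, sizeof(saddr), &(*skp)->__sk_common.skc_rcv_saddr);
        bpf_probe_read(&daddr, sizeof(daddr), &(*skp)->__sk_common.skc_daddr);
        bpf_probe_read(&dport, sizeof(dport), &(*skp)->__sk_common.skc_dport);

        /* ... build a TCP-connect event from the tuple, send it to the
         * perf map, then clean up ... */
        bpf_map_delete_elem(&connecting, &id);
        return 0;
}
```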
32:44
Another difficult part: sometimes we lost events. I will explain two different reasons why we can lose events. BPF programs run synchronously, I would say.
33:02
A BPF program cannot sleep, cannot wait. When it emits events to user space, we use a perf ring buffer. A ring buffer can be overwritten if it is full: then we just write over it, we don't wait, we don't sleep. So we chose a specific size for the ring buffer,
33:20
and if it is too small, it's possible that we just lose some events. Another reason we can lose events is with kretprobes. Suchakra explained before how kprobes work: we put a jump instruction at the beginning of the function and we go to another routine.
33:43
Kretprobes are a bit similar, but a bit different too. That comes from the fact that we don't know where the return will go; that depends on where the function was called from. So what a kretprobe does is save the address
34:01
it came from, before the function call. But a function can be called several times in parallel: if you have multiple CPUs, if you have a preemptible kernel, then it means you need to save several return positions, and you don't have infinite memory,
34:20
so by default a kretprobe is only able to save so many concurrent calls. And when attaching via BPF, there was a fixed default value for that, which comes from the formula in the kprobes documentation (roughly max(10, 2 * NR_CPUS) on preemptible kernels). And with the example of the accept system call,
34:41
that's where we had the most problems, because the accept system call can take a long time: if you don't have any incoming connections, it can sleep for hours. And if you have several processes running the accept system call, then we will have several kretprobes going on in parallel.
35:01
So we worked with others on the kernel to make it configurable, but still, that doesn't completely solve this issue. Okay, and lastly, what we could do in the future:
35:22
maybe use tracepoints; that was not really an option on kernel 4.4, but it would offer maybe a more stable API than using kprobes, which can change at any point between kernel versions.
35:40
And we have new BPF helper functions that will help to do things more properly as well. And we could use the LLVM API directly instead of forking a shell to do this. So I will hand back to Suchakra. Just some references here.
36:00
There are some projects that have already been done. If you have seen BCC already, you know this. There is bpfd, which is very recent and looks something like what we have in TraceLeft, but in addition it also has a daemon and you can do many more actions with it. There was an older bpfd as well,
36:23
which also looks something like ours. And then bpftrace, which is very promising; it's by Alastair, I think. And the same with ply. Both bpftrace and ply are like languages which look like DTrace, and they directly generate BPF code
36:41
which can be executed. Then Landlock, which is an LSM; it is promising and probably upcoming in the kernel. And then auditd, which clearly resembles what we are trying to do with syscall monitoring here: the audit system is already there inside the kernel, with kauditd alongside it.
37:02
So you can leverage that as well. Some docs and tutorials about BPF, you can look at them later if you want to read something about this. And there has been research work done, obviously, on this. And you can also read about this later. So that's all. You can ask us some questions.
37:20
I would specifically like to thank Kinvolk for working with us on this, and Iago and Michael and everybody else. Thank you. If you have questions, you can just ask.
37:44
Check. Thanks. Do you have some numbers on the overhead of TraceLeft? Some rough idea? Yes, actually TraceLeft has, thanks to,
38:02
I think you did that: there is a pprof endpoint in TraceLeft itself, so you can actually profile TraceLeft as it's running and check what the overhead is. I don't remember the exact numbers, but I have them somewhere; they are somewhere, I think, in the repo itself.
38:22
I will check it. In the repo, there is a documentation directory, and in the documentation directory there is a section called performance and profiling, which tells you how to use it. I would say there are two possible sources the overhead can come from: from the BPF programs, but I think that's quite low,
38:41
I don't have numbers, but other projects have measured that; and from the TraceLeft binary running in user space. I think that's where most of the overhead will take place, and that's where the pprof endpoint will help.
39:05
Thanks.