
High-performance Linux monitoring with eBPF


Formal Metadata

Title
High-performance Linux monitoring with eBPF
Author
Alfonso Acosta
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Extended Berkeley Packet Filter (eBPF) allows for high-performance introspection of the Linux kernel execution. eBPF is widely available (part of the mainline kernel and enabled by most distributions), flexible (any kernel code path can be probed) and safe (driven from userspace and statically verified). In this talk, I will introduce eBPF, explaining how it can be used to track TCP connections in real time. On the way I will demonstrate it is possible to access eBPF from languages other than C (Golang) and remove undesirable runtime dependencies (LLVM compiler and kernel-headers). At Weaveworks we are using eBPF for the connection-tracker of the Weave Scope visualization tool.
Transcript: English (auto-generated)
Hello everyone. My name's Alfonso Acosta. I'm a software engineer at Weaveworks. And today I'm gonna be talking about high performance Linux monitoring with eBPF.
But before I start, I would like to get a feeling for what kind of audience I'm gonna be delivering my talk to. So who knows what tcpdump is? Raise your hands. Everyone, just like I expected. Who knows what BPF is?
Almost everyone. Oh, we have a high level here. Who knows how to code in C? Everyone as well. Okay, great. And who knew what Weaveworks was before coming here today? Okay. All right.
So as I said, I'm Alfonso Acosta. I'm a software engineer at Weaveworks. And we're a startup company whose ultimate goal is to simplify the development and operation of microservice-oriented applications, which are typically containerized.
And we have a software-as-a-service product called Weave Cloud. Among the services of our cloud offering, we offer something called Weave Scope, which, among other things, does visualization of the networking communication between containers.
So in order to do that, we need to track TCP connections in real time. So in this talk, I'm gonna be introducing BPF, or what we today call classic BPF. Then I'm gonna expand on eBPF and what that brings us, in particular for our use case at Weaveworks. And finally, I'm gonna talk about how we've incorporated eBPF in Weave Scope in order to do high-performance monitoring of TCP connections. Even if it's a short talk,
I would like this to be interactive. So please, if at any point in time you have a question, ask it right away. They'll be handing over a microphone, okay? So to start with, I'm gonna be asking you a question. Can anybody tell me what this does?
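[The command on the slide isn't captured in the transcript. Based on the filter described later in the talk, protocol TCP and destination port 80, it was presumably something along the lines of:]

    $ sudo tcpdump 'tcp and dst port 80'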
Everybody has used tcpdump, right? So what does it do? Okay, it dumps HTTP traffic. And furthermore, can you tell me how this works on Linux? Can anybody tell me how this works?
Yeah? Can you speak into the microphone, please? Well, it should try to match traffic on whether or not it's the destination port. Check whether or not it's the destination port. All right.
But yeah, somebody's gonna elaborate on that? Give us the details. What it is: tcpdump opens an AF_PACKET socket to get the raw data. Then it pushes down a BPF program, which is a small program that says, look at offset ten and look for the TCP protocol number, and then go to the other offset. Okay, cool, yeah, that's how it works. So he knows all the details. Basically, how this works is that instead of forwarding every single packet from kernel space to user space and then filtering in user space based on the packet, it happens in the kernel. And we do that through something called BPF. Otherwise, it would be super inefficient, because you would need to forward every single TCP packet to user space and filter it there.
So in 1992, the Berkeley Software Distribution, BSD Unix, introduced something in a paper which is really, really interesting, and I encourage you to read it. They introduced something called the Berkeley Packet Filter. Its goal is to filter packets in the kernel, based on a bytecode program executed by a virtual machine, which runs on every packet of a given networking interface you choose and decides whether the packet is filtered or not. That way, you don't need to pass every single packet to user space and do the filtering there. So it's really efficient.
And just for the sake of showing an example, let's do that here. If I do this, I'm executing tcpdump and, uh-huh, I don't have a network connection. So I'm not gonna be able to show you that. But anyways, yeah, I'll do that later. So in practice, this is how it works. tcpdump uses a library called libpcap, which works on Unix and also on Windows, but on Windows it doesn't use BPF. You pass it something we call a pcap filter, with a syntax which says: hey, I want packets which are protocol TCP and destination port 80. libpcap compiles that expression into BPF bytecode. It's injected into the kernel, and the kernel starts running that virtual machine on every single packet; based on the return value of that bytecode, it will filter the packet and pass it back to tcpdump.
And then tcpdump will dissect it and show it to you. And actually, this is, let's say, the assembly language of BPF. It's a limited virtual machine, but still a virtual machine. Maybe you didn't know this, but if you pass the -d flag to tcpdump, it will output the assembly of the BPF filter which is passed to the kernel. And this is it.
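[The listing itself isn't reproduced in the transcript. For reference, the -d output for 'tcp and dst port 80' looks roughly like this on an Ethernet interface; the annotations are added here, and exact offsets and jump targets vary with the tcpdump version and link type:]

    $ tcpdump -d 'tcp and dst port 80'
    (000) ldh      [12]                    ; load the 16-bit ethertype
    (001) jeq      #0x86dd   jt 2   jf 6   ; IPv6?
    (002) ldb      [20]                    ; IPv6 next-header byte
    (003) jeq      #0x6      jt 4   jf 15  ; TCP?
    (004) ldh      [56]                    ; TCP destination port (IPv6)
    (005) jeq      #0x50     jt 14  jf 15  ; port 80?
    (006) jeq      #0x800    jt 7   jf 15  ; IPv4?
    (007) ldb      [23]                    ; IPv4 protocol byte (14 + 9)
    (008) jeq      #0x6      jt 9   jf 15  ; TCP?
    (009) ldh      [20]                    ; flags + fragment offset
    (010) jset     #0x1fff   jt 15  jf 11  ; ignore fragments
    (011) ldxb     4*([14]&0xf)            ; X = IP header length in bytes
    (012) ldh      [x + 16]                ; TCP destination port (IPv4)
    (013) jeq      #0x50     jt 14  jf 15  ; port 80?
    (014) ret      #262144                 ; accept, up to 262144 bytes
    (015) ret      #0                      ; reject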
We're not gonna go through the code, but basically what this does: the virtual machine has different addressing modes and it has a scratch memory region, but the main memory region, if you compare it to a normal CPU model, is mapped to the packet on every single execution of the VM. And I created a couple of sample programs which do exactly what that tcpdump execution we saw, but coding the BPF filter by hand, so that we can see how it works. I have them locally, so let's look at one.
So what we have here is: we're loading the byte at the ethernet header length plus nine bytes, which gives us the protocol field of the IP packet. Then we compare it against the protocol we want, which is TCP. If it's TCP, we continue checking; if not, we go to the reject section, and so on and so forth.
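[The sample program itself isn't reproduced in the transcript. A minimal, self-contained C sketch of the same idea on Linux, a hand-coded "IPv4 TCP only" classic BPF filter attached to a raw socket via SO_ATTACH_FILTER, would look something like this:]

    #include <stdio.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <linux/if_ether.h>
    #include <linux/filter.h>

    int main(void)
    {
        /* Hand-coded classic BPF: accept IPv4 TCP packets, reject the rest. */
        struct sock_filter insns[] = {
            /* load the 16-bit ethertype at offset 12 */
            BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
            /* not IPv4? skip 3 instructions, to the reject */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, 0, 3),
            /* load the IP protocol byte: ethernet header (14) + 9 */
            BPF_STMT(BPF_LD | BPF_B | BPF_ABS, ETH_HLEN + 9),
            /* not TCP? skip 1 instruction, to the reject */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, IPPROTO_TCP, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, 0xffff), /* accept up to 64 KiB */
            BPF_STMT(BPF_RET | BPF_K, 0),      /* reject the packet */
        };
        struct sock_fprog prog = {
            .len = sizeof(insns) / sizeof(insns[0]),
            .filter = insns,
        };

        /* A raw AF_PACKET socket, like tcpdump opens (needs root/CAP_NET_RAW). */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket"); return 1; }

        /* From here on, the kernel runs the filter on every packet; only
         * accepted packets ever reach user space through recv(). */
        if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
            perror("setsockopt");
            return 1;
        }
        return 0;
    }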
And the same kind of program works on macOS. So let's see. In this case we're using a little bit more sophisticated addressing modes, but it's basically the same filter. Actually, out of curiosity, just as an anecdote, I don't know if you can see it here, but the VM has a really, really specialized addressing mode, which takes one byte from the packet, gets the lower four bits, and multiplies it by four. Can anybody tell me what the purpose of that is?
Knowing what you know about IP, since everybody knew about tcpdump and networking protocols. Any guesses? Let's let him answer. Yeah, maybe what?
Okay, I think somebody in the first row knows the answer. Uh-huh, right. Yeah, that's the right answer.
So, a well-known field in the IP header is the header length, the IHL, and it's expressed as a number of 32-bit words in the lower four bits of the header's first byte. So what this does: you place here the offset of that byte, it gets the lower four bits, which hold the length, and multiplies them by four, and that gives you the length of the IP header in bytes, which is what you need to find where the TCP header starts. Just so you know how specialized this is. This is completely specific to network filtering.
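[In C, that computation is the familiar one-liner for the IPv4 header length; on Ethernet, byte 14 is the first byte of the IP header:]

    /* IHL: low nibble of the IP header's first byte, counted in 32-bit words */
    size_t ip_header_len = (packet[14] & 0x0f) * 4;  /* minimal header: 5 * 4 = 20 bytes */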
Okay, so now we know a little bit about what BPF is, and how, maybe unknowingly, you were using it when filtering packets and investigating what was happening on your networking interface.
But now let's talk about eBPF. eBPF stands for extended Berkeley Packet Filter. And actually, since it was introduced, people refer to it as just BPF, and to what we saw before, the tcpdump use of it, as cBPF, or classic BPF. eBPF comes with a much richer virtual machine: it has ten 64-bit registers, while in classic BPF we only have an accumulator and an index register.
And thanks to that more powerful virtual machine, it's easier to compile it, to have a JIT, a just-in-time compiler, to native CPU code. And most importantly, nowadays in the Linux kernel there's no classic BPF virtual machine interpreter anymore. Classic BPF is transpiled, you can call it that way, to eBPF. So nowadays, even with tcpdump, you will be using eBPF.
Again, maybe unknowingly. But the most important feature, or features, of eBPF is not that it has a more powerful virtual machine; it's that it gives us extra features beyond networking filters. One of them, which is what I'm gonna be talking about today, is dynamic tracing. And it offers other features, like maps and events, to let you communicate with your eBPF program in a more efficient manner, lowering even further the communication between user space and kernel space. Also, it's safe, in the sense that there's static analysis of your eBPF program in the kernel, by what they call the verifier, so that your eBPF programs cannot crash the kernel. The memory which is being accessed is checked, and loops are not allowed, which is a pretty big limiting factor, but it ensures that your kernel won't crash when it executes an eBPF program. And eBPF, apart from having a more powerful virtual machine, makes use of existing kernel technologies. One of them is kprobes, which operate in a similar way to what you do with a debugger in user space. Basically, it injects some code at any point in the kernel: it replaces an instruction with a jump, it will jump to your probe, execute whatever you want to execute, inspecting things in the kernel and whatnot, then restore the context and continue execution.
So before eBPF, and actually right now as well, because kprobes can be used independently of eBPF, you typically would use them from a kernel module, for instance; you needed to work with them in kernel space.
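[Not from the talk: a minimal sketch of that classic, pre-eBPF style, a kprobe registered from a kernel module in C against the in-kernel kprobes API. The probed symbol is just an example:]

    #include <linux/module.h>
    #include <linux/kprobes.h>

    /* Runs in kernel space, right before tcp_v4_connect() executes. */
    static int handler_pre(struct kprobe *p, struct pt_regs *regs)
    {
        pr_info("tcp_v4_connect() called\n");
        return 0;
    }

    static struct kprobe kp = {
        .symbol_name = "tcp_v4_connect",  /* symbol to hook */
        .pre_handler = handler_pre,
    };

    static int __init kp_init(void)
    {
        return register_kprobe(&kp);      /* patches in the breakpoint/jump */
    }

    static void __exit kp_exit(void)
    {
        unregister_kprobe(&kp);           /* restores the original instruction */
    }

    module_init(kp_init);
    module_exit(kp_exit);
    MODULE_LICENSE("GPL");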
That means that they're unsafe, in the sense that you can crash the kernel with them. And they're architecture dependent, because the code of the kprobe needed to be coded in whatever CPU instruction set the kernel was running on.
And they're pretty fragile, because you inject a kprobe at a symbol plus an offset, and different kernel versions will have different symbols, and data structures will be slightly different, so you need to be super careful about that.
A patch, or a workaround, to try to make up for that drawback is to use something called tracepoints, which are fixed injection points in the kernel. But I think that goes against the flexibility of doing dynamic tracing: using tracepoints is basically doing static tracing, because they're fixed, right? So when using kprobes with eBPF, which allows us to do dynamic tracing, instead of injecting a kprobe in kernel space, from a kernel module, injecting your native CPU code, which can break and crash the kernel, what we do is inject a piece of eBPF code, executed by the eBPF virtual machine. It's safe in the sense that it won't crash the kernel, assuming that the static analysis applied to that eBPF bytecode is correct. Of course, it's not safe from the security perspective: maybe you will be able to reveal details about the kernel which shouldn't be available to every user.
But the nice thing is that it's architecture independent. So if you write your eBPF program for, let's say, Intel 64-bit, it should run on ARM as well, because it's executed by a virtual machine.
But unfortunately, it will still be fragile. Why? Because kprobes are kprobes: you still inject at a kernel symbol plus an offset, which means that even if it's a piece of eBPF bytecode, which is architecture independent, it very much depends on the kernel version and on the versions of its data structures. eBPF comes with extra features, one of which is maps. We mentioned before how we use BPF filters to transfer only the packets we're interested in from kernel space to user space. Well, eBPF introduces something called maps, which let your eBPF program make a summary of what's happening in the kernel and insert it into a map. For instance, you may want to keep a histogram of network latency and only transfer it to user space from time to time, to be printed, for instance. That way, you don't need to transfer every single event and build the histogram in user space. eBPF also makes use of an existing kernel feature, perf events, which let your user-space program be informed about things happening in the kernel without needing to do any polling.
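[Not from the talk: a minimal sketch of the probe-plus-map combination in eBPF C, written against the conventions of the BCC toolkit that's introduced next; the map name is illustrative:]

    #include <uapi/linux/ptrace.h>
    #include <net/sock.h>

    /* In-kernel map: PID -> number of TCP connects attempted. */
    BPF_HASH(connect_count, u32, u64);

    /* BCC's naming convention auto-attaches this to a kprobe on tcp_v4_connect(). */
    int kprobe__tcp_v4_connect(struct pt_regs *ctx, struct sock *sk)
    {
        u32 pid = bpf_get_current_pid_tgid() >> 32;  /* upper 32 bits: tgid */
        connect_count.increment(pid);  /* summarized in the kernel, no per-event copy */
        return 0;
    }

[User space can then read the whole map at whatever interval it likes, instead of being woken up for every single event.]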
eBPF also comes with a compiler toolkit, called BCC, which is driven from Python, and it simplifies the development of eBPF programs by quite a lot. And here's where, well, I'm just gonna spend a couple of minutes talking about how we're using eBPF at Weaveworks. In Weave Scope, we need to track
all these connections in real time, meaning that if a process A connects to a process B, we'll represent it through an edge. If that connection disappears, we will remove the edge, and we need to do that for any processes or containers being monitored in the cluster.
Before eBPF, we were doing this by polling the proc filesystem, which is super racy and CPU intensive. You need to periodically go through /proc, over and over again, and that is super, super expensive. Plus, the proc filesystem is not made to provide you with that information: it's spread across different files, so you won't read them atomically, and so on. So you cannot catch short-lived connections. You have conntrack, on the other hand, which will tell you about connections but won't tell you about the PIDs of the processes involved, so you cannot draw that graph we need to draw. We started with a BCC-based tracker, but that gave us quite a few problems, because it has runtime dependencies: it comes with an LLVM backend and it depends on the kernel headers.
And also, we have the problem of fragility with kprobes, which depend on the kernel version and on data structures changing all the time. So how we solved this is, in cooperation with Kinvolk, who are organizing this conference, we created gobpf, which are Go bindings for eBPF.
We didn't want the runtime dependency on Python; Scope is coded in Go, so this was a really good fit for us. And we implemented an offset-guessing TCP tracker. Basically what we do is, as an initial phase in Scope, we make connections to known ports which we control, so we know what the fields of the socket data structure in the kernel should look like, and we evaluate what information we get at different offsets. So we adjust dynamically to the layout of the socket structure in the kernel, without depending on kernel headers and without depending on kernel versions. It's a bit of a complicated guessing game, but it lets us be version independent. And this is where we sit in the ecosystem in terms of using eBPF.
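[A sketch of the offset-guessing idea in C, with hypothetical names, heavily simplified compared with the real implementation: since we made the connection ourselves, we know which values must appear in the kernel's socket structure, so we scan a raw snapshot of it for them:]

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <arpa/inet.h>

    /* Find where the destination port lives inside a raw snapshot of the
     * kernel's struct sock, captured by a probe for a connection whose
     * destination port we chose ourselves. */
    static int guess_dport_offset(const uint8_t *sock_bytes, size_t len,
                                  uint16_t known_dport)
    {
        uint16_t needle = htons(known_dport);  /* ports are stored big-endian */
        size_t off;

        for (off = 0; off + sizeof(needle) <= len; off++) {
            if (memcmp(sock_bytes + off, &needle, sizeof(needle)) == 0)
                return (int)off;               /* candidate offset */
        }
        return -1;                             /* not found: retry with another probe */
    }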
That approach is much, much simpler to use, but of course it gives us a lot fewer features. And that was me. We have time maybe for one question, or not at all. Yeah, any questions?
Yeah, one question: is gobpf a compiler? gobpf, it invokes, I believe it invokes the compiler, yeah.
I think Alban is here, he knows more about it. He can actually answer that question for you.
Okay, so you can choose whether to use a compiler or provide the bytecode yourself. In our case, we're providing bytecode because we don't wanna depend on the compiler at runtime.
Any other questions? Okay, thank you.