
A gentle introduction to [e]BPF

Formal Metadata

Title: A gentle introduction to [e]BPF
Number of Parts: 47
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract: BPF is a Linux in-kernel virtual machine that is used for networking, tracing, seccomp and more. This talk will give an introduction to the extended BPF subsystem in Linux, an overview on how it works, show available tools to work with and explain possibilities as well as limits.
Transcript: English (auto-generated)
So, my name is Michael, I work at Kinvolk, we do Linux system-level cloud software. You may know that by now. Who here has used BPF before?
Okay, maybe 30%, 25%. So yeah, this talk is really focused on the basics, and I hope it helps you to understand the low-level concepts and the language if you want to start with it.
I will touch on the history a bit and then go over architecture, instruction set, and development tools. So, what is BPF? Some people describe it as an in-kernel bytecode virtual machine, others as an engine, so it's
a small language that can be used in the Linux kernel for certain tasks, and it has been there for a long time. Today there is classic and extended BPF, and when I say BPF, I mean extended BPF as
found in the Linux kernel today. Actually, classic BPF, when used today, is internally translated to extended BPF, so what actually runs is extended BPF. If you have used tcpdump, and that's probably more people than the group of BPF users,
you have already used classic BPF. If you haven't seen the talk this morning, it shows how you can get BPF output from tcpdump and so on and explains a couple of things about classic BPF, so I will skip this part of my presentation and encourage you to just go back to the talk from
this morning. So today we have extended BPF. Compared to the classic version, it has a richer instruction set, more features and more use cases. The most prominent use cases are networking (for example XDP, the eXpress Data Path), tracing, so for example you can
attach BPF programs to tracepoints or kprobes, which are facilities in the Linux kernel to trace program flow or to debug things, and security.
The main design properties are that it's very fast, on par with native code, since there is JIT compilation in the Linux kernel, and when extended BPF was designed, that was
one of the main goals: to make the language match how modern systems look today, so for example to map the registers available in BPF to actual registers of modern CPUs.
Another design decision was to make it possible that calls into or from BPF don't have any so-called foreign function overhead, so your code actually runs in the kernel context.
It's a general-purpose instruction set. There are 11 registers and a program counter, and you have 512 bytes of stack, which is not a lot, but it's surprising how much is possible even in such a limited environment.
The registers are grouped as follows: the first register holds the return value of functions and is also used as the exit value of your BPF program; then there are registers
used as arguments to internal functions, registers that are callee-saved, so you can use those for your own data, and register 10 holds the frame pointer.
eBPF also has maps. There are different kinds of maps, and maps are key-value stores, so when you create a map you can define what kind of data you want to put in there. Those are also used to make interaction possible between kernel space,
so the BPF program that runs in the kernel, and the application in user land. You can use maps for both: either you only use a map within your BPF program, or you use it, for example, to pass data to user land or to read data from user land.
So when you think about a firewall, maybe your user land application wants to signal that port 80 should be blocked, so you could have a table where the key is the port and the value is a flag for blocked or not blocked,
and your program in kernel space could check this map to decide what it should do. The second thing is helper functions. Depending on what type your program has, you have a couple of helper functions provided by the kernel. One helper function, for example,
is trace_printk, so you can write to a trace pipe. It's not the kernel ring buffer that you can see with dmesg, but there is a trace_pipe file in the kernel debug file system,
and you can write messages there. The third thing is tail calls into BPF programs, so you can chain programs. That's really useful when you have complex applications and want to have some kind of waterfall design,
so you could load a lot of programs and then decide in your first program which handler should take over the payload. Finally, there is the BPF pseudo file system. It can be used to pin programs
or maps. Usually programs and maps are tied to the lifespan of the process, so when your process dies, the BPF program or the map or whatever you have created is deleted
as well. If you pin it to the file system, that won't be the case; only when the last reference is gone will the kernel actually delete the program. That's also useful for inter-process communication, so again you can think of
a scenario where one application sets up rules and programs, and a second application checks the BPF file system to find out what is there and to use it.
When you make your bpf call to load a program, the kernel verifier comes into play. For all programs that you load, the verifier has to make sure that the program is sane and follows the rules, because you don't want
the program to, for example, enter an infinite loop and never exit again. This would mean your kernel is blocked and the system is dead, so there is an instruction limit. The verifier is pretty complicated, so there is a very long comment in the source
code, which I can recommend as documentation, but in my experience, it often still takes some back and forth between writing code, trying to load it, getting some cryptic message
from the verifier and trying to find out what is going on, so this can be a bit painful when you are just starting and don't have a feeling yet for what the verifier could be complaining about. All interaction with the eBPF subsystem happens through the
bpf syscall, so you have a single syscall, and what you actually want to do is then defined in the bpf_attr union. It's a pretty large union today with a lot of
fields which can potentially be set, so no matter if you want to load a program or create a map or look up a value in a map or delete a map or whatever, you always use this syscall, and you have to set the proper fields in the attribute corresponding
to what you want to do. All that together, the big picture could look like this. This is a use case where we attach a kprobe to a kernel function,
so it looks something like this. It starts with a program that you write. You can use LLVM to translate it to an ELF file. LLVM has a backend for BPF, which is pretty useful because, as we will see in a moment, writing BPF by hand is not so easy and nice. Once you have
this program, you can load it with the syscall, the verifier will check it, and if it's okay, it will be loaded. Then, in the kprobe scenario, the kernel will trigger your program whenever
the kernel function which the program was attached to is called, and your program can, for example, write to maps, and the user space application can also interact with those maps. Before we look at a program written in C, I will show you how BPF instructions
look at a low level. This is a very simple program that does nothing except return 11. What we do here is write the immediate value 11 into register 0. As we have learned before,
register 0 also holds the return value of your program, and then we exit the program. That's actually the most minimal program you can write in BPF. Having only an exit
instruction would not work because the verifier would reject the program, as only values that have been written before can be read. This is to protect kernel memory. The exit instruction reads the value in register 0, so we have to write to it first.
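The two-instruction program described above can be written out as raw bytes. The following is a minimal sketch in Python, assuming the standard 8-byte eBPF instruction layout (one opcode byte, one byte holding both register numbers, a 16-bit offset, a 32-bit immediate) and a little-endian host. The helper names `encode` and `r0_written_before_exit` are made up for illustration, and the check is a toy model of the "write before read" rule, not the kernel verifier.

```python
import struct

# Opcode building blocks, values as in the Linux UAPI headers.
BPF_ALU64 = 0x07  # instruction class: 64-bit arithmetic
BPF_JMP = 0x05    # instruction class: jumps (exit belongs here)
BPF_MOV = 0xB0    # operation: move
BPF_EXIT = 0x90   # operation: exit
BPF_K = 0x00      # source bit: operand is the immediate value


def encode(opcode, dst=0, src=0, off=0, imm=0):
    """Pack one 8-byte instruction (hypothetical helper, little-endian)."""
    regs = (src << 4) | dst  # dst in the low nibble, src in the high nibble
    return struct.pack("<BBhi", opcode, regs, off, imm)


program = [
    encode(BPF_ALU64 | BPF_MOV | BPF_K, dst=0, imm=11),  # r0 = 11
    encode(BPF_JMP | BPF_EXIT),                          # exit, returns r0
]


def r0_written_before_exit(insns):
    """Toy check only: was register 0 written before the first exit?"""
    written = set()
    for insn in insns:
        opcode, regs, _off, _imm = struct.unpack("<BBhi", insn)
        if opcode == (BPF_JMP | BPF_EXIT):
            return 0 in written
        # Simplification: treat every non-exit instruction as a write to dst.
        written.add(regs & 0x0F)
    return False


for insn in program:
    print(insn.hex())
print(r0_written_before_exit(program))      # program with the move
print(r0_written_before_exit(program[1:]))  # exit alone
```

Dropping the move and keeping only the exit makes the toy check fail, which mirrors why the exit-only program is rejected.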
If we wanted to load this program, we could do it like this. We take this bpf_attr union and set a couple of fields. First, I set the program type. In this case, with this demo program, it doesn't really matter. But, of course, if you
write real applications, you have to choose according to what you want to do. Then we give the size of the program, we give the instructions, and we also set the license. We set the license
because the kernel will not allow us to access a couple of things, like helper functions, if it's not a GPL license. You can compare that to what you may know from Linux kernel modules. There, you also have to declare the license of the module.
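As a rough illustration of the fields just listed, the sketch below fills in the program-load part of the attribute from Python using `ctypes`. The field order follows the `BPF_PROG_LOAD` member of `union bpf_attr` in the kernel headers; the class name `ProgLoadAttr`, the variable names, and the choice of the socket filter program type are assumptions made for this example. The syscall itself is deliberately not issued, since it needs a Linux kernel and sufficient privileges.

```python
import ctypes

BPF_PROG_LOAD = 5                # bpf() command number
BPF_PROG_TYPE_SOCKET_FILTER = 1  # arbitrary choice for a demo program


class ProgLoadAttr(ctypes.Structure):
    """Leading fields of the BPF_PROG_LOAD member of union bpf_attr."""
    _fields_ = [
        ("prog_type", ctypes.c_uint32),
        ("insn_cnt", ctypes.c_uint32),
        ("insns", ctypes.c_uint64),      # pointer to the instruction array
        ("license", ctypes.c_uint64),    # pointer to a NUL-terminated string
        ("log_level", ctypes.c_uint32),
        ("log_size", ctypes.c_uint32),
        ("log_buf", ctypes.c_uint64),    # pointer to a buffer for the verifier log
        ("kern_version", ctypes.c_uint32),
    ]


# Instruction bytes for "r0 = 11; exit" on a little-endian host.
code = bytes.fromhex("b70000000b000000" "9500000000000000")
insns = (ctypes.c_ubyte * len(code)).from_buffer_copy(code)
license_buf = ctypes.create_string_buffer(b"GPL")  # includes the trailing NUL

attr = ProgLoadAttr(
    prog_type=BPF_PROG_TYPE_SOCKET_FILTER,
    insn_cnt=len(code) // 8,               # every instruction is 8 bytes
    insns=ctypes.addressof(insns),
    license=ctypes.addressof(license_buf),
)

# The real load would be syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr)).
# It is left out here on purpose; this sketch only prepares the attribute.
print(attr.prog_type, attr.insn_cnt)
```

The log and kernel-version fields are left at zero here; they only matter once verifier output or version-dependent program types come into play.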
Finally, we can load the program. We do this with BPF_PROG_LOAD. We pass the previously defined attribute, and if the call is successful, we get back a file descriptor.
So, what does such an instruction look like? It has five fields. First is the opcode, one byte, so what do you want to do, and then
destination register, source register, offset, and immediate constant. It depends on what kind of operation you do; you don't always need all the fields, of course. If you want to write an immediate constant into the destination register, like we have done
in our program, we only set the destination register to register 0 and the immediate to 11. We don't have to set the other fields. The opcode is actually a mix of
fields. The define for the move instruction that we have used before looks like this. First, we say we want to move a value. We then say BPF_K, and this goes back to
classic BPF days. In this context it means we want to move an immediate value. BPF_K was the same in classic BPF, and there is another flag, BPF_X, that means we want to use a source
register. We have these three things in mind: operation code, source bit, and instruction class. There are two types of opcode encodings, as you can see, so it depends
on which instruction you use. If you have a load or store instruction, it is two bits, because here you have more options for what the source or the size could be. Here it is only immediate or source register. Here it is half word, word, double word, and so on. That's why
we need more bits here. Then there is the verifier log. You need the log to figure out what the error is. I would just always set this. When you do, make sure the buffer is large enough; that is important.
Then you get back the verifier log, and this tells you exactly what the verifier has seen in your program, what was loaded, and if there is an error, also what the problem with
the application is. If you set the level only to one, you wouldn't get those register dumps after each instruction, so you have to decide if you need this. I have said this before: some programs must match the kernel version. Kprobe programs
must do this, because the data types that are passed to the probe are version specific, so fields and offsets could change. You have to tell the kernel, yes, I want to run this
program on this version, basically guarantee that it will work. We have seen maps before, so there are different types of maps. I think it is between 10 and 20 today.
One special type of map is the prog array. It is a lookup table for other BPF programs which are loaded, so you can use a prog array to, for example, create a list of programs,
of handler programs, and then, if you want to chain programs, you can make the first program look up or tail call a program from a prog array. And, as I have already said, passing data between user space and kernel space is done with maps.
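The dispatch idea behind a prog array can be modelled in a few lines. This is a plain-Python sketch of the control flow only, not BPF code: the handler functions, the `prog_array` dictionary, and the use of the port number as the index are all hypothetical choices for illustration.

```python
def handle_http(packet):
    """Hypothetical handler program for HTTP traffic."""
    return "http: %d bytes" % len(packet["payload"])


def handle_dns(packet):
    """Hypothetical handler program for DNS traffic."""
    return "dns: %d bytes" % len(packet["payload"])


# Model of the prog array: index -> handler program. Unset slots are empty.
prog_array = {80: handle_http, 53: handle_dns}


def entry_program(packet):
    """First program in the chain: pick a handler and hand over to it."""
    handler = prog_array.get(packet["port"])
    if handler is not None:
        # In real BPF a successful tail call does not return to the caller.
        return handler(packet)
    # An empty slot means the tail call fails and execution continues here.
    return "no handler, falling through"


print(entry_program({"port": 80, "payload": b"GET /"}))
print(entry_program({"port": 22, "payload": b""}))
```

The fall-through branch is the part worth noticing: the first program still has to produce a sensible result when no handler is registered for a slot.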
A map definition could look like this: you have the type, a key size, a value size, and a maximum number of entries.
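To make the four fields concrete, here is a small Python sketch that packs them in the order used by the `BPF_MAP_CREATE` command and then models the firewall table mentioned earlier in the talk (port as key, blocked flag as value). The helper names, the 4-byte key and value sizes, and the table size of 1024 entries are assumptions for this example; an ordinary dictionary stands in for the kernel map.

```python
import struct

BPF_MAP_CREATE = 0     # bpf() command number
BPF_MAP_TYPE_HASH = 1  # one of the available map types


def map_create_attr(map_type, key_size, value_size, max_entries):
    """Pack the first four __u32 fields used by BPF_MAP_CREATE (native order)."""
    return struct.pack("=IIII", map_type, key_size, value_size, max_entries)


# Hypothetical firewall table: 4-byte port number -> 4-byte blocked flag.
attr = map_create_attr(BPF_MAP_TYPE_HASH, key_size=4, value_size=4, max_entries=1024)

# Stand-in for the kernel map: keys and values are fixed-size byte strings.
blocked_ports = {}


def block(port):
    """User-space side: mark a port as blocked."""
    blocked_ports[struct.pack("=I", port)] = struct.pack("=I", 1)


def is_blocked(port):
    """Kernel-side lookup in this model: missing key means not blocked."""
    value = blocked_ports.get(struct.pack("=I", port))
    return value is not None and struct.unpack("=I", value)[0] == 1


block(80)
print(len(attr), is_blocked(80), is_blocked(443))
```

Keys and values are kept as fixed-size byte strings on purpose, since the kernel only knows the sizes given at creation time and not the types behind them.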
Actually, there are more fields today, and it also depends a bit on which loader is loading the map. For example, tc is also able to load BPF handlers for traffic control,
and they have a field which allows you to pin a program to a namespace. LLVM learned about BPF in version 3.7. This one is important: you need to inline everything,
and you best define a macro, because inline alone doesn't force the compiler to inline code, so you should set the always_inline attribute; only then can you be sure that the compiler does
inline the code. You can do printk debugging. Clang allows you to add debug info. That's nice because you can objdump your object after the build and
compare it to what the kernel verifier gave you back in the verifier log. Since the objdump output is annotated with C code, you can get at least a clue where the problem is in your program. I have only about four minutes left, so maybe I have to
rush through things a bit. I will give you a quick demo. So, here I have a rather simple and stupid program which
counts syscalls. For that, I use a BPF program that gets attached to the raw_syscalls sys_enter tracepoint. No matter which syscall you use, this tracepoint will trigger every time. Since this is too much output, I limit it to the read syscall, which has the ID 3. I will just start this.
I do this for only a single PID, so this is this bash here. You can see the PID is the same
as here. When I do sync here, we can see the read syscalls are detected. For the BPF program,
I used LLVM to translate it to an ELF object. For the user space part, I used gobpf in a small tool to load the program. You could also use BCC.
I would especially recommend it when you just want to start playing with BPF, because it has very nice Python and Lua bindings which make it really easy to use BPF for tracing purposes.
They also have a nice list of which Linux version introduced which BPF feature. That is the best place to go if you have to find out whether you can use a feature on the kernel that is in your Linux distro, for example. BPF_PROG_TEST_RUN is great, available since Linux
4.12. With that, you can take your SKB and XDP programs, give them to the kernel, and provide data which will then be passed to the program as packet data, so it is possible to
unit test your programs without having, for example, network hardware which is capable of running BPF programs. This would look like this. You again fill the bpf_attr, the test part,
with your program, you call it, and if set, you get the data, so what the network packet, for example,
looked like after it was handled by your handler, the return value, and the duration. That's pretty nice because that's not something which was possible before. You had to set up your virtual network environment and then start programs and intercept the traffic, and with that
you can test the handlers alone and just pass in a socket buffer. There are a couple of sysctl options. Regarding tracing, there is an excellent resource by Brendan Gregg,
and there are a couple of projects which are prominent users of BPF, so there is a lot to learn in the source code. The talk this morning was about tcptracer-bpf, actually.
And that's it. Thank you. Would you say that eBPF is suitable for modifying packets
as they go through the kernel? Yes, you can modify packets. I don't know off the top of my head which program type you have to choose because I mostly do tracing and don't have real experience with networking, but I think most users actually use it for networking.
But you have to look up the details, I don't know. More questions? Okay, thank you.