
BPF and the future of the kernel extensibility


Formal Metadata

Title
BPF and the future of the kernel extensibility
Author
Alexey Starovoitov
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Language
English

Content Metadata

Abstract
To bring cool ideas to life, the Linux kernel and user space need to work together. The core kernel is a stable ABI base, a common denominator for everything that builds on top. In contrast, BPF is a specific know-how, a secret sauce of the cool idea. The Linux kernel needs BPF to stay relevant, and BPF has to become friendlier to programmers. This talk explores the steps taken towards this long-term goal, from recently introduced BPF-to-BPF functions and type information to bounded loops, memory allocation, and beyond.
Transcript: English (auto-generated)
Welcome, everyone. My name is Alexey Starovoitov. I work for Facebook. Together with Daniel Borkmann from Covalent, I maintain the BPF subsystem in the Linux kernel.
By the show of hands, how many people here have never heard of BPF until today? Have not heard? Okay. This presentation is for you as well.
No, seriously. This is for everyone. I will go quickly from the past, how it began, to what shape BPF takes today, and give a glimpse of where we're going this year and in the future.
First of all, it's most important to understand the goals and non-goals for BPF as a whole. First of all, it's a way to safely and easily modify kernel behavior. Safety was in BPF from the start.
It's kind of baked into the architecture, and I think we did a pretty good job keeping the safety in check. There were, of course, a couple of bugs over the years, but I think it actually worked pretty well. The "easily" part, on the other hand, didn't work quite so well.
So, BPF is hard to use. It's the number one complaint we hear all the time, and that's what we're trying to address pretty much nonstop. Non-goals are equally important. When we first started pushing patches into the kernel,
four-plus years ago, people were saying: oh, you're trying to do DTrace in the kernel, that's what you're about, you're introducing a dynamic tracing interface into the kernel. If you're doing so, why don't you use a different virtual machine, like the one DTrace used on Solaris? Like ktap did.
Then networking folks were saying: oh no, BPF is actually there to replace OVS, or it is there to replace iptables. This constant fight for the turf existed from day one, because people misunderstood what BPF is trying to achieve. Yes, you can do dynamic tracing.
Yes, you can do full kernel introspection with BPF, but that's a non-goal for the architecture and for the instruction set as a whole. To look back at how we started: long ago, when BPF was introduced,
the classic one, by Van Jacobson, it was pretty cool back then, but it had only two shapes: you could use it as tcpdump or as seccomp. tcpdump would be filtering packets through the pcap interface, and seccomp was the syscall filtering.
Both are cool, and seccomp is still heavily used, but those were the only two forms. That was the underlying restriction of the architecture, so to speak. The classic BPF was designed with a packet filter in mind. It was an instruction set for packet filtering,
whereas the extended one came as a generic infrastructure to extend the kernel. That's the fundamental difference. How did we extend it? The development began in late 2011. The first version was just an ISA,
an instruction set architecture, and, little-known fact, the first verifier was operating on a reduced x86 instruction set. We hacked GCC to reduce the number of things x86 can generate, but that didn't go quite well, for various reasons.
In the end, we just said: no, screw that, that's not going to work, since it's really x86-only, and we need to run on all architectures. So we reused only the opcode encoding from the classic BPF, widened the registers, and made some other modifications. In the end, it still echoes the classic instruction set,
yet it's vastly different. It looks more similar to x86 than to anything else, with input from all the other architectures taken into account, where the x86 ISA and the ARM64 ISA were weighted the most in terms of influence on the BPF instruction set,
but the quirks of x86 were removed, like the 16- and 8-bit sub-registers. BPF has only 32-bit sub-registers, to better match ARM64, and certain other things were removed as well. We had a GCC backend. It's still partially alive,
and some folks were interested in upstreaming it, but it never happened because of the lack of an assembler, whereas LLVM had an integrated assembler; that's why the whole thing has been upstream in LLVM since 2015. It went through quite a few versions and changes, and the first version of BPF in the kernel
landed in 2014. Just reflecting back, that was four years ago, and when it happened, when BPF was sort of unleashed on the world, this is how it looked: lots of new things, a new toy, you can probably do cool things with it,
but there is no manual. People started creating different things, and lots of real, I would say, rocket ships were built. At Facebook, we've built Katran, Droplet, Pwnd, and a bunch of other things. I believe, as of today, only Katran is open-sourced,
and outside, in public repositories, there are the tracing tools: the bunch of BCC tools, bpftrace, which is still in very, very active development, SystemTap's BPF backend, and many others. On the networking side, Cilium is probably the biggest open-source project
leveraging BPF in the networking space, and there are many others on the tracing side. Because of this grassroots type of approach, all of the ships that were built kinda look the same. Once people figured out, with, let's say, Katran, how to use XDP in an efficient way,
everyone else started to copy it, and the other solutions, like Droplet, are somewhat similar to Katran. If you haven't heard of Katran: this is Facebook's production load balancer. Its key advantage versus kernel-bypass solutions is
that it leverages the kernel stack at the same time. On the same host, this is the picture on the right, we run the XDP in-kernel packet processor, which operates at line speed, and the same host uses the standard Linux networking stack with the back-end application
doing all sorts of other stuff. So all the cores are equally loaded, and this is not something you can do when you completely bypass the stack, unless you start doing all sorts of extra hops between kernel and user space, back and forth.
So, as I was saying before, the main complaint about BPF today is that it's hard to use. So I keep asking myself the question: if it's hard, why would I use it? I think the answer, to me, is that to build,
to implement the cool ideas that people have in mind, a user-space-only solution is not enough. Kernel and user space need to work together to implement these great ideas. The boundary between user space and kernel space exists only in people's minds.
What BPF made people realize is that now we can blur this boundary, that the solution can live in both kernel space and user space. What was happening before, when people thought the kernel could not accommodate them, was that they would either develop a kernel module or completely bypass the kernel.
The bypass solutions are DPDK, obviously, from Intel, and SPDK, then the ScyllaDB and Seastar technologies, then there is ODP from ARM, the Snabb switch, VPP from Cisco, then Google is doing its own stuff, and so on. Why? I think the answer is that the kernel
is fundamentally hard to extend. That's why I strongly believe that the Linux kernel needs BPF to stay relevant. How do the programs look today?
These are all the little helpers that a program has to figure out how to use and glue together to create this nice program that we can all use. These programs are still loop-free, which is a request that keeps coming to us,
one that we're going to address, and lock-free, meaning that no locking is allowed. By design, the safety comes with a cost. We cannot just say: well, run everything in there. If we had allowed loops from the very beginning, it would have been easy to make a mistake and create an endless loop;
the kernel would hang, and BPF would have no users. It would be no different from kernel modules. So, as I was saying, the goal is safety first, then ease of use later. Ease of use, we still work on it, but safety is why there are no loops and why there is no locking.
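[Editor's note] To make the shape of such a program concrete, here is a minimal loop-free, lock-free sketch in restricted C, using today's libbpf conventions; the map and all names are illustrative, not from the talk:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* One per-CPU counter; compiled with: clang -O2 -target bpf -c prog.c */
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_count SEC(".maps");

SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
    __u32 key = 0;
    /* Helpers are the only way to reach kernel state; the verifier
     * checks the pointer they return. */
    __u64 *val = bpf_map_lookup_elem(&pkt_count, &key);

    if (val)          /* the NULL check is mandatory, or the program is rejected */
        (*val)++;
    return XDP_PASS;  /* no loops, no locks anywhere in the program */
}

char LICENSE[] SEC("license") = "GPL";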
And hooks, where you can attach programs: this is the hierarchy of the different hooks present in the Linux kernel today. Most of the tracing hooks here are read-only in terms of what they can do to the kernel. Recently, we added an error injection facility, which I think is pretty cool. So from the tracing hooks,
we can modify the kernel behavior and inject fake errors to check all the facilities. The networking hooks operate at different layers: XDP operates at the lowest layer, on raw packet buffers before the kernel stack, and sometimes in the drivers, and can do anything with them.
And one of the use cases for XDP in some companies is what people call the big red button. When bad things are happening, when a zero-day vulnerability is found, a BPF filtering program preventing the zero-day can be deployed quickly, within hours, across the fleet, and it will stop the attack.
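[Editor's note] A hedged sketch of what such a "big red button" filter might look like; the UDP port stands in for a real attack signature, and everything here is invented for illustration:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int big_red_button(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr  *ip  = data + sizeof(*eth);
    struct udphdr *udp = (void *)ip + sizeof(*ip);  /* assumes no IP options */

    /* Every access must be bounds-checked against data_end,
     * otherwise the verifier rejects the program. */
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP) || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    if (udp->dest == bpf_htons(1900))  /* hypothetical attack signature */
        return XDP_DROP;               /* dropped before the kernel stack */
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";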
The TC layer is one above. Lightweight tunneling is yet another layer above. Reuseport is the newest addition, where we can, in a smart way, do load balancing across different sockets. The flow dissector is a similar kind of mechanism.
But the cgroup-based hooks made the biggest impact on the kernel and on BPF, one that no one anticipated, especially Daniel, when he implemented the first version. Now it's the fastest-growing set of hooks, I would say.
Initially, it was just layer-three hooks for ingress and egress. Then we added bind, connect, UDP sendmsg, socket create, and the device-controlling hooks. Sockmap is a special one that's not well known, but if you saw the presentation
from Thomas Graf earlier today, sockmap is the hook that's used by Cilium to implement really, really fast socket redirect between different applications at layer seven. And another set is TCP-BPF, which is somewhat different from the other hooks: they act at run time instead of compile time.
And with them we can fine-tune, their main purpose is to fine-tune, TCP congestion algorithms for cases that the congestion algorithm cannot express by itself. For example, in a data center,
you would have a different mean RTT if your destination server is within the rack than if it's in a different data center. In TCP, you cannot really express that. So if you're a big company like Google, you can have proprietary patches that understand how IP addresses are assigned and use that, or you can use TCP-BPF to give this extra knowledge
to the kernel on how to fine-tune the TCP stack for the most optimal throughput.
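[Editor's note] A hedged sketch of such a TCP-BPF (sockops) program, modeled on the kernel's tcp_synrto sample; the 10.0.0.0/8 "same data center" prefix and the RTO value are assumptions for illustration:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("sockops")
int tune_syn_rto(struct bpf_sock_ops *skops)
{
    int rv = -1;  /* -1 means "keep the kernel default" */

    /* Peer within the (assumed) data-center prefix 10.0.0.0/8? */
    if ((bpf_ntohl(skops->remote_ip4) >> 24) == 10) {
        switch (skops->op) {
        case BPF_SOCK_OPS_TIMEOUT_INIT:
            rv = 10;  /* much shorter initial SYN RTO, in jiffies */
            break;
        }
    }
    skops->reply = rv;  /* hand the tuned value back to the TCP stack */
    return 1;
}

char LICENSE[] SEC("license") = "GPL";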
So this, I guess I will skip all of this. The helpers, eh, not that interesting. There are a lot of them, a lot, and stuff keeps being added. Yeah, just a lot. So the one that's still probably the biggest obstacle,
and the one that some of the BPF users hate the most, is the verifier. What they say is that to write a program, they need to constantly fight the verifier to get the program accepted.
So that's unfortunately exactly the case today. But something happened recently that we're only realizing now: about six months ago, we introduced BPF-to-BPF calls, where within one program we can have different sub-functions,
and we JIT these functions as independent kernel functions with arbitrary arguments and an arbitrary return value, and we were able to teach the verifier to understand this whole graph of different functions with arbitrary pointers being passed
between them. So just imagine: before, we had programs with a fixed context, something like one pointer in and a single integer returned, and that was the scope of verification. From this, we went to an arbitrary graph of functions with arbitrary pointers passed between each other.
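[Editor's note] An illustrative sketch of such a call; the names are invented, and __noinline (from libbpf's bpf_helpers.h) keeps the sub-function a real BPF-to-BPF call that the verifier follows, context pointer and all:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* A real BPF-to-BPF call target: __noinline stops clang from inlining it. */
static __noinline int parse(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* The verifier re-checks packet bounds inside the callee. */
    if (data + 14 > data_end)  /* Ethernet header */
        return XDP_PASS;
    /* Drop locally administered source MACs, purely as an example. */
    return (((__u8 *)data)[6] & 0x02) ? XDP_DROP : XDP_PASS;
}

SEC("xdp")
int entry(struct xdp_md *ctx)
{
    /* A call at the BPF level, not an inlined copy: the verifier
     * follows the ctx pointer into parse() across the call. */
    return parse(ctx);
}

char LICENSE[] SEC("license") = "GPL";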
When we started implementing the verifier, it just seemed impossible, an intractable problem, but we made it, and we still believe that it is safe. So this changed the developers' minds.
So, where we're going right now: Joe from Covalent is now adding pointer-tracking primitives, let's say, to the verifier. We'll be able to do things that were unthinkable before. The first use case is to return a socket
from a helper. Both from XDP and from the TC layer of the networking stack, we can look up a socket that's controlled by the bigger networking stack and return it back to the program. And the verifier will make sure that the program releases it back to the kernel, because of reference counting. It will make sure that this pointer does not leak,
that we don't accidentally return it, and so on. So it analyzes both the control flow and the data flow of the program, tracking all the pointers and making sure that the program is valid from this data-flow perspective. I would say some of the standalone static analysis tools
do this as well. Most compilers do not, because they don't care; it doesn't matter from an optimization and performance standpoint. Static tools do, but doing this in the kernel, that's pretty revolutionary. And it's very close to landing. I hoped it would land today, but it will hopefully land a few days from now; the patches are practically ready.
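[Editor's note] For reference, this work landed shortly after the talk (Linux 4.20) as bpf_sk_lookup_tcp() and bpf_sk_release(). A minimal sketch of the reference-tracking contract, with the tuple filling elided:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int check_sk(struct __sk_buff *skb)
{
    struct bpf_sock_tuple tuple = {};  /* a real program fills this from the packet */
    struct bpf_sock *sk;

    sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                           BPF_F_CURRENT_NETNS, 0);
    if (!sk)
        return TC_ACT_OK;

    /* ... inspect sk ... */

    bpf_sk_release(sk);  /* drop this line and the verifier rejects the
                          * program with an "unreleased reference" error */
    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";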
What it will allow us to do is introduce malloc-like functionality, to allocate and free objects from a BPF program and make sure that we don't leak memory. Just imagine how cool that is: safe memory allocation, lock, unlock.
Before, we couldn't do things like spin locks; all the programs run in one non-preemptible section, in one RCU section. Preempt-disable and the RCU lock are done outside of the program, because we couldn't track this. Now, with this functionality, with these smart bits being added to the verifier, we can reduce the RCU section,
we can do RCU inside the program, we can do spin locks and whatever else. So this is huge.
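[Editor's note] The locking anticipated here did land later (Linux 5.1) as bpf_spin_lock, with the verifier proving every lock is paired with an unlock on every path. A minimal sketch with invented names:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct counter {
    struct bpf_spin_lock lock;  /* the lock lives inside the map value */
    __u64 value;
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct counter);
} counters SEC(".maps");

SEC("xdp")
int locked_inc(struct xdp_md *ctx)
{
    __u32 key = 0;
    struct counter *c = bpf_map_lookup_elem(&counters, &key);

    if (c) {
        bpf_spin_lock(&c->lock);
        c->value++;                  /* a real critical section, in BPF */
        bpf_spin_unlock(&c->lock);   /* forgetting this fails verification */
    }
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";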
But more is coming. Loops are the hardest to do, and, well, technically, I believe the halting problem is undecidable. We're going to solve it too, with conditions, of course. There are two proposals, from the Covalent and Solarflare folks. We've been baking them for about a year now, trying to decide which one is going to work long-term.
Both actually have their pluses and minuses, and neither is in shape yet to merge upstream. But the work is continuing, and everyone is pushing hard. So hopefully next year, when we have the next All Systems Go, I can come and say: look, you can do loops now.
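[Editor's note] For context, this is how loops were written at the time: fully unrolled by clang, so the verifier never sees a back-edge. Once the bounded-loop work described here was merged (Linux 5.3), the same loop verifies without the pragma. The scanning logic is invented for illustration:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int scan(struct xdp_md *ctx)
{
    __u8 *data = (void *)(long)ctx->data;
    __u8 *end  = (void *)(long)ctx->data_end;
    int zeros = 0;

#pragma unroll                       /* not needed once bounded loops landed */
    for (int i = 0; i < 16; i++) {   /* trip count is provably bounded */
        if (data + i + 1 > end)
            break;
        if (data[i] == 0)
            zeros++;
    }
    return zeros > 8 ? XDP_DROP : XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";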
But that's not it. We have, well, we discussed the indirect calls, basically the tail-call stuff, and,
well, anyway, there is so much stuff going on right now. This is just a short list of the features that different companies and different people are working on, all of them, I think, equally cool, with different use cases for each. This is coming. And at the end, today we have this crutch
inside the verifier that only allows the analysis of 128K instructions in one program, and programs are limited to 4,000 instructions. Why was that the case? We couldn't do anything better. We just did 4K because that was the limit for the classic BPF, and 128K because we increased it like five times.
That's how it was. So now we're getting bold, and the next target is to get to one million instructions. Then the programs will not be your small lock-free, loop-free programs, but something big, so that a real Rubik's Cube algorithm
will be expressible in it. I do mean it; that will happen. But that's not all, there is still more and more to go. Another pain point: ease of use, as I said.
BPF gets its performance from the JIT. Folks who have tried to debug Java or Node.js know that doing performance analysis for JITed languages is a pain in the butt. In the kernel, with BPF, it's even harder, because the JIT is done by the kernel and not by user space.
If it were user space, user space could tell the performance tools the association between the JITed code and the original. In the kernel, this association is lost. What the kernel runs has pretty much nothing to do with the code that was written initially. So we need to bridge this gap and keep the association
all the way from the source code, through the instructions, through all the optimizations that the kernel does, to the JIT and everything that the JIT does, tie it all together, and return it back, so we can finally do proper performance analysis and improve debuggability. So type information is coming, source line information,
function prototypes, and many other things. Type information we're doing through what we call BTF. It stands for BPF Type Format. Pieces of it have already landed. The most interesting bit will be when we start converting vmlinux and embedding the BTF type info inside vmlinux; it will add an extra build step.
Currently, DWARF for vmlinux is about 300 megabytes. The BTF is currently 10. So it's a 30x reduction, but still too high. Our target is to reduce it down to one megabyte only.
But this compression currently takes five minutes, which is not acceptable. Once we get to seconds of runtime and one megabyte, we'll get it upstream. Cgroup local storage is another one; we're stealing ideas, obviously, from user space.
User space has had thread-local storage for years. Why not do it in the kernel? So our first local storage is the cgroup local storage that Roman implemented. We have two flavors, a regular one and per-CPU. What it does is the same as in user space:
it avoids extra lookups and adds performance. And in the kernel, we obviously care about zero overhead.
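[Editor's note] This is what landed as BPF_MAP_TYPE_CGROUP_STORAGE, plus the per-CPU flavor: the bpf_get_local_storage() helper hands the program its storage directly, with no hash lookup on the hot path. A minimal sketch, names invented:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE);
    __type(key, struct bpf_cgroup_storage_key);
    __type(value, __u64);
} byte_count SEC(".maps");

SEC("cgroup_skb/egress")
int count_egress(struct __sk_buff *skb)
{
    /* No lookup: the storage for this (program, cgroup) pair is
     * resolved by the kernel before the program runs. */
    __u64 *bytes = bpf_get_local_storage(&byte_count, 0);

    *bytes += skb->len;
    return 1;  /* 1 = allow the packet */
}

char LICENSE[] SEC("license") = "GPL";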
And with this, I'm out of time. So to recap what we've talked about: BPF in the future will be safe and will be easy to write. Thank you. Please ask questions.
Yeah, I will. Safety: in some fields, safety means real time and determinism. And considering also the JIT step, is there some idea of how to evaluate the determinism,
the real-time properties, of these programs? So, the JIT part is done during load. JIT is probably the wrong word, and it always rubs me the wrong way when I say JIT, because to me, a compiler engineer
by background, JIT means just-in-time. BPF is not doing just-in-time, actually. It's converting BPF instructions into native assembler at load time. So it doesn't add to the runtime cost at all. Whatever time it takes to verify and JIT the program is spent in user context,
and that time is charged to the process doing the load. So the only cost that can affect real time is the final execution of the program. And the final execution is currently limited by these 4,000 instructions. If it's used in the wrong place, yes, it will start adding up.
If we're constantly doing 20 hash table lookups for every packet, yes, the network processing will be slower. So this is unavoidable to some degree. We have a few ideas for how to automatically mitigate
even such scenarios. But so far there are too many cool ideas to do and not enough people to implement them. This is definitely on the to-do list. Thanks.
So with BPF-to-BPF calls, you introduced the possibility of indefinite recursion. Do you aim to detect this by analyzing the call graph and not allowing loops? Or do you just limit the stack size? How is that done? It's checking for recursion, that calls don't call back into each other,
and it's checking the maximum call stack that each function takes. The initial stack size limit was 512 bytes, and it's still the same even for multiple functions. But the majority of the functions don't need 512 bytes, so they can just call each other until the total reaches 512.
So there are a bunch of these, to some degree artificial, limits. The nesting depth of calls is currently eight, I think, but this number was just picked arbitrarily. Why eight? Just because. Why 512 bytes? Well, the kernel is probably okay if we consume 512 bytes, especially with some kernel functions consuming two kilobytes.
And the static analyzer is able to do this? Or is that a runtime failure then? That's static analysis. All of this is done by static analysis in the verifier. The only runtime check we currently have is the tail call limit. Tail calls actually can be recursive, and tail calls are limited to 32.
32, yes. So for all of these new things that you mentioned, what's the best... am I muted?
I'm speaking into the microphone. Okay, cool. What's the best way to keep up with all of the cool stuff that's happening in the BPF world? Like, except bugging you, maybe. Obviously, attending All Systems Go.
Well, both seriously and... It's an excellent question. Right now we're spinning up a website at Facebook, and there will be a blog there. So we will try to keep track of posting all the latest news there.
And following the networking mailing list, I guess. But it's an excellent question. The IO Visor meetings, yes, we have biweekly calls where we discuss all of this stuff. It's public. Anyone is free to attend.
Yeah, it's pretty much a BPF call. We still call it IO Visor, but yeah. That's not recorded, right? It's not recorded, no. The light is off. Do you see a case for a set of common BPF programs
that live perhaps in the kernel source code, or something very close to it, as opposed to passing them around on GitHub from project to project as we discover bugs in our TCP dissectors? Yes, that's what I sort of very briefly skipped here: libraries. That's exactly what it is. We want to push some of this stuff
so that it just lives in the kernel, and common primitives will just be there. One question: the Solaris tracing stuff, DTrace, has its own language, right? BPF is currently only a machine-code kind of thing, right?
And now you have this scheme where you can use LLVM to also generate it. In that case, you would write the programs in something that resembles C, right? But do you have any idea where you want to go with this? Do you expect that there will be a somewhat accepted standard language
for writing BPF rules later, and is it C or not C? Do you want to do any kind of tool set, or do you expect that this is up to third-party people who want to write their own tools with their own languages? What's the plan? Excellent, I love this question. There is so much in it, to me,
because the philosophy of BPF is to be the construction set and let everyone build on top, including languages. Why did I strongly push C from the very beginning? Because of ease of use. Everyone knows C, so let's make C work as well as it can.
At the same time, we're trying to enable everyone else as well. bpftrace is a project that has a pretty much DTrace-like language, like DTrace-minus-minus, a DTrace syntax, and it doesn't generate any C. It generates LLVM IR and feeds into the LLVM backend directly. That's one project.
Then there is another project called ply, P-L-Y. It also has a DTrace-like language, but it generates BPF instructions directly, without any LLVM, so it's best suited for embedded environments. Then there is another project that takes Python, looks at the Python internal representation, and generates BPF code
without any LLVM, just based on the Python IR. So all of this is blooming and growing, and there will be new languages. We constantly have discussions that we need to hire a language designer
to design a language for BPF. It may happen one day. The way it looks, the C one is definitely the one that's going to be supported by upstream, and hence probably has an edge over the others, right? Yes. I'm biased just a little bit toward C.
Okay, I have a question. It's me. So, BPF-to-BPF calls: you said that they are now being verified by the verifier, and you can do the call graph analysis also. So what if you give it as input a program which generates a massive call graph? If this analysis is done inside the kernel, how are you going to handle that?
You mean the concern that it somehow will take an infinite amount of time? Yeah, the verifier runs out of time. Do you have a limit for that? Yes, so it has the limit,
that crutch, what I call the limit, of 128K instructions analyzed. So even if the whole program is only 4,000 instructions long, if the verifier sees that the complexity is exploding and it reaches 128K, it'll say: oh, I give up. So this is a coarse way to prevent
ugly programs from consuming too much CPU time during verification. So, when I try to use BPF for debugging, one challenge that I experience, probably because I'm not used to it, is that the program compiles
but when it gets loaded, the kernel verifier rejects it. And it was really difficult to map what the verifier says back to my code. Are there plans to improve in that area? Yes, so ease of use, understood. It's not easy today, and yes, people constantly fight the verifier, and that's exactly it. So today,
from the code, whether it's C or a DTrace-like language or whatever else, when the verifier says, oh, in this instruction you load from uninitialized memory, most of the time users have no idea what this assembler instruction corresponds to. So this step, the type information
and the source line information that we are currently very actively working on, is exactly to address that. At first we thought we'd just do line number information, the way typical user space applications do. We decided to go a step further, and we'll just push pieces of the source into the kernel.
So there, the verifier can say: this piece of the original code is misbehaving, look for the bug there. Okay, I'm guessing, well, okay, one more question, and we're out of time.
Hello, what's the most difficult thing to do as a maintainer of BPF? The most difficult thing is to say no. When there are so many cool things happening, the most difficult thing is to figure out the balance,
to not get too deep into a specific niche that would prevent the core from growing faster.
I can give a more specific example, but that, to me, is the hardest. Thank you.