XDP (eXpress Data Path) as a building block for other FOSS projects
Formal metadata
Number of parts: 561
License: CC Attribution 2.0 Belgium: You may use and modify the work or its content for any legal purpose, and reproduce, distribute and make it publicly available in unchanged or changed form, provided you credit the author/rights holder in the manner specified by them.
Identifiers: 10.5446/44671 (DOI)
FOSDEM 2019, part 424 of 561
Transcript: English (automatically generated)
00:05
Magnus is still in the first room, but I'll take the first part of the presentation. So, what is this XDP? It's a new programmable layer in the network stack,
00:20
sort of before the network stack, and we are seeing similar speeds as DPDK. We'll get into more details of that, and we actually have performance comparisons. XDP sort of ensures that the network stack stays relevant. It operates at layer two to layer three, and the network stack operates at layer four to seven. Get into a little bit more details.
00:41
So, I want to admit that we are not the first mover. Like, there's other solutions, but we believe that it is different and better, because our killer feature is that we are integrated with the Linux kernel, and we also have flexible sharing of the NIC resources. So, a little bit more details of what XDP is.
01:02
It's an internal fast path, and it is this programmable layer in front of the traditional network stack. It's already part of the upstream kernels, and actually also RHEL 8, and it operates at the same speeds and level as DPDK does,
01:21
and we are seeing these 10x performance improvements, and one of the points of being the internal part is that we can accelerate certain use cases inside the kernel. For example, forwarding. I'll get into what you'll say about that. Then, the second part of the presentation, we're going to talk about AFXDP,
01:41
which is address family for XDP sockets, and it is what I categorize as a hybrid kernel bypass facility, because we are allowing you to choose what packets should bypass the kernel all the way down to driver layer, and deliver that into a socket
02:02
that is accessible from user space, but we have this flexibility of the BPF program, choosing what we want to redirect and not redirect out of the kernel. So, why is XDP needed? It is basically because the network stack
02:23
had been optimized for layer four to seven, and we are taking this performance hit. It's like once we get the packets, the Linux network stack's socket buffer, the SKB, is named the socket buffer because it assumes, back in the day when it was written like 25 years ago,
02:42
that everything has to go into a socket, but there are certain use cases for layer two and layer three where we can do something faster and not take these performance overheads. And that is what XDP is, it operates at this layer.
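For illustration, the programmable layer being described is just a small restricted-C function compiled to eBPF and attached at the driver's XDP hook. A minimal sketch (not code from the talk) that lets every frame continue into the normal stack could look like this:

    #include <linux/bpf.h>

    #ifndef SEC
    #define SEC(name) __attribute__((section(name), used))
    #endif

    SEC("xdp")
    int xdp_prog(struct xdp_md *ctx)
    {
        /* ctx->data .. ctx->data_end bound the raw layer-2 frame, before any
         * SKB has been allocated. Returning XDP_PASS hands the frame to the
         * regular network stack; XDP_DROP would discard it right here. */
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";

Such an object file can be attached with, for example, iproute2: ip link set dev eth0 xdp obj xdp_prog.o sec xdp (device and file names are illustrative).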
03:01
So, I want to admit we are not the first mover here, but we believe it's different and better. So, there's actually a lot of kernel bypass solutions out there. There's netmap, DPDK, which is I think the most prominent one at the moment, PF_RING, I think there's also some people
03:21
in the room that made that, like long before we had something called XDP. And Google did some solution where they have Maglev, which they all of a sudden published a paper about after DPDK came out and claimed they did this many years before.
03:40
There's OpenOnload, we have Snabb Switch, which is also here, and there's actually a solution very similar to XDP, which is a commercial solution from HAProxy called NDIV. But for all of these kernel bypass solutions, we're hoping we can some way find a way
04:03
to integrate with them, and maybe they can use AFXDP inside of their solution, so people that have been writing software for those systems can still continue using that by using the AFXDP stuff. So, why is it different and better?
04:22
Well, it's not bypass, it's the internal fast path. And the real killer feature is that we're integrated into the Linux kernel. We are leveraging the existing ecosystem, as everybody here knows. And that's fairly strong. We also have this sandboxing by the eBPF code.
04:43
We get a lot of flexibility in that you don't have to recompile your kernel, but you can actually put in these snippets of code that do just what you need and don't add too many instructions there, but you get the flexibility of doing this.
05:02
And an equally important criterion is that we have very flexible sharing of the NIC, so we can pick and choose whether a packet travels into the network stack or we do something else with this packet. And we have the cooperation with the network stack
05:20
by these helpers, and we can do this fallback handling. And basically by running in the kernel, we also get access to kernel objects. Like we can do the lookup in the route table, which I'm going to talk about a little bit later. We even have now, in recent kernels, that you can look up from XDP to check if there's a socket
05:46
that will match this packet. Right now, we still don't allow you to directly manipulate the socket, the socket object you get back, but that's something we'll add later. You can manipulate the socket objects
06:00
in the TC hook later, where you can do the same lookup. So people are using this if they want to bypass, but don't want to bypass when the kernel handles the socket: the only thing you do in the XDP program is a lookup. If the socket is already used by the kernel, then you send it to the kernel, or else you will use the bypass facility.
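For illustration, that bypass decision ends up as a small XDP program plus a BPF map. A sketch of the AF_XDP side of the pattern, assuming an XSKMAP that user space has populated with its sockets (helper and map definitions are inlined here; they normally come from bpf_helpers.h, and on older kernels the XSKMAP cannot be queried directly from BPF):

    #include <linux/bpf.h>
    #include <linux/types.h>

    #ifndef SEC
    #define SEC(name) __attribute__((section(name), used))
    #endif

    /* Classic map definition, normally provided by bpf_helpers.h. */
    struct bpf_map_def {
        unsigned int type;
        unsigned int key_size;
        unsigned int value_size;
        unsigned int max_entries;
        unsigned int map_flags;
    };

    /* Maps an RX queue index to an AF_XDP socket registered by user space. */
    struct bpf_map_def SEC("maps") xsks_map = {
        .type        = BPF_MAP_TYPE_XSKMAP,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u32),
        .max_entries = 64,
    };

    static void *(*bpf_map_lookup_elem)(void *map, const void *key) =
        (void *)BPF_FUNC_map_lookup_elem;
    static int (*bpf_redirect_map)(void *map, __u32 key, __u64 flags) =
        (void *)BPF_FUNC_redirect_map;

    SEC("xdp")
    int xdp_sock_prog(struct xdp_md *ctx)
    {
        __u32 qid = ctx->rx_queue_index;

        /* If an AF_XDP socket is bound to this queue, deliver the raw frame
         * to it; everything else keeps flowing into the kernel stack. The
         * socket-lookup helper described above could be used here to keep
         * kernel-owned flows out of the bypass path. */
        if (bpf_map_lookup_elem(&xsks_map, &qid))
            return bpf_redirect_map(&xsks_map, qid, 0);
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";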
06:23
And that leads on to the AFXDP. That's a flexible kernel bypass, and we can deliver these raw frames in user space while we are leveraging the existing Linux drivers and the ecosystem for the maintainership of that.
06:42
So I put a red slide to shock you here, so you sort of wake up, because it's fundamental to understand that I'm seeing this as a building block that you should use. And what do I mean by that? So I mean that I see XDP as a component. It is a core facility we are providing from the kernel,
07:00
but I need you guys to pick it up and use it, actually. And you put it together to solve a specific task. And I'm, by putting in fully programmable items in there, I'm not saying what you can do and what you cannot do. This is like, go invent something I couldn't imagine.
07:20
And so it's not a product in itself, and I really want existing open source solutions to do it, and there will also be some new ones that are going to use these components. And what I really see, like XDP is really like a hot, sexy topic, because we can do all these kind of millions of packets per second, but the real potential comes when we are actually
07:41
combining it with the other BPF hooks that exist in the kernel. And I call it like we can construct these network pipelines by using these different hooks, the BPF hooks. And that's actually what the project called Cilium is doing. It's primarily for containers. And they're combining all these different components.
08:07
So now I'm going to talk about some of the use cases where XDP has already been used. And then I'm going, in each of the use cases, talk about what's the new potential and opportunities. I'm going to relate that to VMs and containers.
08:25
How can this work out? I'm speaking pretty fast, I think. So one of the most obvious use cases is anti-DDoS. That was the first thing we sort of implemented,
08:41
that we can drop packets really fast, because we haven't spent a lot of CPU cycles on it, because XDP happens down in the driver layer and we're allowed to run this, our eBPF program, which is our XDP program. So Facebook have already for, they've been the front runners in this,
09:02
and they've also contributed a lot upstream. They hired the eBPF maintainer. And actually, for one and a half years now, every packet going to Facebook has gone through XDP in some way. And Cloudflare recently switched to XDP, and they changed the NIC vendor in that process.
09:23
Before, they were running proprietary stuff with OpenOnload from Solarflare, I think. Yeah. So some of the new potential here for anti-DDoS for containers and VMs is that a container,
09:42
like a Kubernetes cluster or OpenShift cluster, you would not expose that to the internet, because that would be fairly dangerous. But you could actually load an XDP program on the host to do denial of service protection. You don't need, if you want to expose this cluster,
10:02
to have another box, a hardware box, that does this denial of service filtering, because we can handle wire-speed packets coming in at the XDP layer. We just load it on the host that the containers run on and can protect the containers.
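A hedged sketch of such a host-level filter: a hash map of blocked IPv4 source addresses, filled from user space, and an XDP program that drops matches before any SKB is allocated (map name, sizes and the inlined definitions are illustrative):

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/types.h>

    #ifndef SEC
    #define SEC(name) __attribute__((section(name), used))
    #endif
    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    #define bpf_htons(x) __builtin_bswap16(x)
    #else
    #define bpf_htons(x) (x)
    #endif

    struct bpf_map_def {
        unsigned int type, key_size, value_size, max_entries, map_flags;
    };

    /* IPv4 source addresses to drop, managed from user space. */
    struct bpf_map_def SEC("maps") blocked_src = {
        .type        = BPF_MAP_TYPE_HASH,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u64),   /* drop counter */
        .max_entries = 100000,
    };

    static void *(*bpf_map_lookup_elem)(void *map, const void *key) =
        (void *)BPF_FUNC_map_lookup_elem;

    SEC("xdp")
    int xdp_ddos_drop(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph  = (void *)(eth + 1);
        __u64 *drops;

        if ((void *)(iph + 1) > data_end)        /* bounds check for the verifier */
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        drops = bpf_map_lookup_elem(&blocked_src, &iph->saddr);
        if (drops) {
            __sync_fetch_and_add(drops, 1);
            return XDP_DROP;   /* dropped in the driver, no SKB, few CPU cycles spent */
        }
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";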
10:21
VMs are the same story. So in the host OS, you could load the XDP program to protect the guest OS, because it's fairly expensive to transfer a packet all the way into the guest just to figure out that this packet shouldn't be used, and that's an easy way to overload the system. This is work in progress, the lowest one.
10:42
Michael Tsirkin, every time I meet up with him, wants to be able to, from the guest, ask the host operating system via the driver to load an XDP program for it. There are some security concerns there, that's why we haven't allowed it yet.
11:03
But there's a really interesting possibility that you can allow a guest to ask the host OS to load a filter, a denial of service filter. We also have the use case of layer four load balancing. So this is what Facebook is using.
11:22
They used to use something called IPVS. It's a load balancer in the kernel. I am even the maintainer of the user space software for that, but I'm even recommending not using it. So what they did was they switched to XDP
11:40
and they reported a 10 times performance improvement. And not only that, they could remove some of the machines that did this load balancing because they do load balancing on the target machines themselves and shoot the packet over to the others. So there's no central point of failure.
12:00
And they even open sourced it, and it's on GitHub and called Katran, if I'm pronouncing it correctly. I think it's something to do with a fish. So the new potential here is that we could also do load balancing into VMs or containers. So for the VMs, we can, at the physical NIC layer,
12:22
use the redirect action and redirect into a guest NIC. And this way we avoid allocating the SKB in the host operating system, which only is used for sending stuff into the guest. That's actually a really big performance improvement. Not a lot of solutions are really directly using this,
12:43
but I'll talk more about how this could be materialized. It's actually in the kernel now. In the tun/tap driver, we can redirect raw frames. The performance is quite excellent. What I'm seeing in my performance tests is that now we are depending on how the guest gets scheduled.
13:05
So it's a scheduling problem now that limits how many packets per second we can throw in. Containers are a little bit more difficult to use XDP with, because containers really need this SKB structure allocated. But one funny, interesting thing you can do
13:22
is, from the physical NIC, you can redirect into a veth device, which is what the containers have. And it's a fairly recent kernel version you have to have. But it's sort of got what you call native support. So it can bypass, like, the network stack allocation.
13:42
It is the veth that allocates and builds the SKB. And there's a small performance optimization by skipping some code. But I see it more interesting that you could actually run another XDP program on the veth and redirect into another container.
14:00
By that, you could make interesting proxy solutions that works on this L3 layer that can, you can sort of install a container that only does proxy stuff for the other containers and install it as a container
14:20
instead of having to have that installed on the physical host. Those are interesting use cases. Let's see if anybody does this. This is not something that has been done today.
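A sketch of the redirect idea described above; the target ifindex is a hypothetical placeholder, and a real setup would normally keep targets in a devmap and use bpf_redirect_map() instead of a hard-coded value:

    #include <linux/bpf.h>
    #include <linux/types.h>

    #ifndef SEC
    #define SEC(name) __attribute__((section(name), used))
    #endif

    static int (*bpf_redirect)(__u32 ifindex, __u64 flags) =
        (void *)BPF_FUNC_redirect;

    /* Hypothetical ifindex of the guest's tap device or the container's veth
     * peer; in practice this would come from a map set up by the control
     * plane rather than being hard-coded. */
    #define TARGET_IFINDEX 42

    SEC("xdp")
    int xdp_fwd_to_guest(struct xdp_md *ctx)
    {
        /* Send the raw frame straight out of the target device, so the host
         * never allocates an SKB just to shuffle the packet into the guest
         * or container. */
        return bpf_redirect(TARGET_IFINDEX, 0);
    }

    char _license[] SEC("license") = "GPL";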
14:40
So the approach could be used for accelerating, so it can be used for accelerating the VPN. But what kind of VPN?
15:02
Yeah, so I would recommend using something else called kernel TLS, where we can actually... but for OpenVPN, I don't know, that's for TLS, but I don't know what kind of crypto OpenVPN does. Yeah, okay.
15:29
This, yeah, I think I have to repeat the question: if you could use it for accelerating, like, OpenVPN.
15:42
Some of it I wouldn't use XDP for, so I would use some of the other TC hooks for actually accelerating that. Actually, I think I'll move on. So also, it's actually fairly easy
16:01
to misuse XDP in the same way as you can use these bypass solutions. So instead, I want people to be sort of smart about how we can integrate XDP in existing open source solutions and leverage the existing ecosystem
16:21
for the control plane setup. But there's a trick you have to do to do that. You have to implement some BPF helpers, and BPF helpers is something you add into the kernel. And that's, like BPF is something you load your program into the kernel, and you're completely flexible there,
16:41
but once we add helpers, it becomes a stable API that you provide these helpers through. And a good example of what you can do with helpers is what I have on this slide. So the general thinking I want people to do is to see XDP as a software offloading layer
17:02
that can accelerate part of the network stack. And by doing that, you could, for example, take the route lookup, which we have already done. So we have the FIB lookup helper. We exported that as a helper function you can call from XDP. So what happens is that you will allow,
17:20
you'll do your normal route setup completely as you have it today. You'll install your routing daemons, your BGP daemons, whatever, and you will have the kernel handle all this and also handle the neighbor table, or the ARP table, lookups. And in XDP, when you get the first packet in, you do the lookup, but it's not in the ARP table,
17:41
so the lookup will fail, and you call what you call XDP pass. So you pass it on to the normal network stack, which will then figure this out and call the ARP table lookup, set up a timer for when it has to request a new ARP lookup again and stuff like that. At the next packet we get in, it has done the ARP resolution.
18:01
It'll do the lookup and get back the next hop, and also the MAC address and the exit point, the ifindex exit point, and from XDP we can then shoot out the packet directly from this level. So that's the way of accelerating
18:23
the existing network stack's routing, like IP routing; it works for both IPv4 and IPv6 as of now. We're going to add some more stuff for MPLS, but right now it's IPv4 and IPv6 you can do this with, and do routing.
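A condensed sketch of that pattern, close in spirit to the kernel's xdp_fwd sample: on success the FIB lookup helper returns the next hop's MAC addresses and the egress ifindex, and on any failure the packet is passed to the stack so it can do ARP resolution and error handling (IPv4 only here; inlined definitions normally come from bpf_helpers.h):

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/types.h>

    #ifndef SEC
    #define SEC(name) __attribute__((section(name), used))
    #endif
    #ifndef AF_INET
    #define AF_INET 2
    #endif
    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    #define bpf_htons(x) __builtin_bswap16(x)
    #else
    #define bpf_htons(x) (x)
    #endif

    static long (*bpf_fib_lookup)(void *ctx, struct bpf_fib_lookup *params,
                                  int plen, __u32 flags) =
        (void *)BPF_FUNC_fib_lookup;
    static int (*bpf_redirect)(__u32 ifindex, __u64 flags) =
        (void *)BPF_FUNC_redirect;

    /* Incremental checksum update for the TTL decrement. */
    static inline __attribute__((always_inline))
    void ip_decrease_ttl(struct iphdr *iph)
    {
        __u32 check = iph->check;

        check += bpf_htons(0x0100);
        iph->check = (__u16)(check + (check >= 0xFFFF));
        iph->ttl--;
    }

    SEC("xdp")
    int xdp_router(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph  = (void *)(eth + 1);
        struct bpf_fib_lookup fib = {};

        if ((void *)(iph + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP) || iph->ttl <= 1)
            return XDP_PASS;

        fib.family   = AF_INET;
        fib.ipv4_src = iph->saddr;
        fib.ipv4_dst = iph->daddr;
        fib.ifindex  = ctx->ingress_ifindex;

        /* Ask the kernel's FIB and neighbour tables for the next hop. */
        if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) != BPF_FIB_LKUP_RET_SUCCESS)
            return XDP_PASS;   /* let the stack handle ARP, ICMP errors, etc. */

        ip_decrease_ttl(iph);
        __builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
        __builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
        return bpf_redirect(fib.ifindex, 0);   /* transmit on the egress device */
    }

    char _license[] SEC("license") = "GPL";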
18:41
So it depends, it's a very, very simple program you have to load in the XDP hook. The next obvious target is doing bridge lookup. So you can have a helper that looks up in the bridge FDB, the forwarding database, which is its lookup table,
19:04
and then you can accelerate bridge forwarding, and that goes into my other point of how do we accelerate into VMs, for example, if your VMs are set up and in a bridge that you have your VMs connected to the bridge, which is usually the normal Linux bridge, which is not very fast actually,
19:23
and that way we could just accelerate that directly from XDP, without much other user space code around it, without having to code a program that does this. So when people start to play with this, I also want to mention how you actually transfer
19:42
information between XDP and the network stack. So one trick is you can modify the headers before you call the network stack. So even though we call XDP pass, you can modify the headers and push or pop headers, and that way influence which receive handler
20:02
the network stack will use. That means you can, in principle, have the kernel handle a protocol encapsulation that the kernel doesn't know about. You still have to do some work on the transmit side also, of course. Another trick that Cloudflare uses
20:21
is they take the source MAC address, because that's not used anymore in the lookups, and they modify that with a special value, and they use it to sample dropped packets. They want to sample some of the dropped packets that the denial of service system is dropping, so they modify that, and later they catch it
20:41
with an iptables rule to log that. Then we have something called metadata, which is placed in front of the payload, so XDP can write just in front of the payload. It can also extend the headers, of course, but if you don't want the network stack to see this for some reason,
21:02
and it's just some metadata you created at XDP level, you can use these 32 bytes. The other hooks, like the TC eBPF hook, can use this and update the fields in the SKB, but you can also save other information,
21:21
and the other thing is that for AFXDP, the raw frames we deliver into user space, it will also be put in front of the payload, so you can get this information there. We have a lot of interesting ideas of getting the hardware to actually put in this metadata, for example, getting a unique ID for every flow. That's something that the hardware can provide.
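A sketch of how an XDP program uses that metadata area via the bpf_xdp_adjust_meta() helper (the struct and the value stored are illustrative; whether the area is available depends on the driver):

    #include <linux/bpf.h>
    #include <linux/types.h>

    #ifndef SEC
    #define SEC(name) __attribute__((section(name), used))
    #endif

    static int (*bpf_xdp_adjust_meta)(struct xdp_md *ctx, int delta) =
        (void *)BPF_FUNC_xdp_adjust_meta;

    /* Hypothetical metadata layout shared with a TC program or an AF_XDP
     * consumer; it must fit in the space in front of the payload. */
    struct my_meta {
        __u32 mark;
    };

    SEC("xdp")
    int xdp_set_meta(struct xdp_md *ctx)
    {
        struct my_meta *meta;

        /* Grow the metadata area in front of the payload. */
        if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
            return XDP_PASS;   /* the driver may not support metadata */

        meta = (void *)(long)ctx->data_meta;
        if ((void *)(meta + 1) > (void *)(long)ctx->data)
            return XDP_PASS;   /* bounds check required by the verifier */

        meta->mark = 42;       /* example value; a TC program could copy it into skb->mark */
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";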
21:43
A very interesting thing is OVS, the Open vSwitch. So, William from VMware actually implemented three different ways of integrating with BPF, and he did a presentation at Plumbers.
22:02
So he actually did a full re-implementation of OVS in eBPF, which I thought was a little bit problematic, but he basically re-implemented the whole thing, and he had to have several tail calls to put in all this code
22:23
to handle all the different kind of cases, so that was basically putting too much code in the BPF step, which I think was a mistake in itself, because you should be a little bit more smart and use the second solution to offload a subset to XDP, don't have the corner cases there,
22:41
but fall back for the corner cases. He didn't succeed with that, because he was missing some of the helpers I talked about before; he should have argued that he wanted to add some helpers to do a lookup in the OVS kernel table. And what he also did was actually implement the AFXDP integration with OVS
23:02
that showed huge performance gains, and I think that's what they're going with now. And we're going to hear a lot about AFXDP in just a minute. I'll hand over. Thank you. You gave me a couple of minutes, at least.
23:26
Okay, so I was actually up here a year ago together with Bjorn, presenting AFXDP for the first time. At that point in time, it was an RFC of dubious quality. You could probably use it to scare children with it, but a lot of things have happened since then.
23:40
It actually got into the kernel in 4.18 in August, and the two first zero-copy driver support stuff got into 4.20 in December. So tons of stuff have happened. So what's this talk going to be about, this part of the talk? It's going to be about three things. I'm going to show you where we are, performance-wise. I'm going to show you some of the use cases
24:02
and tie them into what Jesper talked about, and then also tell you what we're going to focus on for this year and try to get into the kernel. It's not going to be about how AFXDP works. If you want to know that, you should have attended last year. I'm sorry. No, you can actually go back to the Linux Plumbers Conference paper
24:20
that was published in November and read it there, or just talk to Bjorn or me, or you can talk to Jesper, too. He knows how it works. There's other people in the room. Ilyas, you know how it works, too. So just talk to people, and we can explain it. Okay, so I'm just going to go through a little bit of the basics. I mean, Jesper already covered some of this, but really what AFXDP is,
24:42
it's the way of getting packets from XDP out to user space very, very quickly, completely unmodified. It's whatever XDP does with the packet, that's what you see in user space. And actually, it is an option. I mean, you know when you write XDP program, you can either direct your packet into the kernel stack with XDP pass,
25:02
you can redirect it to another NIC driver, get it out there, and we added an option to be able to redirect it into user space. So you can actually target, you can tell exactly what socket you want to redirect. So you can actually make a load balancer in XDP to load balance packets across AFXDP sockets.
25:21
So, really nice. But really, what we're going to target here is performance. So where are we now? And I'm going to start by showing you where we are with the code that's in kernel.org at the moment. So, 4.20.
25:40
And the methodology here is that we just have a regular Broadwell server, 2.7 gigahertz. We use the Linux kernel 4.20. We have all the Spectre and Meltdown mitigations on. So all of it is on. So we have not turned that off. We use two NICs, two Intel i40e NICs, 40 gig NICs,
26:03
because we actually need two to show you the numbers, which is good. And we're going to use two AFXDP sockets per NIC. So it's going to be four queues that we want to use in these benchmarks. I'm going to have an XLO generator, like a commercial load generator, just blasting at these NICs at 40 gigabits per second per NIC.
26:22
So just full blast. And we're going to start with just showing where we are with the current code base. So these are the zero-copy drivers you find in 4.20. They were not optimized for performance. They were just, we're just so happy about it then. You know, it's okay, just get it to work and just get it in there. But I'm going to show you how we compare against AF packet.
26:42
Because today, I mean, if you use Linux and you want raw packets in user space, you use AF packet. And AF packet is the purple stuff on the bottom there, which you barely see. And the green stuff is AFXDP in zero-copy mode. In case people are not aware, AF packet is tcpdump, yes, or Wireshark. And there's libpcap, you know, on which lots of other applications run. But that's the usual case. And this, the green one, is AFXDP in zero-copy mode.
27:02
or you write lots of other applications running. But that's the usual case. And this is, the green one is AFXDP in zero-copy mode. On the Y axis, you have megabits per second, 64-bit packets. And you have three different applications here on the bottom line. You have packets per second, millions packets per second.
27:20
Millions of packets per second, sorry. And on the bottom there, you have three microbenchmarks. And they're really simple. So the first one is RX drop, to the left. It just tries to receive packets as fast as possible. Doesn't touch the data. Just receive it, drop it. I mean, anything that you do with RX
27:41
is going to be slower than this. I mean, because you're not doing anything, you're just receiving it. And TXPush is the same thing, but for TX. You use pre-generated packets, and just try to send them as quickly as possible. You don't touch the data there, because it's pre-generated. So anything with TX will be slower than that. That's like the fastest you can go. And then we have another toy application
28:01
that actually touches the data. It's just an L2 forwarder. It, you know, receives the packet, does a MAC swap, so it touches the header, sends it out on the interface again. So the first two ones do not touch the data. The last one touches the data. And as you can see here, we're somewhere between 15 and 25 times as fast as AF packet.
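For reference, the per-frame work of that L2 forwarder is essentially just this (a sketch of the MAC swap on a raw frame, not the exact benchmark code):

    #include <stdint.h>
    #include <string.h>

    /* 'frame' points at the start of a raw Ethernet frame, e.g. in the
     * AF_XDP umem buffer that a received descriptor refers to. */
    static void mac_swap(uint8_t *frame)
    {
        uint8_t tmp[6];

        memcpy(tmp, frame, 6);           /* save destination MAC  */
        memcpy(frame, frame + 6, 6);     /* destination <- source */
        memcpy(frame + 6, tmp, 6);       /* source <- saved dest. */
    }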
28:24
So you can actually do packet sniffing on a 40 gig. I mean, not if you send 64-byte packets, obviously, but if you have larger packets, you can actually do it now with AFXDP zero-copy. Yes? Is it with one core?
28:41
Good question, yeah. No, actually, we'll get to that. It's actually two cores, but we'll get to that. That's, it's like two slides away. We'll get from two cores to one core. I'll show you. So it's basically one core, because the application was not doing anything. But it is two cores that are used. Here it's just software running. But we'll get to that, two slides.
29:01
So what we did then during the fall was to say, let's actually scratch our heads and try to do some performance optimizations on this code and see what we can get. So that's what we did and presented at the Linux Plumbers Conference. And now the previous results you have are the purple, bluish stuff on the left there.
29:21
And the green bars are the patches we have in-house and are now trying to upstream, performance-wise. And as you can see with the green ones here, we can get an increase of 1.5, so a 150% increase in performance, with the performance optimizations that we have now. And some of them are really simple. And we've already upstreamed some of them.
29:41
Others are more complicated. But this is what you can get now. So we're talking here like receiving 39 million packets a second for one application core and one softirq core. And TX we can get all the way to 68 million packets a second, which is pretty impressive, I think.
30:02
And then when you start to touch the data here, of course it drops, because you have to bring headers and stuff into your cache. And we're down to 22. But still, the improvement here is from 85-90% to 150% compared to what we have now in the 4.20 kernel. So now I think if you look at the green stuff here,
30:21
it's starting to look pretty good. But we'll get to the question you had there. So how are we actually running this? And there's a link down here to the paper. And the paper will tell you all about the performance optimizations we did to get to this. But I'm not going to go through that today. Please click on the link, download the paper,
30:40
and take a look. So how do we actually run this? And that gets to your question. So currently, with the benchmarks I showed you, we run this in what we call run-to-completion mode. So we have two cores. One core is just doing the softirq. So you're receiving the packets in NAPI mode. And the other core is the application.
31:02
In these puny little microbenchmarks, the application core is basically not going to do anything. It's going to receive packets and drop them or just send them. So it's just going to be very lightly loaded. And the softirq core is going to be 100% loaded. So that's the bottleneck. But really, this is not how you would like to run it. Because you're going to waste the whole application
31:21
core spinning. Because it's just busy polling. It's at 100%. And it seems that most people don't want to do that. They want to do something where you actually can sleep. You call a syscall, go into the kernel. If you're not doing anything, you can schedule in something else or power save or whatever you want to do. So then we can actually do that. So what we have to the right there is what we call the poll syscall.
31:41
So if you just take the file descriptor of the socket and provide it to the poll syscall and call it, there's something called busy poll. It's really confusing. It's called the busy poll mode of poll. I think I'm just going to call it the poll syscall. What it does is, when you call poll, the code is going to itself drive the NAPI
32:01
context in the driver. So the application calls poll, you go down, it starts to run the driver in the same context, get some packets, go up again to the application, the application reads them. So what you get there is that you get more like a DPDK way of doing it. It's just one core driving the whole thing. The one core where you have the application,
32:21
you have the RX/TX poll, the NAPI, and so on. So this is only a single core. The difference here, of course, between this and DPDK is that with DPDK, all of this will run in user space. You have no mode switches between user space and kernel there. We have to pay for that, of course, and the syscall. But you're not going to call poll for every single packet.
32:41
You call it for a batch of packets. And in the examples we have, it's like 64. But you can tune it. This support, the basic poll support, is actually in the kernel already. But this support where you can execute all the way to NAPI, it's not in the 4.20 kernel. So I'm working on that.
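A sketch of what the poll-driven mode looks like from the application's side; xsk_fd and process_batch() are placeholders for the real AF_XDP socket descriptor and ring-processing code:

    #include <poll.h>

    /* xsk_fd is the file descriptor of a bound AF_XDP socket and
     * process_batch() stands in for the application code that drains up to a
     * batch of RX descriptors; both are assumptions of this sketch. */
    static void rx_loop(int xsk_fd, void (*process_batch)(int fd))
    {
        struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };

        for (;;) {
            /* Sleep until descriptors are ready. With the busy-poll support
             * described here, poll() itself drives the driver's NAPI context
             * on this same core instead of relying on a separate softirq
             * core; without it, poll() is just a normal wakeup. */
            if (poll(&pfd, 1, -1) <= 0)
                continue;
            process_batch(xsk_fd);
        }
    }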
33:00
Then I'll send some RFC out in a few weeks. But it seems like people, that's what people want to use. So how does that look performance-wise? Of course it's going to drop. We're using just one single core. We have to do mode switches. We have to do sys calls. So it's going to drop compared to the other mode.
33:21
And what you can see here, it's like for the RX drop, it drops from 39 down to 30, from 68 to 51 for TX, and a drop from 22 to 16. So we have a drop of 20-30% compared to that. But we're only using one core now instead of two cores. So if you look at it on a per-core basis,
33:41
we're actually performing better. Because you can now use two cores to process it. And of course, that will be doubled. And it will be 60 million packets a second for RX drop, and 102 million packets a second for TX push, and so on. So still, if you look at it like that, this actually performs better.
34:02
But the key question is now, how do we compare to DPDK? Because DPDK is the benchmark of how fast you can go. So if we compare here, now we have four graphs on each single benchmark. So furthest to the left, you have the run-to-completion mode in AFXDP where we use two cores.
34:22
And the green one is the poll syscall. We use a single core and a syscall to go into the kernel. And then you have DPDK with a scalar driver, so the same kind of driver that we use in Linux, with scalar code. And then, furthest to the right, the yellow one or orange one is going to be DPDK with the vectorized drivers.
34:41
And we don't have any vectorized drivers in the kernel, at least not yet. If we just ignore the vectorized driver first and just look at the other one, we can see that we're still lagging somewhat behind for the RX path. It's maybe a 40% drop if I compare the poll syscall
35:00
with DPDK. Because DPDK here is only using a single core. It's not using two cores. So it should be compared to the poll syscall. On TX, it looks better. I mean, run-to-completion mode, actually, it's faster than the DPDK scalar driver, again. But if you look at the poll syscall, which uses as many cores as DPDK, it's still a drop there.
35:23
We're within maybe 20%, or at 80% of DPDK speed there. And it's a similar result if you go to L2 forward. It's a 10-12% drop there, or 20% drop for that. But really, also consider that, yes, the vectorized driver,
35:41
at least in these very simple ones, it doesn't pay off that much for the TX push or the L2 forward. But it does have a significant performance improvement on the RX side. So I don't know if it's more efficient on the RX side. Bruce can comment on that. There it's really, I mean, it's like a 30% or more performance gain.
36:01
So you can argue, should you start to implement these things in a Linux driver? I don't know. It's complicated to write them. It's hard to maintain. But if they give a significant performance improvement, maybe it's worth it. I think it's a good question. But now, at least, I think for the TX side here, it's starting to look good. This is within the bounds of what I think.
36:20
OK, yeah, this is reasonable. We're never ever going to be as fast as DPDK, because we're doing syscalls. We have user space versus kernel space and those things. But it's getting there now, I think. But RX side, we need to do some stuff. And Jesper has some ideas. I mean, you and Bjorn are looking into some performance improvements here.
36:40
So we have other ideas to introduce, which are not in this. So it's still possible to improve this. OK, so let's look at some examples of use cases for XDP. So the first one, the obvious one, is to write an AFXDP PMD for DPDK, because AFXDP
37:01
is about getting raw packets out to user space quickly. And if you have something up there, for example a user space stack, you're probably going to use DPDK. So this is the most obvious one. And we actually have an RFC out on the DPDK mailing list for an AFXDP PMD driver. And it has about 1% overhead compared
37:21
to just running AFXDP, which I think is good. It's actually less than 1% overhead. But the advantage here is that what you really can do here is like, which is nice, if you have a stream of packets coming into your system and some of it needs to go to the kernel stack and some of it should go to user space to some processing, XDP is a great way of solving that,
37:42
because it can divert already in the driver some traffic to the kernel stack and some traffic up to DPDK in the user space stack. It also creates a hardware independent application binding. If you only have the PMD of AFXDP, it's going to work on all drivers supporting AFXDP. If you have a user space driver, DPDK driver,
38:03
of course it's not going to work on the next generation of anything, because you don't have a driver in there. So that's also good. It also provides isolation and robustness. You're not sharing any memory. You can restart it. You can use all the security features of Linux, because it doesn't rely on physically contiguous memory and stuff like that. AFXDP is just another socket.
38:23
And we think it's going to be a good solution in the cloud native space, too. Because AFXDP is just a socket, a normal OS abstraction. It works really well with processes or containers, because they're the same thing. And you get fewer setup restrictions. So we think there's a good, strong use case for having
38:43
an AFXDP PMD in DPDK. The next one is VPP, which is a very popular stack from Cisco in the FD.io project. Yes, you could just take, well, VPP supports DPDK. So you could take this AFXDP PMD and just run VPP.
39:02
But actually, you could also just write a native AFXDP driver in VPP, because there are native drivers in VPP; there's an AF packet driver in VPP, for example. So you could do that, because VPP doesn't use that much of DPDK, basically just the driver. And the rest it just implements itself.
39:23
So that would be more efficient. I don't know how much more efficient it would be, probably not that much. But it would be a lot simpler, much less code and easier setup, if you just used AFXDP right away there. But nobody's tried this. Is anyone working with VPP? It should be easy to hack this, so please do it.
39:42
It will be fun to see. Are we missing anything? There's some functionality we need that we don't have in AFXDP. And the last one, AFXDP integration with Snabb Switch. This is an idea. Is Luke here? No? So he can't comment on it.
40:02
So you can actually use AFXDP maybe in Snabb. The question here is, what kind of functionality is missing? And the nice thing about it, if you do something in Snabb on AFXDP, it's going to work on all drivers supporting AFXDP, instead of like now, where you have to write them for every single driver. So this becomes a hardware abstraction
40:21
interface in this case. But there are some things that Snabb uses that we don't have at the moment. So the question is, what do we have to add to AFXDP to support Snabb? But I don't think Luke can answer that. Maybe he can do that later. So some ongoing work.
40:40
So what are we working on right now? Of course, we're upstreaming these performance optimizations one by one. Some of the simpler ones have already been upstreamed. And the more complicated ones we're working on now, like the poll syscall support there. And something that Bjorn and Jesper work on is XDP programs per queue.
41:00
So now when you install an XDP program, it's per netdev. So it covers all queues. But what they're working on is actually so you can install an XDP program on a single queue. And that's going to be a big performance boost on the RX path for us. It's also going to make it a lot simpler to do things. But I don't know how long you think it's going to take. But it's not trivial, I guess, to do that.
41:23
But yeah. Another thing we noticed when we got these things out, it was too hard to use AFXDP because it required you to write an XDP program, compile that with a Clang compiler, load it into the kernel, lots of stuff you had to do. It was just a big headache to get going.
41:41
I mean, sockets should be easy. That's what you want. I mean, you just want to do socket, you bind it, off you go. That's it. So what I did now, and there's a patch, it's a V3 patch up on the kernel mailing list. And it's going to be V4. It might be something else. But I included helpers in libbpf that make it really, really simple to set up these sockets.
42:03
So basically, you only call two things. You call a create umem, which is the packet buffer area, and create socket, and off you go. That's it. And there are helper functions for everything. What people used to do, they just used our sample, which was a pretty rough sample. And they just cut and pasted from that into their programs.
42:21
But now we have this library instead, libbpf, where you can use good, optimized functions for these things. And this makes the sample program so much cleaner and nicer. Now it's only the application. You also added the BPF program itself, right? Yes. So you don't have to go out and compile that and start the Clang compiler. You only need GCC.
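To give a feel for those helpers, here is a rough sketch of the intended usage; the header name, function names and default-config behaviour follow the libbpf patches being discussed and may differ in the version that finally landed:

    #include <stdlib.h>
    #include <unistd.h>
    #include <bpf/xsk.h>   /* AF_XDP socket helpers from the libbpf patches */

    #define NUM_FRAMES 4096
    #define FRAME_SIZE 2048

    static struct xsk_ring_prod fill_q, tx_q;
    static struct xsk_ring_cons comp_q, rx_q;

    /* Returns the socket's fd on success, -1 on failure. */
    static int setup_af_xdp(const char *ifname, unsigned int queue_id)
    {
        struct xsk_umem *umem;
        struct xsk_socket *xsk;
        void *bufs;

        /* The "umem": a packet buffer area shared with the kernel. */
        if (posix_memalign(&bufs, getpagesize(),
                           (size_t)NUM_FRAMES * FRAME_SIZE))
            return -1;

        /* Create the umem plus its fill and completion rings (defaults). */
        if (xsk_umem__create(&umem, bufs, (size_t)NUM_FRAMES * FRAME_SIZE,
                             &fill_q, &comp_q, NULL))
            return -1;

        /* Create the socket and bind it to one queue of the interface; the
         * helper also loads the small default XDP redirect program. */
        if (xsk_socket__create(&xsk, ifname, queue_id, umem,
                               &rx_q, &tx_q, NULL))
            return -1;

        return xsk_socket__fd(xsk);   /* ready for poll(), sendto(), ring access */
    }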
42:40
Then there's a small array of BPF instructions that gets loaded for you. So it loads everything under the hood, so you don't have to care about it. Of course, you can still load your own XDP program, if you want to. Of course, we're not going to hinder that. But this facilitates that option. And another thing, when I showed the AF packet performance numbers: AFXDP and AF packet do not have
43:02
the same functionality level. AF packet has more functionality. And one thing we're missing from AFXDP is this packet clone. So what happens when you do AF packet is that the original packet goes to the kernel, and you get a copy of the packet to user space. So we don't have that functionality in
43:20
XDP at this point. So what we'd like is to add this into XDP. So you can say... oh, time's up? It's going to show me pretty soon. I thought it was going to... oh, five minutes left. Great. So we're going to add that to XDP. So you can actually clone a packet and then send it up. And then this could actually be something you could use with libpcap, so with Wireshark and tcpdump, which is
43:42
nice, because then suddenly you can sniff a 40 gigabit per second interface with nearly a single core, not really a single core, but two cores you could do it. And there's also something we want to start with with other people is adding metadata support to AFXDP.
44:01
That's also something you need, so you can put out your time stamp and things that AF packet has. OK, to summarize, XDP, Express Data Path, is the new Linux kernel fast path. And AFXDP is just getting packets to user space from XDP, unmodified after that. And we're trying to hit DPDK-like speeds, never going
44:23
to be as fast, but if it's 80% of that, great. And really it's a building block for solutions, both these things. It's not a ready solution in itself. You have to build stuff on top of it or inside it. And there are many interesting upcoming use cases like, yes, we talked about OVS and bridges
44:41
and stuff like that. And notice if you have OVS in XDP, AFXDP becomes a conduit after that, which is great. You get the traffic after OVS or after the bridge or after processing your IP or something. So it's exciting. You can get more cooked traffic with AFXDP with the help of XDP.
45:01
So come join the fun. Questions? You talked about integration in DPDK and VPP, and what about integration in some languages? In terms of? In languages, like Rust. Oh, yeah, yeah, so. Yeah, I saw that somebody added Go language support
45:22
for AFXDP, and I said, I haven't seen Rust. We're not doing anything there, but it seems that people are starting to add support. So that was a repeat of the question. So language support for AFXDP. So we saw somebody from Google adding AFXDP support to the Go language, but just talked about Rust.
45:41
Yeah, I haven't seen that, but hopefully somebody's working on it. Yes? Yes? Have you considered any other model like
46:02
based in user space? Yeah, we have. No, it was quite a while ago. Yes, so the question was, poll, is it an overhead? Yes, because it has a syscall. But it's a simple model.
46:20
People know how it works. So that's why we started with that. And we want the AFXDP sockets to look like sockets and feel like sockets, because it's simple. But you're right. I mean, you could have something like a doorbell functionality, especially if you have hardware underneath that understands the doorbells. And that would be more efficient. But we have thought about that direction, but there's nothing concrete.
47:09
Any other questions? Yes? Yeah, so in AFXDP there's three different modes you can
47:22
run in. You can run in what's called SKB mode, which works on any NIC. Any NIC will work, or any virtual NIC will work, too. We call it generic? Yeah, generic XDP. So it works by actually allocating the SKB, and then taking the SKB and converting the SKB into
47:41
something that is contiguous memory. So it works. But of course you lose performance. And then you have the... Yeah, but the performance of that mode is still three times the performance of AF packet. But then you have what we call the XDP driver copy mode, which is if you have XDP support added to your driver,
48:01
like a lot of people here have, you can run that mode. And it speeds up RX to about 10 to 12 million packets a second on our hardware. But TX doesn't have any of that support. And then the third one is zero-copy support. You have to add that into the driver. And we're trying to make that simpler. I don't know, was it like 700 lines of code?
48:21
800 lines of code? Let's say 1,000 lines of code for our driver. So it's not something you just do in a day, unfortunately. The important part is that the user space part looks the same. So your program, that's abstracted away from you. That's the important part. So I have to choose my NIC?
48:41
Well, yes. Today you have to choose Intel, actually, to get zero-copy supported. But you can tell that. Or Mellanox. Mellanox is actually... Mellanox, too? Yes. Oh, yeah. Yeah, yeah, and for the XDP driver support, you can pick a lot of vendors, like Netronome and, you know, a lot of them. Netronome also works, right? Yeah. But for the zero-copy support, yes. The zero-copy support is only in the Intel drivers. Yeah, I want more people to support it.
49:02
So I'm just, you know, hope it's not only us. Yeah, actually, yeah, you have some implementation on ARM. So for zero-copy. So there's another question? Oh, you should have that.
49:21
There was a person in the audience that actually tried to implement it and got rejected on the mailing list. So he actually tried to implement it, and it got rejected upstream on the grounds that, right now, XDP is like the raw Ethernet frame. And all of a sudden, with Wi-Fi, it could start in different ways.
49:43
So it is sort of possible, but then we would have to introduce, like, different hooks in the Wi-Fi to determine what kind of type of packet is this coming in. And we would have to have some if statements in there. And we are counting nanoseconds here, so we didn't want to introduce anything
50:00
that slowed down our performance for supporting Wi-Fi. So I think if we want to support Wi-Fi, it would be in another XDP mode. So it would be called Wi-Fi XDP or something. That's all we have time for. Thank you very much, Jesper. Thank you.