XDP and page_pool allocator
Formal Metadata

Title: XDP and page_pool allocator
Number of Parts: 490
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/47047 (DOI)
Language: English
Transcript: English (auto-generated)
00:05
Hi, we're going to present XDP and the page_pool allocator, and how you can easily add XDP to a driver or convert an existing Linux driver to use XDP by using an internal API. The reason we decided to give this talk: I'm Ilias, I'm the technical lead for Linaro's edge and fog networking department
00:23
and I'm serving as a co-maintainer of the page_pool API at the moment; I've added XDP support to a kernel driver. Lorenzo is a software engineer at Red Hat; he maintains a wireless driver, mt76, and he's added XDP support on the ESPRESSObin board, which uses the Marvell mvneta driver.
00:43
So, does anyone know what XDP is? All right, good, let's go a bit faster on this one. It's a software offload path for the kernel. At the driver level we add some hooks on the RX path, and by using the
01:01
page_pool API for the memory allocation we don't have to keep reallocating and freeing memory when we process packets; we just have to sync the buffers in the correct DMA direction for the CPU and the network interface to pick them up. XDP was initially designed to operate on layer 2 and layer 3, while the Linux kernel operates on layer 2 up to, say,
01:20
layer 7 but is mostly optimized around layer 4. There are two reasons we get better performance with XDP. The first is that in most cases we manage to recycle the memory we're using, and we skip all the kernel paths we don't really want, like iptables, the TC hook or
01:43
the route lookups and so on. It's important to keep in mind that this is not a kernel bypass. One of its functionalities is a bypass, where you can dump packets directly to user space, but overall it's an internal fast path, and we'll elaborate on this a bit more. It uses
02:02
existing kernel APIs and existing kernel functionality, and you can program the number and type of packets you want to process through BPF. It's currently being used by Facebook and Cloudflare on load balancers.
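To give a feel for what such a BPF program looks like, here is a minimal sketch (ours, not shown in the talk) of an XDP program that drops IPv4 UDP packets and passes everything else, assuming a clang/libbpf-style build:

    /* Minimal illustrative XDP program: drop IPv4 UDP, pass the rest. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/in.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int xdp_drop_udp(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        /* The verifier demands explicit bounds checks before any access. */
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
            return XDP_PASS;

        return iph->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";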
02:24
The XDP verdict you see on the driver side is actually the BPF program doing the decision-making. If it's XDP_PASS, which is one of the actions you have in XDP, you send the packet back to the Linux networking stack. If it's XDP_TX, what you do is send the packet back out of
02:40
the interface it came from, after changing the header, the source and destination MAC addresses, the IPs or anything else you want to change in the packet. XDP_REDIRECT currently sends the packet over to user space, to a remote CPU or to anything else you decide on, and there's ndo_xdp_xmit, with which you can pick up the packet the moment it arrives on your network interface
03:00
and offload it to a different network card without having to go through the kernel network stack. Now, the reason we created the page_pool API is that the memory model for the whole approach is a bit peculiar. We require packets to be in contiguous physical memory, and this is not a requirement from us; it comes from BPF direct packet access, which validates the packet
03:23
for correctness, and you can't have one packet split across multiple physical pages at the moment. So you're limited to non-jumbo frames, and by that we don't mean 1,500-byte packets; it's just that
03:40
anything below a page size can be accommodated in an XDP frame. The problem with that is that you cannot allocate the memory with whatever we already have in the kernel, like napi_alloc_frag, which allocates fragments for your data and is faster because we cache things in
04:00
there. You really have to allocate the page, and you have to account for the headroom and the tailroom we need for BPF and for whatever you need on the SKB.
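As a rough sketch of that accounting, assuming one packet per page (the macro names here are ours, but XDP_PACKET_HEADROOM and skb_shared_info are the real quantities involved):

    /* Per-page buffer budget for one packet per page:
     * - XDP_PACKET_HEADROOM (256 bytes) in front, so a BPF program can
     *   grow headers with bpf_xdp_adjust_head();
     * - tailroom for the skb_shared_info needed if the frame later
     *   becomes an SKB. */
    #include <linux/bpf.h>      /* XDP_PACKET_HEADROOM */
    #include <linux/skbuff.h>   /* SKB_DATA_ALIGN, struct skb_shared_info */

    #define RX_HEADROOM    XDP_PACKET_HEADROOM
    #define RX_TAILROOM    SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
    #define RX_MAX_FRAME   (PAGE_SIZE - RX_HEADROOM - RX_TAILROOM)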
04:24
Now, as we discussed, the buffers must be recycled in order to get the speed. The page_pool allocator we have is optimized for one packet per page; there are use cases of people splitting the page and fitting multiple packets into it, but then you can't recycle with the page_pool recycling functions, you have to recycle on your own in that case. Native packet recycling we mostly do in the NAPI context, so it's really fast because you don't have any extra locking
04:41
overhead; you're already protected by the NAPI context. The API also offers DMA management capabilities, meaning it can map your buffers and sync them correctly, and there are improvements from Lorenzo that speed this up even more.
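A hedged sketch of how a driver might set such a pool up, with the API doing the DMA mapping (PP_FLAG_DMA_MAP) and the device-side sync (PP_FLAG_DMA_SYNC_DEV, the improvement mentioned above); the pool size and offsets here are illustrative, with field names as in <net/page_pool.h> of that era:

    #include <linux/dma-mapping.h>
    #include <linux/bpf.h>          /* XDP_PACKET_HEADROOM */
    #include <net/page_pool.h>

    /* One pool per RX queue; the pool maps pages once and syncs the
     * packet area for the device whenever a buffer is recycled. */
    static struct page_pool *rxq_create_pool(struct device *dev, int ring_size)
    {
        struct page_pool_params pp_params = {
            .flags     = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
            .order     = 0,                    /* one page per packet */
            .pool_size = ring_size,
            .nid       = NUMA_NO_NODE,
            .dev       = dev,
            .dma_dir   = DMA_BIDIRECTIONAL,    /* needed once XDP_TX is used */
            .offset    = XDP_PACKET_HEADROOM,  /* packet data starts here */
            .max_len   = PAGE_SIZE - XDP_PACKET_HEADROOM,
        };

        return page_pool_create(&pp_params);   /* ERR_PTR() on failure */
    }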
05:01
Now, this is not all perfect: if you switch from the napi_alloc_skb the kernel is doing to XDP, your normal network stack in the kernel is going to slow down, because allocating a page is substantially slower than allocating fragments. But if you use it for XDP, then you get all the native performance improvements we have with recycling packets. The memory footprint is bigger because,
05:22
instead of allocating just the size you want for the packet, you allocate a whole page and fit the packet wherever you want in that page. We do have some out-of-tree patches to get some of that performance back; with those patches we manage to recycle buffers even if they
05:43
end up in the normal network stack, so if it's an SKB we eventually recycle that buffer as well. So, Ilias has gone through the XDP requirements and some more general information about XDP, and I will give some more details
06:05
about how to implement XDP in an Ethernet driver. I used the mvneta Marvell one-gigabit driver as a reference since, for example, the Intel or Mellanox implementations are much more
06:21
complex. We need to take into account that, in order to be accepted into the Linux kernel, our driver needs to implement all the possible XDP verdicts, which are XDP_DROP, XDP_TX, XDP_PASS and XDP_REDIRECT. Here I reported the
06:43
hardware specifications of the Marvell ESPRESSObin, which is the development board I used to add XDP support to the mvneta driver. We can see that the Marvell ESPRESSObin runs a Cortex-A53 and, for networking, we
07:01
have two Gigabit Ethernet LAN ports and one WAN port, all of them connected together through an Ethernet DSA hardware switch. This diagram outlines the lifecycle of a buffer using the page_pool allocator, and
07:23
we can see that the page_pool allocator is usually created when opening an interface, since it is actually associated with a given RX queue in order to avoid locking penalties. From here we can see that it's
07:43
possible to rely on the page_pool API for DMA mapping and DMA syncing, using the dedicated flags. What is important to notice in this slide is that when the NAPI poll runs, it actually runs
08:00
an eBPF program that is attached to our network interface, and this eBPF program will return an XDP verdict, let's say a result; the buffer will then be recycled according to this result. The page_pool
08:21
allocator has two caches: an in-interrupt cache that is used when the driver is running in interrupt context and we have a single reference to the buffer, and a ptr_ring cache that is used otherwise, again when we have a single reference to the buffer. Whenever our driver needs to
08:43
refill the DMA engine with new buffers, for example in this case in mvneta_rx_refill, we can access these caches instead of going through the slower page allocator.
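A hedged sketch of such a refill path; the descriptor layout is invented, but page_pool_dev_alloc_pages() and page_pool_get_dma_addr() are the actual helpers:

    #include <linux/bpf.h>          /* XDP_PACKET_HEADROOM */
    #include <linux/mm.h>           /* page_address() */
    #include <net/page_pool.h>

    struct rx_desc {                /* invented stand-in for the HW format */
        dma_addr_t buf_dma;
        void      *buf_virt;
    };

    /* Pull a page from the pool's caches (falling back to the page
     * allocator only when they are empty) and hand its pre-mapped DMA
     * address to the hardware descriptor. */
    static int rx_refill_one(struct page_pool *pool, struct rx_desc *desc)
    {
        struct page *page = page_pool_dev_alloc_pages(pool);

        if (!page)
            return -ENOMEM;

        desc->buf_dma  = page_pool_get_dma_addr(page) + XDP_PACKET_HEADROOM;
        desc->buf_virt = page_address(page) + XDP_PACKET_HEADROOM;
        return 0;
    }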
09:00
Here I reported the mvneta XDP architecture, and you can see that whenever mvneta_poll runs it allocates an XDP buffer, which is the counterpart of an SKB for XDP. mvneta_run_xdp will then run the eBPF program in the eBPF sandbox on our XDP buffer and will return one of the
09:22
XDP verdicts, XDP_PASS, XDP_DROP, XDP_TX or XDP_REDIRECT, and the buffer will be managed accordingly. It's important to notice here that the XDP buffer, the struct xdp_buff, is allocated on the stack
09:40
and not through a kmem_cache as is done for a classic SKB. Now let's go through each possible XDP verdict, starting with XDP_DROP. XDP_DROP is returned by our eBPF program when it wants to drop the packet as fast as it can,
10:04
and the typical use case for XDP_DROP is an anti-DDoS application. We can see here that whenever the program returns XDP_DROP, the packet will be recycled into the in-interrupt cache using page_pool_recycle_direct.
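Putting those pieces together, a driver-side dispatch loosely modelled on mvneta_run_xdp (simplified by us, not the actual mvneta code) could look roughly like this; note the xdp_buff on the stack and the drop path recycling straight into the pool:

    #include <linux/bpf.h>          /* XDP_PACKET_HEADROOM */
    #include <linux/filter.h>       /* bpf_prog_run_xdp() */
    #include <linux/mm.h>           /* page_address() */
    #include <net/xdp.h>
    #include <net/page_pool.h>

    static u32 run_xdp(struct bpf_prog *prog, struct page_pool *pool,
                       struct xdp_rxq_info *rxq, struct page *page, int len)
    {
        struct xdp_buff xdp;        /* lives on the stack, no kmem_cache */
        u32 act;

        xdp.data_hard_start = page_address(page);
        xdp.data            = xdp.data_hard_start + XDP_PACKET_HEADROOM;
        xdp.data_end        = xdp.data + len;
        xdp.rxq             = rxq;
        xdp_set_data_meta_invalid(&xdp);

        act = bpf_prog_run_xdp(prog, &xdp);
        switch (act) {
        case XDP_PASS:
            break;          /* build an SKB and hand it to the stack */
        case XDP_TX:
            break;          /* queue on our TX ring, see the sketch below */
        case XDP_REDIRECT:
            break;          /* xdp_do_redirect() handles this case */
        default:            /* unknown verdict: treat it as a drop */
        case XDP_ABORTED:
        case XDP_DROP:
            page_pool_recycle_direct(pool, page);
            break;
        }
        return act;
    }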
10:23
Moreover, here I reported a comparison between a simple program that just returns XDP_DROP and the same functionality implemented with TC, with a TC filter and a TC action. We can see that with XDP we can almost reach 600 kilopackets per second dropped,
10:46
while with TC we can only drop roughly 180 kilopackets per second. Here we see how XDP_TX works in the mvneta driver; XDP_TX is used to
11:03
transmit the packet back out of the interface where we received it, and the typical application in this case is, for example, a load balancer. We can see that mvneta_xdp_xmit_back will insert the packet into the
11:26
hardware TX ring, and it's not necessary in this case to map the buffer, since it has already been mapped by the page_pool API. We just need to flush the CPU caches, because the device is not coherent.
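A hedged sketch of that XDP_TX path; tx_queue_frame() is an invented placeholder for the driver's real TX-ring insertion, and the interesting part is the sync-for-device on the address the page_pool already mapped:

    #include <linux/dma-mapping.h>
    #include <net/xdp.h>
    #include <net/page_pool.h>

    static int tx_queue_frame(dma_addr_t dma, int len);  /* hypothetical */

    /* The page was DMA-mapped when the pool allocated it, so no mapping
     * here: only sync the bytes we (or the BPF program) touched towards
     * the non-coherent device before pointing a TX descriptor at them. */
    static int xdp_tx_frame(struct device *dev, struct page *page,
                            struct xdp_buff *xdp)
    {
        int len        = xdp->data_end - xdp->data;
        dma_addr_t dma = page_pool_get_dma_addr(page) +
                         (xdp->data - xdp->data_hard_start);

        dma_sync_single_for_device(dev, dma, len, DMA_BIDIRECTIONAL);
        return tx_queue_frame(dma, len);
    }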
11:47
Here we have XDP_REDIRECT. XDP_REDIRECT is used to transmit the packet to, for example, a remote interface, a remote CPU, or even a socket using, for example, AF_XDP.
12:03
The typical use case is, for example, layer 2 forwarding. It's important to notice here that in order to redirect to a remote interface, the device should implement the ndo_xdp_xmit function pointer, and here we have
12:21
the implementation done for mvneta. We notice that in this case it is necessary to remap the buffer, since it has actually been received from a remote device.
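A hedged sketch of what that ndo_xdp_xmit hook has to do for frames arriving from another device; ring management and the exact handling of frames that cannot be sent are omitted, and xmit_one_frame() is an invented placeholder:

    #include <linux/dma-mapping.h>
    #include <linux/netdevice.h>
    #include <net/xdp.h>

    static void xmit_one_frame(dma_addr_t dma, int len);  /* hypothetical */

    /* Redirected frames were mapped for the device they arrived on, not
     * for ours, so each one is remapped with dma_map_single() here. */
    static int sketch_xdp_xmit(struct net_device *ndev, int num_frames,
                               struct xdp_frame **frames, u32 flags)
    {
        struct device *dev = ndev->dev.parent;
        int i, sent = 0;

        if (unlikely(flags & ~XDP_XMIT_FLUSH))
            return -EINVAL;

        for (i = 0; i < num_frames; i++) {
            struct xdp_frame *xdpf = frames[i];
            dma_addr_t dma = dma_map_single(dev, xdpf->data, xdpf->len,
                                            DMA_TO_DEVICE);

            if (dma_mapping_error(dev, dma))
                break;
            xmit_one_frame(dma, xdpf->len);
            sent++;
        }

        /* if (flags & XDP_XMIT_FLUSH) ring the TX doorbell here */
        return sent;
    }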
12:42
The last verdict is XDP_PASS. XDP_PASS is used to send the packet to the standard Linux networking stack, and in the mvneta implementation we can rely on build_skb, so there is no need to reallocate the buffer for the payload of the packet; but when we allocate the buffer using the page_pool API we need to
13:02
take into account also the size of the skb_shared_info.
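A hedged sketch of that XDP_PASS path around build_skb(); the whole page backs the SKB, and build_skb() places the skb_shared_info at the end of the frag_size area, which is exactly why that tailroom had to be part of the buffer budget:

    #include <linux/skbuff.h>
    #include <net/xdp.h>

    /* Wrap the existing page_pool buffer in an SKB instead of copying
     * the payload; headroom and length come from the xdp_buff offsets. */
    static struct sk_buff *xdp_build_skb(struct xdp_buff *xdp)
    {
        struct sk_buff *skb = build_skb(xdp->data_hard_start, PAGE_SIZE);

        if (!skb)
            return NULL;

        skb_reserve(skb, xdp->data - xdp->data_hard_start);
        skb_put(skb, xdp->data_end - xdp->data);
        return skb;
    }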
13:22
What we notice moreover in this slide is that in this particular case we are not able to recycle the buffer yet, since whenever we need to refill the DMA engine with new buffers we have to go through the standard page allocator; but, as Ilias said, this feature is under development. So, in conclusion: we saw some XDP requirements and some basic information about XDP, like the XDP memory model, we saw some basics about the page_pool allocator and how to implement
13:45
each XDP verdict using this API, and we saw the mvneta implementation as a reference. Future work definitely includes adding support for SKB recycling for the XDP_PASS
14:01
verdict; and, for example, regarding mvneta we need to add XDP support for the hardware buffer manager that is available on some devices like the SolidRun ClearFog, native support for AF_XDP, and some interesting bits that are currently
14:22
on the XDP roadmap. Questions? Please. For me or for him? Sorry, I was wondering for me or for him?
14:54
Yes? No, that's Magnus, I don't know if he's in the room. Yeah, he's back there. I can repeat
15:14
the question. One of the restrictions with AF_XDP is that you couldn't use huge pages when you needed to map memory from user space, right? The answer is that you can do it, but it's not
15:39
internally optimized at the moment for AF_XDP. Which interfaces? The veth interfaces?
16:18
veth. That depends on the card you're on. No, I think it's software. A software
16:26
implementation. Yes, I've never tried it actually. We don't have any intention of working on it at the moment. Thank you.