Mixing kool-aids! Accelerate the internet with AF_XDP & DPDK

Formal Metadata

Title
Mixing kool-aids! Accelerate the internet with AF_XDP & DPDK
Title of Series
FOSDEM 2020
Number of Parts
490
Author
Laatz, Kevin
Loftus, Ciara
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Language
English

Content Metadata

Abstract
"With its recent advancements, AFXDP is gaining popularity in the high performance packet processing space. As a result, existing frameworks for packet processing, such as DPDK, are integrating AFXDP support to provide more options for moving packets to user space applications. The challenge with such integration is that both AF_XDP and frameworks like DPDK have their own assumptions and constraints about such things as, for example, how to align or manage packet buffers, making the integration less straight forward than it might appear at first glance. This talk takes a look at the usability of AFXDP pre-kernel v5.4, before diving into the recent challenges we encountered when integrating DPDK and AFXDP, and how we made changes (on both sides) to allow the two to work together in a much more seamless manner."
Transcript: English (auto-generated)
So we'll kick off. Hi everyone. My name is Kevin Laatz. This is Ciara Loftus. We're network software engineers out of Intel in Shannon, which is hidden away in the west of Ireland. Today we're going to mix some kool-aids and hopefully make the internet go a bit quicker.

So most of you probably know what DPDK and AF_XDP are. For those of you who don't, we'll do a quick introduction. DPDK is a set of user space libraries and drivers. They aim to accelerate packet processing workloads, and they run on a variety of CPU architectures. Some important things to remember for this talk are that DPDK supports many different PMDs (poll mode drivers), which are usually device-specific, and that DPDK has its own memory management system. AF_XDP is a kernel-based address family optimized for high-performance packet processing.
AF_XDP has its own sockets in order to move packets from kernel space to user space, and it uses the in-kernel fast path, bypassing the network stack in order to move those packets quickly. So if we take a closer look at a simplified diagram of the traditional DPDK model, down at the bottom in kernel space, we have
DPDK-specific kernel modules. They interact with the NICs and expose them to user space. In user space, then we have all of our DPDK PMDs and our applications, and they work together in order to do whatever wonderful things you want to do with your packets.
The aim of this work was to introduce and use the new DPDK AF_XDP PMD, which instead talks to the NIC through its kernel driver, so you can still use all of your usual kernel tools, like ifconfig and so on.
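As a rough illustration of what that looks like from an application's point of view: a DPDK app selects the AF_XDP PMD by requesting a net_af_xdp virtual device from the EAL. Below is a minimal sketch, with the interface name eth0 as a placeholder and error handling trimmed:

```c
#include <stdio.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>

int main(int argc, char **argv)
{
	(void)argc;

	/* Create an AF_XDP virtual device bound to eth0; this is the
	 * programmatic equivalent of passing --vdev on the command line. */
	char *eal_args[] = {
		argv[0],
		"--vdev=net_af_xdp,iface=eth0",
		"--no-pci",
	};

	if (rte_eal_init(3, eal_args) < 0) {
		fprintf(stderr, "EAL init failed\n");
		return EXIT_FAILURE;
	}

	/* From here the AF_XDP port looks like any other DPDK ethdev:
	 * rte_eth_dev_configure(), rte_eth_rx_burst(), and so on. */
	printf("%u ethdev port(s) available\n", rte_eth_dev_count_avail());
	return 0;
}
```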
So the goal of this work was to have all DPDK applications working out of the box with the new AF_XDP PMD, and of course it should do so with good performance. The performance we were aiming for was close to or on par with the kernel sample app xdpsock. The challenge was that frameworks like DPDK have their own memory management, as mentioned, and that comes with constraints and assumptions of its own. For DPDK specifically, there is a discrepancy between the DPDK and AF_XDP buffer alignment, which prevents us from mapping DPDK memory buffers directly to AF_XDP UMEMs. To do this mapping anyway, we needed extra work and complexity, which negatively impacts performance.

Okay, so I'm going to talk about how both AF_XDP and DPDK lay out their memory for packet handling. I'll talk about the differences between the two and why those differences pose the integration challenges which Kevin just touched on.

So AF_XDP has this concept of a UMEM, or user memory, which is essentially an area of memory allocated by the user for packet data. The UMEM is split up into equal-sized chunks, with each chunk being used to hold data from a particular packet. The way it's used is: on the receive path, the kernel places packet data into a chunk for the user space process to retrieve (in our case, that user space process is DPDK), and on the transmit path, the user space process places packet data into a chunk for the kernel NIC driver to transmit.
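For a feel of the mechanics, here is a sketch of the receive side using the xsk helper functions that ship with libbpf (xsk.h; in newer trees they live in libxdp). It assumes the socket, UMEM, and fill ring are already set up; the name rx_sketch is illustrative:

```c
#include <stdint.h>
#include <bpf/xsk.h>   /* libbpf's AF_XDP helpers */

/* Sketch: drain up to 'batch' packets from an AF_XDP RX ring.
 * 'rx' is the socket's RX ring, 'umem_area' the mmap'd UMEM base. */
static void rx_sketch(struct xsk_ring_cons *rx, void *umem_area,
                      unsigned int batch)
{
	uint32_t idx = 0;
	unsigned int rcvd = xsk_ring_cons__peek(rx, batch, &idx);

	for (unsigned int i = 0; i < rcvd; i++) {
		const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx + i);
		void *pkt = xsk_umem__get_data(umem_area, desc->addr);

		/* ... process desc->len bytes at 'pkt' ... */
		(void)pkt;
	}

	/* Hand the descriptors back to the kernel. */
	xsk_ring_cons__release(rx, rcvd);
}
```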
Prior to kernel 5.4, this UMEM, the area of memory that AF_XDP uses to hold packet data, had a number of restrictions on its size and alignment. The first was that the start address of the UMEM had to be page-size aligned, which is going to be 4K in most cases. The chunks within the UMEM had to be power-of-two sized, and as a side effect of that, the chunks could not cross page boundaries. In a networking use case, that really leaves you with only two potential chunk size options: 2K or 4K. Anything bigger than 4K and you're going to cross the page boundary, and anything smaller than 2K isn't big enough for a networking packet. So in this example here, we've got a chunk size of 2K: two 2K chunks per 4K page, and as you can see, none of the chunks cross a page boundary.
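To make those constraints concrete, here is a sketch of a pre-5.4-style UMEM registration over the raw socket API from <linux/if_xdp.h>, assuming a libc that exposes AF_XDP and SOL_XDP. The area is page-aligned (mmap returns page-aligned memory) and the chunks are a 2K power of two:

```c
#include <stdint.h>
#include <linux/if_xdp.h>
#include <sys/socket.h>
#include <sys/mman.h>

#define NUM_CHUNKS 4096
#define CHUNK_SIZE 2048   /* power of two, no page-boundary crossing */

static int setup_umem_sketch(void)
{
	int fd = socket(AF_XDP, SOCK_RAW, 0);
	if (fd < 0)
		return -1;

	/* Page-aligned allocation: every 4K page holds exactly two
	 * 2K chunks, so no chunk can straddle a page boundary. */
	void *area = mmap(NULL, NUM_CHUNKS * CHUNK_SIZE,
			  PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return -1;

	struct xdp_umem_reg reg = {
		.addr = (unsigned long long)(uintptr_t)area,
		.len = NUM_CHUNKS * CHUNK_SIZE,
		.chunk_size = CHUNK_SIZE,
		.headroom = 0,
	};
	return setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &reg, sizeof(reg));
}
```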
Everything is nice and neat and tidy. The reason for these restrictions is essentially that they make calculations in the kernel a little bit easier: when everything is nicely aligned, you can use things like masks, etc. Okay, so let's see how DPDK lays out its memory for packet handling, and see if it satisfies the requirements of the AF_XDP UMEM. DPDK, as many of you know (we're in the SDN room), holds packet data inside structures known as memory buffers, or mbufs for short. A group of those together is known as an mbuf pool. DPDK mbuf pools don't have restrictions as strict as the AF_XDP UMEM's. For instance, mbufs can be of any size within reason, and they can have arbitrary alignment relative to the page size, so they can cross page boundaries.
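For instance, nothing stops a DPDK application from creating a pool like the following sketch, whose 3.5K data room (mirroring the example on the slide) no pre-5.4 UMEM could accommodate:

```c
#include <rte_mbuf.h>
#include <rte_lcore.h>

/* Sketch: an mbuf pool with a ~3.5K data room. DPDK places no
 * power-of-two or page-boundary requirement on this size. */
static struct rte_mempool *create_example_pool(void)
{
	return rte_pktmbuf_pool_create(
			"example_pool",
			4096,                        /* number of mbufs */
			256,                         /* per-core cache size */
			0,                           /* private area size */
			3584 + RTE_PKTMBUF_HEADROOM, /* data room: 3.5K + headroom */
			rte_socket_id());
}
```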
So in this example here, we've got an mbuf size of maybe 3.5K, and our mbufs are crossing page boundaries all over the place. And I suppose, why do we care whether or not the DPDK mbuf pool satisfies the requirements of the AF_XDP UMEM? The reason is that in order to get the highest-performing integration of AF_XDP and DPDK, we need to map the mbuf pool directly into the UMEM to get a zero-copy data path, which is obviously going to be the most performant. But as you can see here, that's not possible at the moment. This is just one example of a DPDK mbuf pool; there are plenty more examples with different-sized mbufs and different alignments, and most of them won't comply with the restrictions of the UMEM. To get around this, the clever folks in the DPDK community have come up with a number of solutions to get the two to integrate and work together, each with a varying degree of success in terms of performance.

The first solution that was considered was copy mode. In this mode, we allocate memory for our UMEM, and we also allocate our DPDK mbuf pool as normal, and we simply memcpy between the two locations in memory. This works well, but it's not the most performant approach, due to the cycle cost of the memcpy being pretty high. Nevertheless, it made it into the DPDK 19.05 release as part of the series that initially introduced AF_XDP support.
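A sketch of what copy mode costs on the receive side: every packet is copied out of its UMEM chunk into a freshly allocated mbuf. The function name is illustrative, not the PMD's actual internals:

```c
#include <string.h>
#include <rte_mbuf.h>

/* Sketch: copy-mode RX. 'chunk' points at packet data in the UMEM,
 * 'len' is the packet length reported by the RX descriptor. */
static struct rte_mbuf *copy_rx_sketch(struct rte_mempool *mp,
				       const void *chunk, uint16_t len)
{
	struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
	if (m == NULL)
		return NULL;

	/* This memcpy is the per-packet cost that zero copy avoids. */
	memcpy(rte_pktmbuf_mtod(m, void *), chunk, len);
	m->data_len = len;
	m->pkt_len = len;
	return m;
}
```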
The second approach that was looked into was an alignment API. It was proposed to introduce a new API into DPDK which allowed you to specify the type of alignment you wanted for your mbuf pool. Any application that you wanted to work with AF_XDP could then use this new API and mold its mbuf pool to fit the UMEM requirements. Then you could do the one-to-one mapping and get your zero-copy performance. But even though this gave really good performance, it was deemed too invasive: you had to change your application to get it to work, which went against what Kevin said at the start about apps needing to work out of the box. So that didn't get into a DPDK release either, but it generated a good discussion on the mailing list, which led to the third approach.

I think it was suggested by Olivier Matz and implemented by Xiaolong Ye. This approach uses DPDK's external mbuf feature, which allows a DPDK mbuf, instead of holding the packet data in the structure itself, to point to a different location in memory. In this case, we point to our UMEM chunk, and then you can achieve your zero copy. However, there are still additional cycles with this solution: there's additional complexity involved in attaching and detaching that external piece of memory from your mbuf.
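Per packet, the external-mbuf path looks roughly like the following; the attach (and the later detach on free) is the extra cycle cost being described. A simplified sketch using DPDK's rte_pktmbuf_attach_extbuf, with refcounting details elided:

```c
#include <rte_mbuf.h>

/* Sketch: make an mbuf point at a UMEM chunk instead of its own
 * built-in data room. 'shinfo' carries the free callback/refcount. */
static void attach_umem_chunk(struct rte_mbuf *m,
			      void *chunk, rte_iova_t iova, uint16_t len,
			      struct rte_mbuf_ext_shared_info *shinfo)
{
	/* After this, mtod(m) resolves into the UMEM: zero copy, but
	 * the attach/detach itself isn't free in cycles. */
	rte_pktmbuf_attach_extbuf(m, chunk, iova, len, shinfo);
	rte_pktmbuf_reset_headroom(m);
}
```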
But then again, it does give a really good improvement over copy mode, I think 29% for a certain use case. So it made it into DPDK 19.08 as a first-generation AF_XDP zero-copy solution.

At this point, we felt that the community had taken DPDK as far as it could in terms of performance with AF_XDP, but there was still some performance left on the table, some cycles to save. So we decided it would be a good idea to start looking at the kernel side of things, and at adapting the UMEM to make it more flexible, to work with the flexibility of DPDK, as opposed to trying to make DPDK fit the narrow restrictions of the UMEM. So what did we do in the kernel when we finally took off our DPDK hats?

We took a look at the original UMEM and its constraints. Page-size alignment was one major restriction that we had to lift, so we enabled arbitrary chunk alignment: you can now align your chunks anywhere you want within the UMEM. As part of this, we allowed arbitrary chunk sizing as well, so now you can size and align your chunks however you want within the UMEM, which is much more flexible than the original. With this, we also had to allow the crossing of page boundaries, so we now need to keep track of whether pages are physically contiguous in memory or not. If they aren't, take chunk three in this case: let's assume page three is not contiguous with page two, so the chunk would cross into a non-contiguous memory region. We can't use that address, so we discard it and use the start of the next page instead.
So we do have a gap in memory. That is just one of the side effects of this added flexibility, but a lot of the time you're going to be much better off with it. With this, we also needed to change the AF_XDP RX and TX descriptor. One of the fields within this descriptor is the address field, which is simply an offset into the UMEM of where your chunk is placed. As the packet travels through the data path, various offsets are added onto this. In the original design, the offsets were added directly to the address field, so the value would change as it made its way through the data path. At the end, when we recycled the buffer, we could simply mask back to 2K, 4K, or whatever your alignment was, because it was a power of two. That isn't possible anymore without doing complex calculations, seeing as we now have arbitrary sizing and alignment. So we moved to a model where we take the upper 16 bits of the address field and store the offset there, rather than adding it to the address field, and we keep the lower 48 bits purely for the original base address. That still gives us 256 terabytes of address space, so we have more than enough for now. What this enables us to do, when we're recycling the buffer, is simply mask off the upper 16 bits: we have our original address, and we're back to where we were.
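Kernel 5.4 exposes this split through two constants in <linux/if_xdp.h>, so recovering or packing the offset is a shift and a mask. A sketch:

```c
#include <stdint.h>
#include <linux/if_xdp.h>

/* With unaligned chunks, a descriptor address packs the data offset
 * into the upper 16 bits (XSK_UNALIGNED_BUF_OFFSET_SHIFT == 48) and
 * the base address into the lower 48 bits. */

static uint64_t desc_base(uint64_t addr)
{
	/* Recover the original chunk address for buffer recycling. */
	return addr & XSK_UNALIGNED_BUF_ADDR_MASK;
}

static uint64_t desc_data_addr(uint64_t addr)
{
	/* The packet data itself lives at base + offset. */
	return (addr & XSK_UNALIGNED_BUF_ADDR_MASK)
	     + (addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT);
}

static uint64_t desc_with_offset(uint64_t base, uint64_t offset)
{
	/* Stash the offset in the upper 16 bits instead of adding it
	 * to the address, so no power-of-two mask is needed later. */
	return base | (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
}
```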
All of this makes the UMEM a lot more flexible, so we can map directly into it. It really gives us a much more seamless integration with existing frameworks such as DPDK. As Kevin said, now that we've relaxed our UMEM alignment constraints, we can map our DPDK mbuf pools, no matter what size they are, directly into the UMEM. Using our example from earlier with the 3.5K mbuf, we can size our UMEM chunk to match that, or whatever the mbuf size is.
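With the new XDP_UMEM_UNALIGNED_CHUNK_FLAG, the registration sketch from earlier can take the mbuf geometry as-is. A sketch, with the chunk size parameterized (e.g. 3584 bytes to match the 3.5K mbuf example):

```c
#include <stdint.h>
#include <linux/if_xdp.h>
#include <sys/socket.h>

/* Sketch: post-5.4 UMEM registration. The chunk size matches the mbuf
 * data room (not a power of two), and chunks may now cross page
 * boundaries, so the mbuf pool's own memory can back the UMEM. */
static int register_unaligned_umem(int xsk_fd, void *pool_mem,
				   uint64_t len, uint32_t chunk_size)
{
	struct xdp_umem_reg reg = {
		.addr = (unsigned long long)(uintptr_t)pool_mem,
		.len = len,
		.chunk_size = chunk_size,             /* e.g. 3584 */
		.headroom = 0,
		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
	};
	return setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_REG, &reg, sizeof(reg));
}
```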
We get our seamless zero copy, and we don't need to modify our existing DPDK applications; they're going to work out of the box. Those were the two key goals that we outlined at the start of this work. We've achieved them, and in doing so, we've got a solution that is both performant and portable. In terms of performance, this solution gives a 60% improvement over copy mode, the first approach I showed earlier, and a further 24% over the first-generation zero copy in 19.08, which used the external mbuf feature. It's a pretty significant performance improvement, and the feature itself is available in DPDK 19.11, which is the most recent DPDK release. Provided you have kernel 5.4, this feature will be available; if you don't, DPDK will simply fall back to copy mode. I think we're out of time.

Just a quick note before we end. A lot of people ask: what is the value of integrating AF_XDP into DPDK? DPDK, as many of you know, provides an application developer with a wide variety of functionality: things like memory and power management, crypto, virtual networking, QoS; the list goes on. AF_XDP then provides unrivalled flexibility, and Magnus touched on this in his presentation first thing this morning. In contrast with the typical DPDK usage model, the NIC remains bound to the kernel driver, so we can avail of the kernel control paths and keep using our familiar tools, like ifconfig, ethtool, et cetera. That has a huge impact on the usability of an application and of the solution as a whole. Together, essentially,
the best of both worlds can be enjoyed, and we can get applications that are high-performing, portable, fully featured, accelerated, insert buzzword here. Yeah, so I think they're just a good combination together. And then just to close, a couple of words of thanks to some people that helped myself and Kevin along the way with this work.
Magnus and Björn on the kernel side, Bruce, Qi, and Xiaolong on the DPDK side, and the DPDK and kernel communities as a whole. Yeah, that's it from myself and Kevin. Yep, cool.
[Audience question, inaudible.] So it really depends on the workload. Off the top of my head, I don't have a number; it really does depend on the workload. If it's a heavy workload, they could be pretty close. If it's something like testpmd or l2fwd, there's going to be a bigger delta. I'm trying to think... I think we might have some data published soon. We're running some benchmarks at the moment, so that should be public soon enough.