Mixing kool-aids! Accelerate the internet with AF_XDP & DPDK
Formal Metadata

Title: Mixing kool-aids! Accelerate the internet with AF_XDP & DPDK
Number of Parts: 490
Authors: Ciara Loftus, Kevin Laatz
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: DOI 10.5446/46949
FOSDEM 2020 (107 / 490)
Transcript: English (auto-generated)
00:05
So we'll kick off. Hi everyone. My name is Kevin Laatz. This is Ciara Loftus. We're network software engineers out of Intel in Shannon, which is hidden away in the west of Ireland. Today we're going to mix some kool-aids and hopefully make the internet go a bit quicker.
00:24
So most of you probably know what DPDK and AF_XDP are. For those of you who don't, we'll do a quick introduction. So DPDK is a set of user space libraries and drivers. They aim to accelerate packet processing workloads, and they run on a variety of CPU architectures.
00:43
Some important things to remember for this talk are that DPDK supports many different PMDs, which are usually device-specific, and that DPDK has its own memory management system. AF_XDP is a kernel-based address family optimized for high-performance packet processing.
01:04
AF_XDP has its own sockets in order to move packets from kernel space to user space, and it uses the in-kernel fast path, so it bypasses the network stack in order to move those packets quickly. So if we take a closer look at a simplified diagram of the traditional DPDK model, down at the bottom in kernel space, we have
01:24
DPDK-specific kernel modules. They interact with the NICs and expose them to user space. In user space, then we have all of our DPDK PMDs and our applications, and they work together in order to do whatever wonderful things you want to do with your packets.
01:45
The aim of this work was to introduce and use the new DPDK AF_XDP PMD, which talks directly to your NIC driver, so you can still use all of your usual kernel tools that you like using, like ifconfig and so on.
02:05
So the goal of this work was to have all DPDK applications working out of the box with the new AF_XDP PMD, and of course it should do so with good performance. The performance we were aiming for was close to or on par with the
02:22
kernel sample application xdpsock. The challenge with this was that frameworks like DPDK have their own memory management, as mentioned, and these come with constraints and assumptions of their own. For DPDK specifically, we have a discrepancy between the
02:42
DPDK and AF_XDP buffer alignments, and this prevents us from mapping DPDK memory buffers directly to AF_XDP UMEMs. In order to do this mapping, we needed to do some extra work, which added complexity and negatively impacted performance.
03:03
Okay, so I'm going to talk about how both AF_XDP and DPDK lay out their memory for packet handling. I'll talk about the differences between the two and why those differences pose the integration challenges which Kevin just touched on there.
03:21
So AF_XDP has this concept of a UMEM, or user memory, and it's essentially an area of memory allocated by the user for packet data. The UMEM is split up into equal sized chunks, with each chunk being used to hold data from a particular packet, and
03:40
how it's used is, for instance, on the receive path, the kernel will place packet data into a chunk for the user space process to retrieve, and in our case, our user space process is DPDK, and on the transmit path, the user space process places packet data into a chunk for the kernel NIC driver to transmit.
04:03
Prior to kernel 5.4, this UMEM, this area of memory that AF_XDP uses to hold packet data, had a number of restrictions on it in terms of its size and its alignment. The first being that the start address of the UMEM had to be page size aligned, so that's going to be 4K in most cases.
04:24
The chunks within the UMEM had to be power of two sized, and as a side effect of that, the chunks could not cross page boundaries. In a networking use case, that leaves you really with only two potential chunk size options, either 2K or 4K.
04:43
Anything bigger than 4K and you're going to cross the page boundary, and anything smaller than 2K isn't big enough for a typical network packet. So in this example here, we've got a chunk size of 2K. We have two 2K chunks per 4K page, and as you can see, none of the chunks are crossing the page boundary.
05:05
Everything is nice and neat and tidy. The reason for these restrictions is essentially it just makes calculations in the kernel a little bit easier. When everything is nicely aligned, you can use things like masks, etc. Okay, so let's
05:21
see how DPDK lays out its memory for packet handling, and see if it satisfies the requirements of the AF_XDP UMEM. So DPDK, as many of you know since we're in the SDN room, holds packet data inside structures known as memory buffers, or mbufs for short.
05:41
And a group of those together is known as an mbuf pool. DPDK mbuf pools don't have restrictions as strict as the AF_XDP UMEM's. So for instance, mbufs can be of any size within reason, and they can have arbitrary alignment relative to the page size, so they can cross page boundaries.
06:03
So in this example here, we've got an mbuf size of maybe 3.5K, and our mbufs are crossing page boundaries all over the place. And I suppose, why do we care whether or not the DPDK mbuf pool satisfies the requirements of the AF_XDP UMEM?
06:22
And the reason is that in order to get the highest performing integration of AF_XDP and DPDK, we need to map the mbuf pool directly into the UMEM to get a zero copy data path, which is obviously going to be the most performant. But as you can see here, that's not possible at the moment.
06:42
This is just one example of a DPDK mbuf pool. There are plenty more examples of different sized mbufs and different alignments, and most of them won't comply with the restrictions of the UMEM. But to get around this, the clever folks in the DPDK community have come up with a number of solutions
07:01
to get them to integrate and work together. Each of them has a varying degree of success in terms of performance. So the first solution that was considered was copy mode. In this mode, we allocate memory for our UMEM, and we also allocate our DPDK mbuf pool as normal.
07:23
And we simply memcpy between the two locations in memory. This works really well, but it's not the most performant, just due to the cycle cost of the memcpy being pretty high. But nevertheless, it made it into a DPDK release in 19.05
07:41
as part of the series that initially introduced AF_XDP support. The second approach that was looked into was this alignment API. It was proposed to introduce a new API into DPDK which allowed you to specify the type of alignment you wanted for your mbuf pool.
08:04
Then any application you wanted to work with AF_XDP could use this new API and mold its mbuf pool to fit the UMEM requirements. Then you could do the one-to-one mapping, and you could get your zero copy performance.
08:20
But even though this did give really, really good performance, it was deemed a bit too invasive, so it didn't make it into a DPDK release. It was invasive because you had to change your application to get it to work, which went against what Kevin said at the start about apps needing to work out of the box. So that didn't get into a DPDK release, but it generated a good discussion on the mailing list, which led to this third approach.
08:46
I think it was suggested by Olivier Matz and implemented by Xiaolong Ye. This approach uses DPDK's external mbuf feature, which allows a DPDK mbuf, instead of holding the packet data in the structure itself,
09:02
to point to a different location in memory. In this case, we'll be pointing to our UMEM chunk, and then you can achieve your zero copy. However, there are still additional cycles with this solution; there's additional complexity involved in attaching and detaching that external piece of memory from your mbuf.
09:26
But then again, it does give a really good improvement over copy mode, I think 29% for a certain use case. So it made it into DPDK 19.08 as kind of a first-generation AF_XDP zero copy solution.
09:42
At this point, we felt that the community had taken DPDK as far as it could in terms of performance with AF_XDP. But we still felt that there was some performance left on the table, some cycles to save. So at that point, we decided it would be a good idea to start looking at the kernel side of things,
10:01
and maybe looking at adapting the UMEM to make it a bit more flexible, to work with the flexibility of DPDK, as opposed to trying to make DPDK fit the narrow restrictions of the UMEM. So what did we do in the kernel when we finally took off our DPDK hats?
10:23
We took a look at the original UMEM and its constraints. Being page size aligned was one major restriction that we had to lift, so we enabled arbitrary chunk alignment; you can now align your chunks anywhere you want within the UMEM.
10:44
As a part of this, we allowed arbitrary chunk sizing as well. So now you can size and align however you want within the UMEM, much more flexible than the original. With this, we also had to allow the crossing of page boundaries, so we now need to keep track of whether pages are physically contiguous in memory or not.
11:04
If they aren't contiguous, like chunk three in this case, let's assume page three is non-contiguous to page two, then it will cross into a non-contiguous memory region, so we can't use that address, we discard it to get a new one, and we use the start of the next page.
11:22
So we do have a gap in memory. This is just one of the side effects of this kind of added flexibility, but a lot of the time you're going to be a lot better off with it. With this, we also needed to change the AF_XDP RX and TX descriptor.
11:43
One of the fields within this descriptor is the address field. This is simply an offset into the UMEM of where your chunk is placed. As the packet travels through the data path, various offsets are added onto this. In the original design of this,
12:02
the offsets were added directly to the address field, so the value would change as it made its way through the data path. At the end of it, when we recycled the buffer, we could simply mask back to 2K, 4K, whatever your alignment was, because it was a power of two. This isn't possible anymore without doing complex calculations,
12:21
seeing as we have arbitrary sizing and alignment. We moved to a model where we took the upper 16 bits of the address field and stored the offset there rather than adding it to the address field. We kept the lower 48 bits purely for the original base address, or the original offset as it was.
12:43
This still gives us 256 terabytes of address space, so we've more than enough for now. What this enables us to do is basically just when we're doing the buffer recycling, just mask off the upper 16 bits. We have our original address, and we're back to where we were.
13:01
All of this makes the UMEM a lot more flexible, so we can map directly into it. It really gives us a much more seamless integration with existing frameworks such as DPDK. As Kevin said, now that we've relaxed
13:22
our UMEM alignment constraints, we can now map our DPDK mbuf pools, no matter what size they are, directly into the UMEM. Using our example from earlier with our 3.5K mbuf, we can size our UMEM chunk to match that, or whatever the mbuf size is.
13:41
We can get our seamless zero copy, and we don't need to modify our existing DPDK applications. They're going to work out of the box. Those were two key goals that we outlined at the start of this work. We've achieved those, and in achieving those, we've got both a performant and portable solution.
14:03
In terms of performance, this solution gives a 60% improvement on the copy mode, the first one that I showed earlier, and a further 24% on top of the first generation zero copy, which was in 19.08 and used the external mbuf feature.
14:20
It's a pretty significant performance improvement, and the feature itself is available in DPDK 19.11, which is the most recent DPDK release. Provided you have kernel 5.4, this feature will be available. If you don't, DPDK will simply fall back to copy mode. I think we're out of time.
14:43
Just a quick note before we end. A lot of people ask, what is the value of integrating AF_XDP into DPDK? DPDK, as many of you know, provides an application developer with a wide variety of functionality for an application.
15:02
Things like memory and power management, crypto, virtual networking, QoS; the list goes on. AF_XDP then provides unrivaled flexibility, and Magnus touched on this in his presentation first thing this morning. In contrast with the typical DPDK usage model,
15:22
the NIC remains bound to the kernel driver, so we can avail of the kernel control paths and have use of our familiar tools, like ifconfig, ethtool, et cetera. That has a huge impact on the usability of an application and a solution as a whole. Together, essentially,
15:41
the best of both worlds can be enjoyed, and we can get applications that are high-performing, portable, fully featured, accelerated, insert buzzword here. Yeah, so I think they're just a good combination together. And then just to close, a couple of words of thanks to some people that helped myself and Kevin along the way with this work.
16:02
Magnus and Björn on the kernel side, Bruce, Qi, and Xiaolong on the DPDK side, and the DPDK and kernel communities as a whole. Yeah, that's it for myself and Kevin. Yep, cool.
16:29
So it really depends on the workload. Off the top of my head, I don't have a number. It really does depend on the workload. Like if it's a heavy workload,
16:41
they could be pretty close. If it's something like testpmd or l2fwd, there's going to be a bigger delta. I think we might have some data published soon; we're running some benchmarks at the moment, so that should be public soon enough.