
Faster memory reclamation with DPDK RCU


Formal Metadata

Title: Faster memory reclamation with DPDK RCU
Subtitle: Comparing the DPDK and Userspace RCU libraries
Number of Parts: 287
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract: DPDK added an RCU library with a novel method to reclaim resources. We have been running tests to understand the performance differences between the DPDK RCU and the Userspace RCU library. In our tests, we find that DPDK RCU can perform reclamation faster and performs significantly better when pre-emptive readers are involved. Beyond performance, DPDK RCU has several advantages, such as not requiring a background thread for reclaiming resources and the ability to integrate with existing libraries without having to modify the application. This talk will present various tests run on the DPDK RCU and Userspace RCU libraries and their results. It will go into the details of the pre-emptive reader problem, which affects use cases beyond DPDK, and show that the DPDK RCU library can reduce the reclamation time.
Transcript: English (auto-generated)
Hi, my name is Nathan Brown, and I'm a software engineer at Arm. I'm really excited to talk to you today about faster memory reclamation with DPDK RCU. Specifically, I'll be comparing the DPDK and Userspace RCU libraries.
First, I'm going to start off with a brief background of what RCU is and what problem it solves. Then I'll talk about the key advantages DPDK RCU brings to the table. Next, I'll walk us through some performance results I have gathered comparing DPDK and Userspace RCU, and then we'll have some time set aside for Q&A.
Before I get started, I'd like to thank my mentors at Arm, Dharmik and Honnappa, as well as the broader networking team I'm on. With all of their help and feedback, this presentation wouldn't have been possible, so thank you. Alright, let's get started. What is RCU? Well, as with most things, we can start with a data structure,
but since we're implementing a multithreaded and high-performance application, we're going to want this to be a lock-free data structure. And while in general RCU can work with any data structure you provide, for simplicity in this presentation, let's just stick to a linked list that has three elements: A, then B, followed by C.
Now, we can have multiple readers traversing through this linked list, accessing elements, more or less doing what they please. And because it's lock-free, we can also have a writer thread that is about to do some updates. However, this raises a very interesting question. How can the writer thread safely remove and free element B?
Well, the lock-free linked list algorithm tells us how to remove element B from the list. We can atomically update A's next pointer to point to C instead of B. In this way, new readers such as Reader 1 will only be able to see elements A, then C in the list, and won't be able to discover B.
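To make the unlink step concrete, here is a minimal sketch using C11 atomics, assuming a simple hand-rolled node type rather than any particular lock-free list library:

```c
#include <stdatomic.h>

/* Hypothetical node type for illustration; not DPDK's list code. */
struct node {
    int value;
    _Atomic(struct node *) next;
};

/* Unlink b by atomically pointing a's next at c. New readers
 * starting from a can no longer discover b, while readers already
 * holding b still see a well-formed list and can move on to c. */
static void unlink_middle(struct node *a, struct node *b, struct node *c)
{
    atomic_store_explicit(&a->next, c, memory_order_release);
    (void)b; /* b must NOT be freed yet: readers may still reference it */
}
```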
However, current readers like Reader 2 and Reader 3 that are already accessing element B will still be able to see a well-formed list and traverse onwards to element C when they're ready. But this idea brings about a bit of confusion. How can we safely free element B? We know we can't do it right now, because if we free element B,
and Reader 2 and/or Reader 3 then attempts to iterate forward in the list, they'll end up accessing element B after it has already been freed. In short, we'll have a use-after-free error. So we're going to have to wait. But how long do we wait? Well, we need to wait until all readers have stopped referencing element B.
Only at that point in time will it become safe to free element B. But programmatically, how can we do that? How do we know how long to wait? That's what RCU provides. RCU is a suite of mechanisms that enable you to safely reclaim memory in lock-free contexts
by letting you know how long you need to wait until that memory is safe to free. The way it works is that readers report quiescent states to signal when they're no longer accessing shared state. So when Reader 2 and Reader 3 in our previous example stopped accessing elements on the list, they both entered a quiescent state.
A grace period for a deleted element is the time between the element being deleted and all readers reporting a quiescent state. This ends up being the lower bound on the delay between deleting the element and being able to reclaim its memory. In short, it's only safe to reclaim the memory after this grace period ends.
Now, we also like to measure the delete-free delay, which is the actual delay between deletion and freeing for an element. While in theory this is lower-bounded by the grace period, the actual realized delete-free delay may be higher than that due to the RCU implementation.
Pictorially, it might look something like this. We have Reader 1 and Reader 2 entering quiescent states, then accessing the data structure, quiescent states, data structure, quiescent states, so on and so forth. When the delete happens, which is the far left bar here, the time until the grace period is up is how long it takes for both readers to enter the quiescent state in green.
However, the freeing may occur much later, and thus the delete-free delay may become longer than the actual grace period. Userspace RCU (liburcu) is the most popular userspace implementation of RCU at this point in time.
It's available at this URL. Its API is based on the Linux kernel RCU API. In it, synchronize_rcu is a function that starts grace period detection and blocks until the grace period is over. So in our element B example, the writer might delete element B, then call synchronize_rcu to start and block until the grace period is over.
Then once synchronize_rcu returns, since the grace period is over, that means the memory occupied by element B is now safe to reclaim. So the writer is able to free element B. However, we're blocking in the writer thread, and that might not be the most desirable thing. So Userspace RCU also provides the call_rcu API, where you provide a callback,
and that callback will be invoked by a background thread after the grace period is over. So the writer might delete element B and then call call_rcu, providing enough information such that the background thread can then execute the callback
and free the memory associated with element B once it is safe to do so. However, with this background-thread approach in Userspace RCU, the grace period detection might not start immediately, and thus the delete-free delay might end up being longer.
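As a rough sketch of these two Userspace RCU patterns, assuming liburcu's QSBR flavor and a hypothetical element type, the writer-side code might look like this:

```c
#include <stdlib.h>
#include <urcu-qsbr.h>       /* QSBR flavor of Userspace RCU (liburcu) */
#include <urcu/compiler.h>   /* caa_container_of */

/* Hypothetical element type for illustration. */
struct element {
    int key;
    struct rcu_head rcu;     /* required for call_rcu() */
};

/* Option 1: block in the writer until the grace period ends. */
static void delete_blocking(struct element *b)
{
    /* ... b has already been unlinked from the data structure ... */
    synchronize_rcu();       /* blocks until all readers report quiescence */
    free(b);                 /* safe: no reader can still hold b */
}

/* Option 2: defer to liburcu's background thread. */
static void free_cb(struct rcu_head *head)
{
    free(caa_container_of(head, struct element, rcu));
}

static void delete_deferred(struct element *b)
{
    /* ... b has already been unlinked ... */
    call_rcu(&b->rcu, free_cb);  /* callback runs after the grace period */
}
```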
DPDK RCU, on the other hand, is based around the idea of tokens, where a token more or less correlates one-to-one to an element. You can get a token via RCU start, and this will start grace period detection for that token. Then, at a later point in time, you can call RCU check and provide a token given to you by RCU start, and RCU check will tell you if the grace period for that token is over. So the way our writer might use this is: it will delete element B from the list, then get token B via RCU start. Now, it knows that it has to wait some time for the grace period to end. So in the meantime, it's going to do some other useful work. Then it can come back and check: hey, is this grace period over? If it is, it knows it's safe to free element B. Otherwise, it can continue doing useful work.
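A minimal sketch of this token flow, using DPDK's QSBR API (rte_rcu_qsbr_start and rte_rcu_qsbr_check) with a hypothetical element type and a QSBR variable shared with the readers:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <rte_rcu_qsbr.h>

/* Hypothetical element type; `v` is the QSBR variable the readers
 * report quiescent states on. */
struct element { int key; };

static void writer_delete(struct rte_rcu_qsbr *v, struct element *b)
{
    /* ... b has already been unlinked from the data structure ... */

    /* "RCU start": begin grace period detection, get a token for b. */
    uint64_t token = rte_rcu_qsbr_start(v);

    /* "RCU check" without blocking: poll, doing useful work between polls. */
    while (rte_rcu_qsbr_check(v, token, false) != 1) {
        /* do_other_useful_work();  (hypothetical) */
    }

    free(b); /* the grace period for this token is over */
}
```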
An immediate advantage you may have noticed is the increased flexibility DPDK RCU brings. We weren't blocking in our writer thread, but we also didn't have to rely on yet another background thread to manage our memory reclamation for us. Of course, there's nothing stopping us from having a background thread dedicated to memory reclamation, but we don't need to have one either.
The way that we can use a background thread in DPDK RCU is: we generate the tokens in the writer and then pass these tokens off to the background thread. The background thread then consistently checks the tokens and executes callbacks to reclaim memory as necessary. This is somewhat similar to how Userspace RCU's background thread works.
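A sketch of that handoff, assuming a plain rte_ring as the token queue and a hand-rolled work-item struct (DPDK also ships a defer-queue helper, but this illustrates the idea):

```c
#include <stdlib.h>
#include <rte_ring.h>
#include <rte_rcu_qsbr.h>

/* Illustrative work item; not a DPDK API. */
struct reclaim_item {
    uint64_t token;
    void *elem;
};

/* Writer: start detection and hand the token off to the reclaimer. */
static void writer_defer(struct rte_rcu_qsbr *v, struct rte_ring *dq, void *elem)
{
    struct reclaim_item *it = malloc(sizeof(*it));
    it->token = rte_rcu_qsbr_start(v);
    it->elem = elem;
    rte_ring_enqueue(dq, it);
}

/* Background thread: tokens are handed off in order, so block on each. */
static void reclaimer_loop(struct rte_rcu_qsbr *v, struct rte_ring *dq)
{
    struct reclaim_item *it;
    for (;;) {
        if (rte_ring_dequeue(dq, (void **)&it) != 0)
            continue;                           /* nothing pending yet */
        rte_rcu_qsbr_check(v, it->token, true); /* wait for grace period */
        free(it->elem);                         /* then run the "callback" */
        free(it);
    }
}
```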
Something else I haven't mentioned yet is that DPDK RCU enables you to partition threads into different RCU groups. So, for example, let's take ten threads, and let's say we know that threads one through five are accessing hashmap A, and only those threads. Likewise, threads five through ten are accessing hashmap B.
Thread five is a bit unique here in that it can access both hashmaps. However, let's say thread seven is a bit slow and it's accessing hashmap B and it's taking some time. So it's not going to report its quiescent state as quickly as, say, threads one through five. So if thread seven is never going to access hashmap A,
but it's the one running slow, why should grace periods become elongated for hashmap A? They don't have to be. We know that thread seven isn't ever going to access hashmap A, and it has no information about the contents of hashmap A or its elements.
And DPDK allows you to model this with these different RCU groups. However, with Userspace RCU, all of the threads are global and treated the same, so it doesn't have the context needed to provide this additional flexibility.
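One way to model such groups, sketched here, is simply to create one QSBR variable per group and have each reader register only with the variables for the structures it actually touches:

```c
#include <rte_malloc.h>
#include <rte_rcu_qsbr.h>

/* Create one QSBR variable (one "group") sized for max_threads readers. */
static struct rte_rcu_qsbr *make_group(uint32_t max_threads)
{
    size_t sz = rte_rcu_qsbr_get_memsize(max_threads);
    struct rte_rcu_qsbr *v = rte_zmalloc(NULL, sz, RTE_CACHE_LINE_SIZE);
    rte_rcu_qsbr_init(v, max_threads);
    return v;
}

/* Group A covers hashmap A's readers, group B covers hashmap B's.
 * Thread five registers with both; thread seven only with group B,
 * so its slowness never elongates grace periods on group A:
 *
 *   rte_rcu_qsbr_thread_register(group_a, tid);
 *   rte_rcu_qsbr_thread_online(group_a, tid);
 */
```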
Additionally, in our benchmarks, DPDK RCU ends up, on average, producing faster memory reclamation than Userspace RCU. This is great: faster memory reclamation generally leads to lower memory overhead and less memory fragmentation. Additionally, DPDK RCU is able to sustain this faster memory reclamation even with preemptible readers.
When a reader can be preempted, it can't report a quiescent state, which can elongate grace periods and delete-free delays. However, even in the face of this, DPDK RCU is still able to perform faster than Userspace RCU, which is great. Finally, DPDK RCU is already integrated into some DPDK data structures,
for example, rte_hash. This means that with some minor code modifications to the initialization of your data structure, you can start using DPDK RCU today and start reaping the benefits.
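For illustration, attaching a QSBR variable to an rte_hash at initialization might look roughly like this; the exact config fields and defaults are best checked against the rte_hash documentation for your DPDK version:

```c
#include <rte_hash.h>
#include <rte_rcu_qsbr.h>

/* Attach a QSBR variable to an existing lock-free hash table so the
 * table reclaims deleted entries through RCU internally. Unset fields
 * are left to the library's defaults. */
static int attach_rcu(struct rte_hash *h, struct rte_rcu_qsbr *v)
{
    struct rte_hash_rcu_config cfg = {
        .v = v,
        .mode = RTE_HASH_QSBR_MODE_DQ,  /* defer-queue reclamation mode */
    };
    return rte_hash_rcu_qsbr_add(h, &cfg);
}
```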
Now, a little bit about how I benchmarked delete-free delay. I took DPDK's lock-free hash table, populated it with 4 million elements, and allowed multiple readers and a single writer to start accessing it randomly. So the readers will be looking up random elements, and the writer will be randomly inserting and deleting random elements. In order to simulate the writer doing something else rather than just modifying this hash table, I limited it to about 5k deletes per second.
In order to reach this limit with Userspace RCU, I needed to use the call_rcu method of reclamation rather than blocking within the writer thread. This means that for Userspace RCU, there is an additional thread running. So, in order to have a bit more of an interesting comparison with DPDK,
I ran DPDK under two different configurations: one where DPDK had a background thread, where the writer thread would generate the token and then pass the token and callback information to this background thread, which would then reclaim the memory as soon as it was able to, and another where there was no background thread for DPDK.
Without the background thread, when the writer was performing a delete, it would check and see if it could reclaim memory for any previous deletes, and if it was able to, it would invoke some callbacks and reclaim the memory at that point in time. So here are some results.
On the y-axis, we have delete-free delay in milliseconds, and on the x-axis, we're varying the number of readers. In this configuration, we're giving each reader its own core, so we don't have to worry about preemption yet. And here, we can see that I have DPDK, then Userspace RCU, and then DPDK with a background thread being measured.
And we can see, wow, the performance results are great. DPDK is consistently outperforming Userspace RCU, as is DPDK with a background thread. In fact, we see that DPDK RCU is 72 to 77% faster to reclaim memory than Userspace RCU, and if we add in a background thread, it becomes 79 to 88% faster.
That's great. However, what happens if we put all of these readers on one core? When we have all these readers on one core, only one reader is able to make progress at a time, and the rest of the readers will end up being preempted. Here, we can see a bit of what I was alluding to with the poorer performance once preemption comes into the picture.
Userspace RCU, for example, goes from 6.6 milliseconds of delete-free delay all the way up to 154.7 milliseconds. That's a huge increase. And while DPDK and DPDK with a background thread also see increases as the level of preemption increases,
it's not quite as bad as Userspace RCU. In fact, DPDK RCU ends up being 41 to 61% faster to reclaim memory than Userspace RCU in this configuration, and once readers start being preempted, DPDK with a background thread and without a background thread end up performing very similarly.
However, a one-core workload isn't realistic for every workload out there today, so I also took eight readers and spread them out across various numbers of cores, ranging from one core to two, four, and eight cores. Here in the middle are the two most interesting graphs: eight readers on two cores and eight readers on four cores.
Here we have both multiple cores and some amount of preemption on each core. And as we can see, DPDK, with and without a background thread, still consistently outperforms Userspace RCU in this configuration. DPDK RCU ends up being 29 to 72% faster
to reclaim memory than Userspace RCU, and with preemption, DPDK performs similarly both with and without a background thread. An interesting conclusion from this last point is that DPDK ends up spending more time waiting for the readers to report their quiescent states than it gains from having a background thread trying to reclaim memory as fast as possible. That implies that preempted readers are certainly hampering the performance of these RCU implementations. Thank you very much for your time. I'm looking forward to hearing your questions.
And so, as such, our implementation has been tailored to fit DPDK. But the idea of tokens for RCU is broadly applicable.
So if you want to make use of our implementation, it will have to be within DPDK. But if you want to use token-based RCU, it shouldn't be too difficult to port the token implementation to other projects as well.
OK. So you mean it's inside DPDK, but we can use it with or without the rest of DPDK if needed. And the other part of the question is: is there something in the implementation which is specific to DPDK? How is it better than liburcu for DPDK?
I see. Yeah, so the main advantages over liburcu come from the token design and the greater flexibility and better performance. For those reasons, we opted to write our own RCU implementation for DPDK
rather than use liburcu. And since we were writing the implementation for DPDK, we ended up using some of DPDK's other libraries, like the ring data structure and some of the EAL functions.
If we want to use it inside DPDK, we need to have this RCU integrated in the DPDK data structures. You said it's already integrated for the hash data structure. What about others? I'm not sure about which others,
but integrating into the data structure is only one of the easier ways to use RCU. You're still able to use the main API functions directly yourself: reporting quiescent states, getting tokens with RCU start, and using RCU check
to check if the grace period for that token has ended. And we've integrated these functions into the rte_hash data structure, for example, which you can, of course, draw inspiration from.
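For illustration, direct reader-side use of the QSBR API might look like this minimal sketch, assuming a shared QSBR variable and a per-thread id:

```c
#include <rte_rcu_qsbr.h>

/* Reader thread: register once, then report a quiescent state
 * whenever it holds no references into the shared structure. */
static void reader_loop(struct rte_rcu_qsbr *v, unsigned int tid)
{
    rte_rcu_qsbr_thread_register(v, tid);
    rte_rcu_qsbr_thread_online(v, tid);

    for (;;) {
        /* ... look up / traverse the shared data structure ... */

        rte_rcu_qsbr_quiescent(v, tid); /* no references held here */
    }
}
```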
OK. There is someone asking if you plan to write a paper about this work, about this new RCU? Yes, that is a project currently underway. OK, cool. I suppose you are also discussing with the father of RCU, Mr. McKenney, if I remember well.
Yes, I believe he may have been involved with reviewing the implementation. Right. OK. OK, last question: you were talking about partitioning threads. Can you explain more about that? Sure.
So liburcu treats all threads the same, all of the reader threads the same. Whereas with DPDK RCU, you can form different groups of reader threads that are independent of each other. And so if one group ends up not performing as fast as another group, or perhaps there are more threads in it,
so synchronization takes longer, you can end up separating them entirely.