
Concurrent data structures: Fast and Space-Efficient Queues via Relaxation


Formal Metadata

Title: Concurrent data structures: Fast and Space-Efficient Queues via Relaxation
License: CC Attribution 4.0 International. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Abstract
Efficient message-passing implementations of shared data types are a vital component of practical distributed systems, enabling them to work on shared data in predictable ways, but there is a long history of results showing that many of the most useful types of access to shared data are necessarily slow. A variety of approaches attempt to circumvent these bounds, notably weakening consistency guarantees and relaxing the sequential specification of the provided data type. These trade behavioral guarantees for performance. We focus on relaxing the sequential specification of a first-in, first-out queue type, which has been shown to allow faster linearizable implementations than are possible for unrelaxed queues. The algorithms which showed these improvements in operation time tracked a complete execution history, storing complete object state at all n processes in the system, leading to n copies of every stored data element. In this paper, we consider the question of reducing the space complexity of linearizable implementations of shared data types, which provide intuitive behavior through strong consistency guarantees. We improve the existing algorithm for a relaxed queue, showing that it is possible to store only one copy of each element in a shared queue, while still having a low amortized time cost. This is one of several important steps towards making these data types practical in real world systems.
Transcript: English (auto-generated)
Hello, my name is Edward Talmage. I'm an assistant professor of computer science at Bucknell University, and I'd like to discuss my work with my student Dempsey on efficient queues, particularly distributed implementations of queues that aim for space efficiency while maintaining good time performance. The goal here is to take a data type, in this case a queue, and implement it on a message-passing system in an efficient manner. Typically "efficient" means time, but in this work we are particularly interested in space efficiency as well. What we show is that relaxation, where we take a data type and alter it in a particular way, enables more space efficiency while keeping the time efficiency that relaxation was previously shown to make possible. Let me start with some background definitions.
When I talk about a data type or abstract data type, I mean the interface that a data type provides: what operations can you invoke, and what behaviors will those operations have? "Abstract" just means we ignore implementation details and only specify the behavior. We describe that behavior in terms of invocations and responses: if I invoke an operation with a particular argument, what responses can I get back? What is valid behavior for the data type? We are looking specifically at relaxed data types, in this case a relaxed queue. In a relaxation, we take the interface specification and allow some extra behaviors that would not normally be correct. This is particularly interesting and useful when it allows multiple possible return values for the same invocation, and we'll get into why that's useful a little later. That non-determinism, or multiplicity of correct behaviors, can give us an edge in getting good efficiency out of these systems. There are many, many possible relaxations; we're going to focus on one as a proof of concept that, yes, relaxing a data type can increase the possible performance. The particular relaxation we consider is the out-of-order relaxation. For a queue we relax the dequeue; it turns out not to be helpful to relax the enqueue, because enqueue can already be implemented quickly.
We still have the same operations as usual: an enqueue, which takes one argument x and has no return value (indicated by the dash), and a dequeue, which takes no argument but returns one value. A sequence of enqueue and dequeue instances is legal if each dequeue returns not necessarily the single oldest argument of an enqueue that has not already been dequeued, but one of the k oldest, where k is the relaxation parameter. The larger k is, the more relaxed the data type and the more possible return values there are for dequeue. If there are fewer than k elements in the queue, we also allow dequeue to return the special bottom symbol normally reserved for an empty queue; note that this is possible even when there are a few elements in the queue, just fewer than k. Pictorially, what does this look like? Take a 4-out-of-order queue, so our parameter k is four. Initially the queue is empty: there are no values to return, so bottom is the only legal response. If we enqueue something, there is now one value we could return, but we could still return bottom because there are fewer than k elements. As we continue to enqueue, once we have k elements, a dequeue must return one of them. For the sake of the example, let's enqueue a few more elements, so they're all in the queue. A traditional unrelaxed dequeue would always return 1 at this point, but a relaxed dequeue can return 1, 2, 3, or 4. Say it returns 1; then the set of elements that are legal to return shifts down, because it is always the oldest four elements. If I dequeue again, I could get 2, or perhaps 3. Once I get 3, note that 6 becomes legal to dequeue, because the four oldest elements still in the queue are 2, 4, 5, and 6. This may continue: I could get 6 next, or 5 next, and so on. So this is a relaxation of dequeue where, instead of returning the single oldest element currently in the queue, it can return one of the k (in this case four) oldest elements. That is the kind of non-determinism, multiple legal return values for a given state, that we exploit.
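To make that concrete, here is a minimal sequential sketch (my own illustration, not code from the talk) of the out-of-order-k relaxed queue semantics, where a dequeue may return any of the k oldest elements, or bottom whenever fewer than k elements are present:

```python
import random
from collections import deque

BOTTOM = None  # stand-in for the special "empty" (bottom) return value

class OutOfOrderKQueue:
    """Sequential sketch of an out-of-order-k relaxed FIFO queue."""

    def __init__(self, k):
        self.k = k              # relaxation parameter
        self.items = deque()    # oldest element at the left

    def enqueue(self, x):
        # Enqueue is unchanged by the relaxation and returns nothing.
        self.items.append(x)

    def legal_returns(self):
        # Any of the k oldest not-yet-dequeued elements is legal;
        # bottom is also legal whenever fewer than k elements remain.
        legal = list(self.items)[: self.k]
        if len(self.items) < self.k:
            legal.append(BOTTOM)
        return legal

    def dequeue(self):
        # Pick nondeterministically among the legal return values.
        choice = random.choice(self.legal_returns())
        if choice is not BOTTOM:
            self.items.remove(choice)
        return choice

q = OutOfOrderKQueue(k=4)
for v in [1, 2, 3, 4, 5, 6]:
    q.enqueue(v)
print(q.dequeue())  # any of 1, 2, 3, 4 is legal (bottom is not: 6 >= k elements)
```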
So why is this useful? Well, we're trying to provide a particular behavior to a user. In user land, we're used to working with data types: we call these functions, they behave in certain ways and give us certain responses, and we're used to that kind of interface. But our actual system is a message-passing system, with lots of messages going between different processes that can cross each other in time and location, and it can be very hard to keep track of them. We want to let programmers, who are accustomed to interacting with their data in familiar ways, keep doing so, and hide the complexity of passing messages back and forth between many different processes. So we're trying to build this abstraction layer, and relaxation lets us do that more efficiently.
Now some model details. Let n be the number of processes. We use a partially synchronous model, so we have bounds on message delays: each message takes at most d time (the maximum message delay), and u is our uncertainty, so d minus u is the minimum delay; a message may take any amount of real time in that interval between when it is sent and when it arrives. We assume every process has its own clock running at the rate of real time, and the clocks are synchronized to within (1 - 1/n) times u, so as n gets large the skew between two clocks approaches u; call this skew bound epsilon. We set up our algorithm as an event-handling algorithm, where the events that can trigger behavior are an operation invocation from the user, a message arriving from another process, and a timer expiring: since we have local clocks, we can say "at this point, after this much time has elapsed, do something else."
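As a rough illustration of what such an event-handling process looks like, here is a hypothetical skeleton; the class, handler names, and the SYNC_WAIT placeholder are my own, and the actual algorithm derives its timer values from d, u, and epsilon:

```python
N = 4                      # number of processes
D = 10.0                   # maximum message delay d (placeholder value)
U = 2.0                    # delay uncertainty u: real delays lie in [D - U, D]
EPSILON = (1 - 1 / N) * U  # bound on the skew between any two local clocks
SYNC_WAIT = U + EPSILON    # placeholder "predetermined wait"; chosen so that
                           # everyone has heard about an invocation before acting

class EventHandlingProcess:
    """Skeleton of a process in the partially synchronous model."""

    def __init__(self, pid, network, clock):
        self.pid = pid
        self.network = network   # supports broadcast(sender, msg)
        self.clock = clock       # local clock, runs at the rate of real time
        self.timers = []         # (fire_time, callback) pairs

    def on_invoke(self, op, arg):
        # Event 1: the user invokes an operation at this process.
        self.network.broadcast(self.pid, (op, arg, self.pid))

    def on_message(self, msg):
        # Event 2: a message arrives; delay local execution to synchronize.
        self.set_timer(SYNC_WAIT, lambda: self.execute(msg))

    def set_timer(self, delay, callback):
        # Event 3: timers on the local clock trigger later behavior.
        self.timers.append((self.clock() + delay, callback))

    def execute(self, msg):
        raise NotImplementedError  # apply the invocation to local state
```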
We measure time complexity in terms of message-passing delays; we're not as interested in computation time, because it is small compared to the delay of passing messages between physically remote computers. We know from past work that for an unrelaxed dequeue, where every dequeue must return the single oldest element, an implementation needs d plus epsilon time, that is, the maximum message delay plus the clock skew, between when you invoke a dequeue and when it can respond. So the longer your messages take, the slower your operations are. Previously we showed that with the out-of-order-k relaxed dequeue, while the worst-case time does not improve, the amortized time goes down by almost a factor of the relaxation per node. Here k is the relaxation parameter, and k over n is roughly how much relaxation can be applied to each node: if there are k elements a dequeue may return, then k/n is how many of those elements each process can hold if we split them evenly. So for a significant amount of relaxation relative to the number of processes (not the amount of data, which is probably much larger), our amortized time can decrease well below what is possible for an unrelaxed queue.
Taking that as our starting point, the observation of this paper is that relaxation can give us good time complexity, but both of those previous algorithms, for unrelaxed and relaxed queues, had every process keep a copy of the entire queue, which is very bad space complexity. Can we improve that, and can we do so without losing the time improvements that relaxation gave us? The answer is yes: we pay a small amount of extra time, but we can get rid of the massive redundancy in space. Now, in practice you probably want some replication; you don't want to remove all redundancy from your data storage. But for the sake of exploring the bounds of what is possible, we go all the way down to no replication and show that this is still achievable and efficient; if that is possible, then some intermediate amount of replication should also be possible. So, an overview of the algorithms. We're building on the previous algorithm for relaxed queues, so we'll first talk about how that works at a high level and then get into the things we add to enable space efficiency.
Suppose we have several processes and the user at process zero says, "I want to enqueue 10." Process zero announces that enqueue operation instance to all the other processes, and because enqueue does not have to return a value, it can respond to the user right away. From the user's perspective this is a very quick interaction: they invoke and almost immediately get a response back. In the background, the rest of the system continues to update its stored copies of the structure so that everyone has the proper state. These messages take some time, and there is variance in how long they take; once a process receives a message about an invocation, it waits a predetermined amount of time to synchronize and make sure it knows about everything that has happened so far. After those timers expire, each process locally executes the invocation on its own copy of the queue, so everybody stores the 10. These local executions may happen at different real times at different processes, but we can have all processes step through the invocations in the same order, so that their states match.
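A hedged sketch of that fully replicated version, with every process applying every enqueue to its own complete copy (class name, message format, and the sync_wait parameter are illustrative placeholders):

```python
from collections import deque

class FullyReplicatedQueueProcess:
    """Sketch of the earlier algorithm: every process stores the whole queue."""

    def __init__(self, pid, network, sync_wait):
        self.pid = pid
        self.network = network
        self.sync_wait = sync_wait     # predetermined synchronization delay
        self.local_queue = deque()     # complete local copy of the shared queue

    def on_invoke_enqueue(self, x):
        # Announce the enqueue instance to all other processes, then respond
        # to the user immediately, since enqueue has no return value.
        self.network.broadcast(self.pid, ("enq", x))
        return "ack"

    def on_message(self, msg):
        # Wait long enough that every process knows about all earlier
        # invocations, so everyone applies them in the same order.
        self.set_timer(self.sync_wait, lambda: self.apply(msg))

    def apply(self, msg):
        kind, x = msg
        if kind == "enq":
            self.local_queue.append(x)  # every process stores every element

    def set_timer(self, delay, callback):
        ...  # schedule callback on the local clock (omitted in this sketch)
```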
What we want to do now is take that idea and, instead of keeping a complete copy of the queue at every process, split the queue up. We do one main thing toward that end, plus one extra optimization for common-case time performance. We separate storage into the elements that may be returned very soon and the elements that will not be returned for a while. The k oldest elements in the queue may be returned at any time; they are legal for a dequeue to return right away. For all the other elements, it will be some time before they can be returned, because other dequeues have to happen first. So we can spread them out among the processes: for those cold-storage elements, it is okay if it takes some time to access them, because we get some warning before they become legal to return. And if we move elements out of cold storage promptly, we can improve the common case, reducing the time from when I dequeue one element to when I have another element ready to dequeue.
For these two types of storage we build on the idea that processes claim elements that are legal to return. The efficiency of the previous algorithm for relaxed queues was based on splitting the k returnable elements among the n processes, so each process gets about k/n elements and can return them quickly: whenever it sees a dequeue, it knows it has claimed this element, no one else will touch it, and it can return it right away without waiting. If a process has no such elements, it has to communicate with the other processes to determine which element it can return, because it does not hold an element that it knows no one else will return. Now, if we take those claimed elements and observe that once an element is claimed no other process will touch it, then we only need to store one copy of it, at the process that claimed it, and that is exactly where it needs to be in order to be returned, so that is efficient. All of the other, stored elements we simply distribute in round-robin fashion for load balancing. You could try to do something fancier, putting one here and one over there based on who has the least, but that ends up costing more communication, more computation, and more delay, whereas if you just do round robin the worst case is minimized. If you know something about the load, perhaps you could do some balancing with statistics, but for now we focus on the worst case. What this gives us is that each process holds roughly a one-in-n share of the total data that has been put into the queue, so we get an even distribution.
The idea is straightforward, and credit goes primarily to my undergrad research student Dempsey for hashing out the details; the work from that point on was making sure the details hold up, proving correctness, and proving the complexity. So let's look in a little more detail at how this system actually works. Go back to the state of our queue with 10 elements. We rename the first k elements the claimed elements and the rest the stored elements: claimed elements may be returned by a dequeue immediately, stored elements will not be returned yet. We assign each element to a process in round-robin order. For the claimed elements, as long as they are evenly distributed the exact order does not matter; the stored elements we do need in round-robin order. We then distribute them so that only one copy of each element is stored, at a single process. So process zero has claimed element 1, which is ready to return (shown in blue), and elements 5 and 9 are stored for later; process zero may not return 5 and 9, since there is no way to know at this time who will need those elements. Similarly, process one has element 2 claimed and elements 6 and 10 stored for later.
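In terms of data layout, this means each process keeps only its claimed elements and its round-robin share of the stored elements; here is a small illustrative computation (variable names are mine) reproducing the example above:

```python
# Reproduce the example layout: n = 4 processes, relaxation k = 4,
# queue contents 1..10. The first k elements are claimed, the rest stored,
# both assigned round-robin, with exactly one copy of each element.
n, k = 4, 4
queue_contents = list(range(1, 11))

claimed = {p: [] for p in range(n)}
stored = {p: [] for p in range(n)}
for i, x in enumerate(queue_contents):
    owner = i % n                  # round-robin assignment
    if i < k:
        claimed[owner].append(x)   # legal for this process to dequeue now
    else:
        stored[owner].append(x)    # held in "cold storage" for later restocks

print(claimed[0], stored[0])   # [1] [5, 9]  -- matches process zero above
print(claimed[1], stored[1])   # [2] [6, 10] -- matches process one above
```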
Now let's go through the same scenario as before. We get an enqueue; we respond quickly and tell everyone, "I'm trying to enqueue 11." We have the same delays, but when we get to the point where each process should execute the operation on its local copy, most of the processes will not actually store the 11, because we only need one copy of it, and we assign that copy in round-robin order: process zero got 9, process one got 10, so process two should get the next one. Processes zero, one, and three therefore do nothing here. Well, not technically nothing: we have to count how many enqueue instances there have been, modulo n, and that counter lets us figure out, round robin, who gets the next element. We will use it again for dequeues. But maintaining a single counter modulo n is vastly more efficient than maintaining an entire copy of the queue. Then, when process two finishes its synchronization delay, it locally executes the enqueue, because it knows it is the next process to store, and it stores element 11 in its local memory.
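So on the enqueue side, the per-process state is essentially one counter plus this process's own share of elements; a hedged sketch of that handler (method and field names are my own):

```python
class SpaceEfficientQueueProcess:
    """Sketch of enqueue handling in the space-efficient algorithm."""

    def __init__(self, pid, n):
        self.pid = pid
        self.n = n
        self.enq_count = 0   # counts enqueue instances; only needed modulo n
        self.claimed = []    # elements this process may return immediately
        self.stored = []     # this process's round-robin share of cold storage

    def execute_enqueue(self, x):
        # Called after the synchronization timer for an announced enqueue.
        owner = self.enq_count % self.n   # round robin: who stores this element
        self.enq_count += 1
        if owner == self.pid:
            self.stored.append(x)         # only one process keeps a copy
        # Every other process does essentially nothing beyond the counter update.
```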
Now suppose we invoke a dequeue, in this example at process two, and announce it to everyone. The invoking process, because it has a claimed element, can return immediately without waiting for communication. We then go through a similar procedure: we send the messages out, and once a process receives a message about the new invocation, it waits a predetermined amount of time to synchronize and then locally executes it. Process two actually executes the dequeue, because it needs to remove 3 from its local copy. Process one does nothing, and process three, when its timer finishes, does nothing either. Process zero notes that it holds the first element stored for later, which here is 5, so it removes 5 from its long-term storage and sends it to restock process two's claimed elements. This does add delay, but note that it is all happening in the background: the dequeue invocation has already received its response, so from the user's perspective it was a very quick operation. In the background, process zero sends element 5 back to process two; in the meantime, process three just counts that there was another dequeue. Process two receives the restock, waits to synchronize, and then stores 5 as a new element in its claimed set. Now, if process two gets another dequeue, it will be able to return 5 promptly.
If there had been another dequeue in between, before 5 was placed in process two's claimed set, then process two would have had to wait and could not have responded quickly. So we have a differentiation between two kinds of dequeues. If a dequeue arrives at a process that currently has a claimed element, it can respond almost immediately. If it arrives at a process that has no claimed elements, because it has run out, then there is a delay while that process tells everyone "I need an element" and waits to receive one from whichever node has the appropriate element in storage. We do not want to just return whatever stored element is held locally, because it may not be among the first k; if, for example, a process's first stored element were the 8 at process three, we would want to hand out 5 or 6 first. So the slow dequeue takes two message delays, but the fast dequeue, which is most of the time (half the time in this small example, and the predominant case once k is larger), takes just one clock skew, to synchronize and make sure state is maintained.
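Putting the dequeue side together, a rough sketch of the fast path, the slow path, and the restocking rule might look like this (continuing the class above; broadcast, send, wait_for_restock, and holds_next_stored_element are hypothetical helpers, and the bookkeeping is simplified):

```python
    # Continuing the SpaceEfficientQueueProcess sketch above.

    def on_invoke_dequeue(self):
        if self.claimed:
            # Fast path: a claimed element can be returned almost immediately,
            # costing roughly one clock skew (epsilon), since no other process
            # will ever touch this element.
            value = self.claimed.pop(0)
            self.broadcast(("deq", self.pid))    # others just update counters
            return value
        # Slow path: out of claimed elements; announce the need and wait for
        # whoever holds an appropriate stored element (roughly 2d in total).
        self.broadcast(("deq-need", self.pid))
        return self.wait_for_restock()

    def on_dequeue_observed(self, dequeuer):
        # Aggressive restocking: each dequeue makes one more stored element
        # legal to return, so the process holding the next stored element
        # (tracked with another counter modulo n) ships it to the dequeuer,
        # where it becomes a claimed element.
        if self.holds_next_stored_element():
            self.send(dequeuer, ("restock", self.stored.pop(0)))
```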
This type of restocking, where as soon as process zero realizes that process two needs a new element because it has dequeued one, it sends a replacement, is what we refer to in this presentation as aggressive restocking. If you think back to the out-of-order example, every time you dequeue an element, the next element in the queue joins the first k and becomes legal to return. So every time we dequeue, whoever holds the oldest stored element removes it from storage and sends it to the dequeuing process as its new claimed element. Our round-robin ordering is important here: as we count dequeues, we can keep track of who holds the next element, and again it is just an integer modulo n, so it does not take much space to keep track of who will be providing the next restock element and when. The upshot is that if I do not invoke too many dequeues too fast at one process, everything stays fast. If I invoke a bunch of dequeues in a small time frame, I will run out of claimed elements and have a slow dequeue to go get more. But if I invoke one, wait a little, and invoke another, a fairly normal load as opposed to all in a rush, then by the time I run out of claimed elements the first dequeue will have triggered a restock and I will have more claimed elements; and if I never run out of claimed elements, every individual dequeue is fast. So in the common case, if I am not just trying to pull as much data out of the queue as possible at one particular node, each instance is quick and the rest of my computation is not delayed waiting for the queue.
Then we get to our results, and they are what you would intuitively expect. If t is the maximum number of elements in the queue at any one time, then each process needs only about t/n space for its share of the stored elements: if all processes are out of claimed elements, then all the elements are in storage and evenly distributed, so you get t/n, plus maybe one if the number of elements is not evenly divisible by n. A process may also hold some claimed elements, but that is limited so that no process ever claims more than an even share of the k, to keep the worst case low. Again, if you have external knowledge of the load, if you know this process will have more work than that process, you could adjust this so that the busier process can claim more elements, but here we keep everything level for the worst-case analysis. The other way to phrase this bound is that at any time only one process is storing a copy of each element, not counting message buffers, since we do have to broadcast each element to everyone, but most processes simply discard it; in the storage where processes actually keep elements, there is only one copy of each element system-wide at any time.
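Phrased as a formula (my paraphrase of the bound just described), the per-process space is roughly

\[
\text{space per process} \;\le\; \left\lceil \tfrac{t}{n} \right\rceil \ (\text{stored share}) \;+\; \left\lceil \tfrac{k}{n} \right\rceil \ (\text{claimed share}),
\]

so, summed over all n processes, the system stores about t elements in total, one copy each, compared with roughly n times t under full replication.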
For time complexity, we look at amortized time: because we have both fast and slow dequeues, amortized time gives a better picture of what is going on. The 2d term is the cost of a slow dequeue, which sends a message out and then receives the restock back before it can return. We divide by roughly k/n, because that is how many fast dequeues we get per slow one. There is also an epsilon cost for each fast dequeue, but that k/n times epsilon in the numerator roughly cancels against the k/n in the denominator, contributing about epsilon per operation; the denominator is really k/n plus 1, since we have k/n fast dequeues plus one slow dequeue, and we divide the total cost by the total number of dequeues. The important thing to note is that with a relatively small k, say four times the number of processes, this works out to be distinctly less than the d plus epsilon that was the best possible performance for a non-relaxed queue.
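Written out (again my paraphrase of the arithmetic in the talk), the worst-case amortized dequeue cost is roughly

\[
\frac{2d + \tfrac{k}{n}\,\epsilon}{\tfrac{k}{n} + 1}
\qquad \text{(one slow dequeue at } 2d \text{ plus } \tfrac{k}{n} \text{ fast dequeues at } \epsilon \text{ each, averaged over } \tfrac{k}{n}+1 \text{ operations),}
\]

and with k = 4n, for example, this is about (2d + 4\epsilon)/5, which is clearly below the d + \epsilon bound for an unrelaxed queue.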
So we take our space complexity down from the total number of elements times n, because every process kept a copy, to just the total number of elements, which is the minimum possible, while still having time complexity below what is possible without relaxation. The 2 here is highlighted in orange because with full replication we had just a single d (plus the uncertainty term), since we did not have to wait for elements to be sent back as restocks; that small factor of two, still divided by our k/n, is a cost we have to pay. One thing to note is that other, somewhat more complicated and more useful relaxations already have a 2d in their amortized cost, so our hope as we continue this work is that this result will not affect the time performance for those relaxations at all, while reducing the amount of space required.
The conclusion is that relaxing data types does not just give good time performance; it gives good time performance even without replicating the data. Future work would look at how much replication is desirable, though that gets into fault tolerance, which we did not deal with in this version. But we know we can have as little or as much replication as we want, with at most a factor-of-two impact on time, and it is still more efficient than is possible without relaxation. Note also that it is really only the edge cases, where one process has a really heavy load, that even hit that worst-case bound; in the common case every dequeue will be very quick, which again is something you cannot do without relaxation, because if every dequeue at every process is trying to get the same single element, they all need to communicate: there is no fast case, they are all slow.
There is a lot of future work possible. We could take this work on space complexity and keep going: do the same thing for other relaxations, and expand to other data types. One direction is to avoid sending every element to every process only to have most of them discard the message; it would be nice if we could determine in advance which processes to send it to. In general I do not think that is knowable, because we do not know what is happening elsewhere in the system until we receive messages, so we would wait for more information, but by the time we acted on it that information would be out of date, an endless cycle. We may, however, be able to do some sort of approximate or statistical analysis, and if anyone is interested in collaborating with me on that, I would love more expertise in that area, to get a space bound that is perhaps not quite as good but still improved, while not sending as many messages or as much data in the messages.
More broadly, this is part of a larger framework that tries to use relaxation as a tool to make data type implementations more efficient. Past work showed that relaxed implementations can be time efficient; here we show that we can also get space efficiency to whatever degree we want. I am currently working with another student on adding some fault tolerance, to see whether relaxation can help with that at all, and we would like to reduce some of the synchrony assumptions: we are in a fairly well-behaved partially synchronous model with known bounds on everything, so what happens if the system is asynchronous? The goal of all this is to move toward real-world systems, so that we could actually use these data types. There are a couple of avenues there: direct fault tolerance, and what happens if our assumptions on message delays are wrong, how can we handle or tolerate that? But the goal would eventually be to put these into real systems, so that applications do not have to wait as long to interact with their data.
We have proven theoretically that you can have better performance, and we would like to see that actually happen; again, I would be happy to collaborate with anyone interested in building some of these into a system. I would like to thank my co-author Dempsey Wade, an undergraduate student here at Bucknell, who did a lot of the work; this was largely his idea, while I did most of the framing and presentation. I would also like to thank Jimmy Way and Shane Starrett, who are in my theory of computation class this semester and helped find some references for the final paper. If you have any questions, follow-ups, or other ideas, please reach out to me by email; I would be happy to discuss further. Thank you.