Safety: off --- How not to shoot yourself in the foot with C++ atomics
Formal Metadata
Title: Safety: off --- How not to shoot yourself in the foot with C++ atomics
Series: NDC Oslo 2016 (part 24 of 96)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported
DOI: 10.5446/51840
Transcript: English(auto-generated)
00:04
OK, so if you're using C++ atomics, or atomics in any language, then you're really going for the low level. And it's far too easy to do something wrong, shoot yourself in the foot. So we're going to try and show some sort of guidelines that are going to
00:22
help with approaches when you're using atomics to try and avoid shooting yourself in the foot. So to start with, we're going to go through what the types are that we've got, what operations you can do in C++. And then we're going to do some worked examples and then lead up to these guidelines by the end.
00:46
So I'm just going to say, before we actually get into things, that fundamentally, most of the time, you're using atomic operations rather than locks because you're hoping that it's an optimization. And like any other optimization you're doing in your system, you want to profile.
01:04
Particularly in this case, because if you're skipping to the lower level and using lock-free data structures and atomics, then there's so much scope for getting things wrong in a way that won't manifest in tests because it's multi-threaded code, that you really, really want to be sure that it's going to be a benefit.
01:22
Because the cost in terms of maintenance complexity and code complexity is so high that you need to be sure it's worth it. So when you're profiling, you need to think about the sort of things that you care about for your application. Is it the throughput, the total number of data items processed per unit time?
01:46
Or is it the latency, how long it takes between when you get some input and when you get some output? Or is it something else, some measure that's specific to your application? So you need to decide what your measure is and then measure it with the simple, reliable mechanism,
02:06
or easier to use mechanism of using locks and so forth. Then implement your hopeful optimization using atomics and then profile again and check that you have actually improved the thing you're trying to do.
02:28
So the atomic types we've got in C++. Fundamentally, we're looking at the standard atomic of T class template. And for this, you can store any type which is bitwise copyable.
02:45
If you can use memcpy on your type, then you can stick it in a standard atomic. Whether it's a struct, a plain value, whatever, you can do it. If it's one of the built-in integer types, then you get an extra set of operations on it.
03:05
But you get the basic operations, whatever your type is. And one of the operations you might want to do is a compare and exchange. And for this, then we also need it to be bitwise comparable.
03:20
So if you can use memcmp on your type to compare values, then you can do your compare exchange. And of course, whatever type you put in atomic of T, it might not be lock-free on any given platform. It might actually use an internal mutex under the hood.
03:41
So this is one of the reasons why you might want to profile, because you're replacing a nice high-level mutex that you're in control of with a low-level one that the compiler's in control of, and you just made your code more complex. So you don't want to do that if that turns out to be the case.
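A minimal sketch of that point (the struct and names here are mine, not from the talk): any trivially copyable type can go into std::atomic<T>, but the library is allowed to implement it with an internal lock, so it is worth checking.

```cpp
#include <atomic>
#include <cstdio>

struct Point { int x; int y; };   // bitwise copyable and comparable

int main() {
    std::atomic<Point> p{Point{1, 2}};
    Point expected{1, 2};
    // Bitwise compare-and-swap of the whole struct in one atomic step.
    p.compare_exchange_strong(expected, Point{3, 4});
    // May be false: this atomic<Point> may be using a hidden lock.
    std::printf("lock-free here? %d\n", static_cast<int>(p.is_lock_free()));
}
```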
04:01
There is only one type in the C++ standard that is guaranteed to be lock-free, and that is standard atomic flag, which is the most cut-down Boolean flag you can possibly imagine, and it only has two operations on it. So we'll have a look at it in just a sec. But it is sufficient to build a spin lock
04:21
and therefore everything else on top of it. In the concurrency TS, we've got two new atomic types, atomic shared pointer and atomic weak pointer. Shared pointer and weak pointer you can't copy with memcpy or compare with memcmp, because you've got to deal with reference counting.
04:40
So we've got special types to deal with those. Again, they may not be lock-free in any given implementation, so you need to hope, if you want to use these, you need to check. But if it does turn out to be lock-free, it can greatly simplify your code.
05:03
So why are we talking about atomic? Well, the point is that it's a simple irreducible operation. So operations on our atomic types can't be divided up. Any given thread will only see either before the operation or after the operation. There's no possible middle ground, no scope for races on the actual value,
05:25
no data races, no undefined behavior for your atomic types. So what operations have we got? Well, for everything, we can load our values, we can store new values,
05:40
we can exchange, swap one value for another, and if they're bitwise comparable, we can do compare exchange, which is I've got an existing value, and then I've got an expected value, what I think it might be, and a new value I wish to put in. And it will compare our expectation with what's actually there, and if our expectation matches what's actually in there,
06:03
it will exchange our new value in. And this is a single atomic operation, so you don't have to worry about it doing the compare and then somebody slipping in and changing the values in the meantime. Again, that requires that it's a bitwise comparable type.
06:21
There are two variants, weak and strong. The strong will only fail if the values were different, if the comparison failed, whereas the weak might fail spuriously if you're running on a weakly-ordered architecture like ARM or Power and the OS scheduling meant that the operation got interrupted. Things like x86 have a single instruction for this,
06:47
so you can say, I want to do an atomic compare exchange, bam, the compiler just does it, the processor just does it, whereas other architectures don't, and because it's now multiple instructions, your thread might get suspended by the scheduler
07:01
or various other effects across multi-cores might cause the operation to fail by the time it hits the final instruction, so the compare exchange strong then has to loop, whereas the compare exchange weak just says, well, I failed for no apparent reason, so you will have to then decide whether you wish to try again or adjust your values before you try again.
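That is why compare_exchange_weak normally sits in a retry loop; a small sketch (the update-a-maximum example is mine): on failure, including a spurious failure, the expected value is reloaded for you.

```cpp
#include <atomic>

std::atomic<int> maximum{0};

void update_maximum(int value) {
    int old = maximum.load();
    // If the compare fails (or fails spuriously), `old` is reloaded with the
    // current value, and we re-check whether we still need to store at all.
    while (old < value && !maximum.compare_exchange_weak(old, value)) {
    }
}
```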
07:22
We have assignment. Now, most types, when you do assignment, you're assigning from the same type, but with atomics, you're assigning from the contained type. So for atomic of T, you're assigning from a T; atomic of int, you're assigning from an int. You can't copy-assign from another atomic of int, because that would now require an atomic operation across two values,
07:44
and most implementations can't manage to do that because they're in separate places in memory. So you need to explicitly say, I want to load from that one and then assign to the other one.
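In code, roughly (a sketch, not slide code):

```cpp
#include <atomic>

std::atomic<int> a{1};
std::atomic<int> b{2};

void copy_over() {
    // b = a;             // does not compile: atomics are not copy-assignable
    int value = a.load(); // one atomic operation...
    b.store(value);       // ...then another; the pair as a whole is not atomic
    b = a.load();         // equivalent shorthand, assigning from the contained int
}
```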
08:06
If we've got integral types or pointers, we can do arithmetic. So you can now do a fetch-and-add, which is: add a value atomically to the value that's there and return what the old value was,
08:20
or a subtract, which is the same. Then we've got operators that do this. We've got ++ and --, both prefix and postfix variants, and += and -=. So you can do atomic operations. These are shorthands for doing a compare exchange with the relevant values.
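A quick sketch of those forms (variable names are mine):

```cpp
#include <atomic>

std::atomic<int> n{0};
int buffer[8];
std::atomic<int*> p{buffer};

void examples() {
    int old = n.fetch_add(5);   // atomically add 5, return the previous value
    (void)old;
    n += 3;                     // operator forms, always sequentially consistent
    ++n;
    --n;
    p.fetch_add(1);             // pointers get fetch_add / fetch_sub as well
}
```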
08:44
If you've got an integral type rather than a pointer, then you can also do bitwise operations, so you've got an and, an or, an exclusive or, and then the operator variants. Yes.
09:01
They'd all be equivalent to the strong compare exchange. So it's possible that on any given platform there might be a nice atomic operation to do it. So if the compiler's being smart on an x86, if you're not using the return value, then fetch add might use an INC instruction, a locked variant of the INC instruction, to increment the value.
09:21
But if you need the return value, it can't do that because it actually has to do the exchange. And then of course, on our atomic flag, I said we've only got two operations. There is test and set or clear. You can't just set it. You can't test and clear it. You can only test and set.
09:43
And if you want to test what the value is, you have to set it. There is no query it without setting. So there's only these two. It's the bare minimum that was guaranteed to be supported across all the platforms we wanted to support C++ 11 on. And it's enough to build the higher level structures with an internal lock.
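For instance, the spin lock mentioned a moment ago needs nothing more than those two operations; a minimal sketch (mine, not the slide code):

```cpp
#include <atomic>

class spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set returns the previous value: keep spinning while it was
        // already set, i.e. while somebody else still holds the lock.
        while (flag.test_and_set(std::memory_order_acquire)) {
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};
```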
10:03
So of course, all the rest might not be lock-free, but these ones are. There is an is_lock_free member function on your atomic types, which you can use at run time, at startup in your application, to test whether any given atomic of T is lock-free on the current running platform.
10:24
There are also some macros that you can use at compile time, which will tell you whether it's always lock-free, never lock-free, or sometimes. The reason it might be sometimes is because you can compile a single application, a single binary that you might then run on multiple different CPUs,
10:43
and depending on the properties of the CPU. So for example, old x86 on 64-bit platforms don't have a 16-byte compare-exchange, whereas new ones do. So you can't then have 16-byte atomics on an old version of an x86 platform,
11:02
but you can on the newer ones. And when I say newer, I mean anything that can run Windows 8. So we're not talking epically new, but it is still something there are, you know, systems in the wild that don't have that operation.
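A sketch of both kinds of query (the macros expand to 0 for never lock-free, 1 for sometimes, 2 for always; the 16-byte struct is just my illustration of the point about the 16-byte compare-exchange):

```cpp
#include <atomic>
#include <cstdio>

struct two_pointers { void* a; void* b; };   // 16 bytes on a 64-bit platform

int main() {
    std::printf("ATOMIC_INT_LOCK_FREE   = %d\n", ATOMIC_INT_LOCK_FREE);
    std::printf("ATOMIC_LLONG_LOCK_FREE = %d\n", ATOMIC_LLONG_LOCK_FREE);
    std::atomic<two_pointers> tp{};
    // Depends on the library and on whether this CPU has a 16-byte CAS.
    std::printf("16-byte atomic lock-free here? %d\n",
                static_cast<int>(tp.is_lock_free()));
}
```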
11:27
Okay, so when you're doing your atomic operations, then you can put constraints on them with the memory ordering. Now, there are six different values that you can put. The default, if you don't specify anything,
11:41
and which you get if you use any of the operator variants (you don't have any choice there, because with operators you can't add extra parameters), is always sequentially consistent, which is the top one, the strongest ordering constraint. If you use one of the function operations, then you can add additional ordering constraints, which are increasingly relaxed in the requirements that they put on you.
12:03
So you can do acquire and release. So acquire is for load operations to say, I want to be able to see changes that were made by the threads to other variables that I'm acquiring the synchronization through this load. Release is the store part of that. So when I do a store with release,
12:22
then that's saying the changes that I've made to other variables, I'm publishing to anyone that's synchronizing through an acquire on this value. Acquire release is the read-modify-write operation version. So if you're doing a compare-exchange, which is both a load and a store, then you use acquire-release to do both halves.
12:42
Though you can also say, I'm only synchronizing one way: I only want a release, or only an acquire. And then relaxed, which is for experts only, is: don't do any synchronization other than on this specific value. So the specific value that I've got my atomic of T that I'm using, then that value
13:01
is going to be safely accessed, but I'm not going to put any ordering with regards to what other operations are visible to other threads. So when you put these on your operations, it controls precisely the details of the instruction that it generates. So whether or not there might be additional fence operations
13:22
on x86, then if you do a sequentially consistent store, then it will use an exchange operation rather than a plain store in order to guarantee that the global synchronization properties are met.
13:40
Okay, now down here in tiny-type, which hopefully you can't actually read, it says memory order consume. Now, this is a special variant of memory order acquire which only synchronizes some values, and there are probably only three people who know how this works, and they keep disagreeing on what it actually does. So Detlev's shaking his head.
14:01
I think he thinks there might not be anybody who knows how it works. Well, if it works how some people think it does, then it would also be applicable to some systems other than Alpha, but there is disagreement, so maybe nobody actually knows.
14:21
So really, really, you don't want to use it. But it's there just to confuse you. Well, if it is, then that's great. Okay, so if you're doing sequentially consistent ordering,
14:42
then sequentially consistent operations, they work how you might think things work, which is really nice. You can, in your mental image, think, I've got all these operations across all these threads, and I'm going to rearrange them into this happens first,
15:01
then that one over there, then that one over there, then this one down here in some nice order. And that's what we get with sequentially consistent operations is some single global total order of all our operations across all our threads. And obviously, you don't know in advance what that order is going to be, but when you look at your variable operations and you look at the values you get,
15:21
you will be able to construct that order. The compiler and the processor and the runtime work together to ensure that there is such an order, and then you can reason about it. So if we've got four threads, we've got one thread over here that is storing a value to some variable x,
15:40
and another thread over there that's storing a value to some variable y. Then we've got two threads in the middle that are reading x and y, but in different orders. So this one reads x and finds that it's value one. So it must be after this store. So that's great. This thread reads y and sees that it's value one,
16:02
and that must be after that store. So that's great. But over here, the second order, we're now loading from x. If we load zero here, now we've got a single global total order, so this load here must be before that, before the store to x in that order, because if you've only got the one store,
16:22
then if you read the value that was there before it, well, that read must have happened before the store. And so then we can think, well, we've got a single global total order, so we can follow the arrows around. We've got our store to y is before our load from y, which is before our load from x, which is before our store to x,
16:41
which is before our load of x. So when we load y over here, we must get one. If this thread reads y is one, x is zero, we must get y is one if we get x is one over here. If you've got single, sequentially consistent operations, there is no other way that this can work.
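Here is roughly that four-thread example as code (a sketch; the loaded values are pulled out into globals so they can be checked after joining). With the default sequentially consistent ordering, the mixed outcome just described is impossible.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

int main() {
    std::thread a([] { x.store(1); });
    std::thread b([] { y.store(1); });
    std::thread c([] { r1 = x.load(); r2 = y.load(); });
    std::thread d([] { r3 = y.load(); r4 = x.load(); });
    a.join(); b.join(); c.join(); d.join();
    // With seq_cst there is one global total order of all these operations,
    // so this combination can never happen (with acquire/release it can).
    assert(!(r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0));
}
```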
17:05
So if you're doing release-acquire synchronization, then the synchronization is a bit more limited. It works on specific variables and the operations that are visible to given threads.
17:21
So there's no longer this global single total order, but a specific set of operations and things that were visible to one thread become visible to another. So the specific case that we will start is if we've got a store to some variable x. Now this one doesn't have to be atomic
17:41
for this synchronization to work. We've got a store to some variable x, and then we've got a store to some atomic variable y with a release operation, the memory-order release. On this other thread, we've got a load and we're reading that value from y
18:01
with a memory-order-acquire. Now if the value that we see is one, is the value that we stored over here, then we have a synchronization relationship here. If we don't see the value that was written, then we haven't got that synchronization. It's not there.
18:20
And so then you can subsequently, you will see this value was stored over here. It was before the store with the release. And so when you do the load with acquire, then a read that's done afterwards will see the value from up here. You can follow the arrows. And you get the value that you expected.
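That store/load pair is the classic message-passing pattern; a small sketch of it (names are mine):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                // deliberately not atomic
std::atomic<int> flag{0};

void writer() {
    payload = 42;                               // written before the release store
    flag.store(1, std::memory_order_release);   // publish
}

void reader() {
    while (flag.load(std::memory_order_acquire) != 1) {
        // spin until we see the value stored with release
    }
    assert(payload == 42);   // guaranteed: acquire synchronized with the release
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}
```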
18:45
The difference from sequential consistency is that unrelated reads do not synchronize. So we've got our same four threads as we had before. We've got a store to x and a store to y. We've got two threads doing load of x and load of y in different orders.
19:02
But this time, we're doing release operations on the store and acquire on the loads. If we get our load of x is zero over here, it doesn't now have to be a global total order. So from the point of view of this thread here, the store to x hasn't happened yet.
19:20
But from the point of view of this thread over here, it doesn't care what that thread thinks. It thinks the store to x has happened, but the store to y hasn't. So you can now, whereas before, we had that if we get y is one, x is zero, x is one means we must have y is one over here.
19:42
Now we're allowed to have y is zero. Now x is one, y is zero. And y is one, x is zero. So you can get inconsistency from the views between threads. So you really need to think about what it is that you're trying to synchronize if you're going to relax the operations even from sequentially consistent to just acquire release.
20:10
Of course, if you're using relaxed operations, then anything can happen. Even with just the two threads, you store to x and store to y. And if you load from y with relaxed ordering,
20:22
then when you load from x, in this case, if x wasn't atomic, there would be a race. So assume x is atomic. Then you can get any value from x because there's no ordering. The store to y, the load from y, doesn't apply any ordering at all.
20:46
We've also got fences. Two kinds of fences. There's a thread fence and a signal fence. Now the thread fence is what one might always think of as a fence, and the signal fence is specifically for synchronizing
21:01
between a thread and a signal handler in the same thread, which for most purposes you can probably completely ignore. I've just put it on here for completeness. If you're doing signal handlers and using atomic operations, then you can read up about signal fence, but otherwise, just ignore it.
21:22
And in C++, whereas if you're looking at processor documentation, then fences might have global properties. In C++, then specifically, fences modify the ordering constraints on the surrounding atomic operations.
21:41
So if we've got a load with a relaxed memory order, and then we apply a thread fence with acquire, then it is as if we've said load with memory order acquire and no fence. Whether the processor generates the same instructions for the two cases,
22:01
it may or may not. The compiler might decide that actually, if you specify it explicitly with a fence, you will get different instructions. So it's not completely the same, but it can implement it identically, if you can see that that's what it's doing. And likewise, if you have a release fence before a relaxed store, then it's equivalent to a memory order release on the store.
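A sketch of that equivalence (the flag variable is mine):

```cpp
#include <atomic>

std::atomic<int> ready{0};

int load_then_acquire_fence() {
    int v = ready.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);  // as if the load were acquire
    return v;
}

void release_fence_then_store() {
    std::atomic_thread_fence(std::memory_order_release);  // as if the store were release
    ready.store(1, std::memory_order_relaxed);
}
```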
22:28
I would say mostly you don't need to use the fences, but there are a few cases, particularly if you think there's only conditional synchronization I want to do. So you might only conditionally do your fences.
22:41
For completeness, an acquire-release fence is both an acquire fence and a release fence. So it will deal with the loads before it and the stores after it. And sequentially consistent fences, they fit in your single total order of sequentially consistent operations.
23:02
So they provide a little bit more of an ordering constraint, but it can be quite hard to make it work right, and if you end up really relying on where that fits in your ordering list, then you're probably doing something wrong.
23:21
It's not necessarily a good plan, but it is there. So let's look at some lock-free examples. Why would you use atomics? Obviously, the simple case you can use atomic is you've got a single variable, like a bool variable that you're protecting with a lock,
23:42
and you say, well, I can get rid of that lock. I'll just have an atomic bool instead. So if you were in Andre's talk a short time ago, then rather than having synchronized of an int or synchronized of a bool, then you could just use atomic of an int or atomic of a bool. But if you're doing larger scope things,
24:02
then we talk about lock-free code, lock-free data structures. So we've got three sorts of guarantees that the researchers like to talk about with your lock-free things. First is obstruction-free, which is that if we've got lots of threads
24:22
accessing our data structure, and we pause, the OS decides that they're all going to stop apart from our one. Maybe we're on a single core machine. We're using lock-free data structures on a single core machine. So everything stops except this one. Can it finish its operation? And if it can, if it can finish in a bounded number of steps,
24:41
it's obstruction-free. And that doesn't matter regardless where in their operations the other threads were suspended. If you want to step up on that, and you say, okay, we're not going to suspend any of our threads. They're all going to access.
25:01
And so they're all going to try and compete on our data structure. But they might get in each other's way. So we've got compare-exchange. Compare-exchange might fail because another thread has swapped in the value you were trying to solve, to try and put in, so then you have to loop. And so if you've got multiple threads
25:22
and within a bounded number of steps one of them's going to actually get to the end, then it's now a lock-free data structure. But you don't know. Any given thread might have to keep looping. It might be really unlucky. And that one thread will have to keep looping a million times whilst a million other threads
25:40
get all the way through their operations without interruption. But that's still lock-free. And that's the most common guarantee you're going to see in lock-free code because it's the one that ideally you're going to want because most of the time your other threads are not going to be paused. They're not going to be stopped by the scheduler.
26:00
So obstruction-free will mean that you can actually end up with things getting in each other's way; you can get livelock. Whereas the strongest guarantee at the bottom, wait-free, is really hard to get. Wait-free essentially means there are no unbounded loops in your code. And every thread that is accessing the data structure
26:20
has to be able to get to the end in a known upper bound on number of steps. It might be that if it's the only thread accessing the data structure it can get through in three steps. But if there's other threads then it now takes 20. But that's okay because we know there's a top limit of 20. But if there's no known top limit, then it's not wait-free.
26:42
And if you're doing real-time stuff where you've got hard time boundaries then you need wait-free operations because then you can know that my operation is going to complete and within a certain time quota that I can then put into all my calculations of whether or not I can do things.
27:04
So in order to demonstrate some examples, let's talk about queues. Queues are a core facility for communication between threads. There might be many types. You might have single producer, single consumer queues, multiple producers and single consumers,
27:22
multiple producers and multiple consumers, or the other way, one producer and then loads of threads reading values off the other end. You might have bounded queues where there's a maximum number of elements that can be in flight in the queue at a time, or unbounded queues where you just keep allocating off the heap, and as long as you've got heap memory
27:41
then there's no bound. You might have various properties, FIFO queues, LIFO queues, priority queues, unordered. And then again, they might be intrusive and non-intrusive so if you want to put it in a queue you've got to actually do something intrusive to your data structure or otherwise.
28:02
And they're core to multi-threaded stuff and they're quite good for demonstrating lots of the issues. So let's start with a lock-based unbounded multi-producer, multi-consumer FIFO queue. And to do this, we're going to wrap a standard queue object
28:21
with a mutex and a condition variable. So to push back, we just lock the mutex, push our value onto our wrapped queue and notify anyone who's waiting that there is now a value. And if we're popping off the front
28:41
then we can lock our mutex, we can wait on our condition variable using our lock and we're checking for the case that the wrapped queue is not empty. And when the wrapped queue is not empty we can then take the front value,
29:01
remove it from the queue and return.
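That wrapper looks roughly like this (a sketch of the description just given, not the slide code):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class locked_queue {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;

public:
    void push_back(T value) {
        {
            std::lock_guard<std::mutex> lock(m);
            q.push(std::move(value));
        }
        cv.notify_one();                    // wake anyone waiting for a value
    }

    T pop_front() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); });   // wait until not empty
        T value = std::move(q.front());
        q.pop();
        return value;
    }
};
```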
29:21
Okay, so can we do this lock-free? Well, let's start simple. We're going to have one producer thread, one consumer thread, and bounded operations, so there's only a specified number of items in our queue; if we get too many then it's full and somebody has to wait. And then we're going to assume that we can copy our values without exceptions, because that just makes things easier. And if you look, a queue of int or a queue of pointers, that is the case, so it's not a daft assumption.
29:43
Okay, so we're going to have an array with some entries. Our entries, and this is basically the complex bit at the top, are aligned storage: we create a buffer that's big enough, and appropriately aligned, to hold the values that we want to put in our queue,
30:01
but we're not going to put the values there yet, it's just some storage. And then we've got the atomic bool to say is it initialized or not, is there actually a value in this given entry? So we're going to push values on, then we're going to have a position where we're pushing.
30:23
Now we've said that there's only one thread that's pushing, so this doesn't need to be atomic. So we take the current value, we find the entry in the buffer, and if it's already initialized then we're going to throw, because the queue is full; it means we've wrapped all the way around
30:41
and the values haven't been taken out yet by the consuming thread. If it wasn't full then we're going to increment the position, but with a wrap-around, and then we populate the value and say yeah, this entry's ready.
31:02
So just an aside, this thread doesn't actually do any waiting, it just says if the queue's full then you're going to throw, which means that if you're trying to push on a queue that's full then you've got to do something above it
31:20
because you could catch the exception and try again, but then you end up essentially with a busy wait loop. So we'd like to avoid those if possible, but if there is a danger of the queue being full because your producer-consumer threads are running at different speeds, then you need to think about waiting.
31:41
The only case where you really want to have a sort of loop is if you've got a compare-exchange-weak and so you're just looping around and you actually want that loop to be really small most of the time because you don't want it to be interrupted so you want to make sure that you complete your operations. But otherwise you need to use an external wait mechanism,
32:00
external to what we're showing here, like a condition variable, which does add complexity but you might be able to use something that's a bit lighter weight depending on your platform. Okay, so then skip over to removing values. Again, we've got a non-atomic position that we're popping from,
32:25
so we're going to check we're initialized in the buffer and if the current entry doesn't have one, then the queue is empty, so we're going to throw. Otherwise, we're going to cast our storage to get a pointer to the T value that's there,
32:45
take the value out, destroy the one that's stored, clear the flag, and move the pointer. It's nice and straightforward, but it's single producer, single consumer.
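Putting those steps together, here is a sketch of that single-producer, single-consumer queue, following the description above (names and the exact layout are my reading of it, not the slide code):

```cpp
#include <atomic>
#include <new>
#include <stdexcept>
#include <type_traits>

template <typename T, std::size_t N>
class spsc_queue {
    struct entry {
        typename std::aligned_storage<sizeof(T), alignof(T)>::type storage;
        std::atomic<bool> initialized{false};
    };
    entry buffer[N];
    std::size_t push_pos = 0;   // touched only by the one producer
    std::size_t pop_pos = 0;    // touched only by the one consumer

public:
    void push_back(const T& value) {            // single producer only
        entry& e = buffer[push_pos];
        if (e.initialized.load())
            throw std::runtime_error("queue full");
        new (&e.storage) T(value);              // construct in place
        e.initialized.store(true);              // publish to the consumer
        push_pos = (push_pos + 1) % N;          // wrap around
    }

    T pop_front() {                             // single consumer only
        entry& e = buffer[pop_pos];
        if (!e.initialized.load())
            throw std::runtime_error("queue empty");
        T* p = reinterpret_cast<T*>(&e.storage);
        T value = *p;                           // copy out (assumed not to throw)
        p->~T();                                // destroy the stored value
        e.initialized.store(false);             // hand the slot back
        pop_pos = (pop_pos + 1) % N;
        return value;
    }
};
```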
33:05
If we were going to try and extend that to have multiple producers, multiple threads being allowed to push onto your queue at any one time, then things get a bit more tricky. You might think that you could just make it an atomic on your push position,
33:21
and so we load the value, and then we try and do an atomic increment, but we couldn't just use fetch add because we need to do the wrap round, so we do a compare-exchange-weak loop. You might think that this would be good, but it isn't. It's too naive.
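As a fragment, that naive step looks something like this; it is deliberately shown as described, because the next part of the talk explains why it is broken (N stands in for the buffer size):

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t N = 16;                  // stand-in for the buffer size
std::atomic<std::size_t> push_pos{0};

// BROKEN for multiple producers: the slot is claimed before anything is
// actually stored in it, which is exactly the race described below.
std::size_t claim_slot() {
    std::size_t old = push_pos.load();
    std::size_t next;
    do {
        next = (old + 1) % N;                  // plain fetch_add can't wrap around
    } while (!push_pos.compare_exchange_weak(old, next));
    return old;
}
```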
33:41
This is one of the first things that you find. If you're doing lock-free stuff, then the simple case is all nice and good. You try and extend things to more threads, and things go wrong. It's still broken. So the particular sequence of operations
34:01
that can get things wrong, we've got an empty queue. Push pos is zero. So thread one calls push back. It gets the position is zero and increments the position to one nicely. The compare-exchange just works. There's no other competing threads. It checks the cell is empty and gets suspended.
34:24
Then now suppose we've got another thread which pushes back until we've got a full buffer. So the push position has now wrapped all the way back round, and it's now pointing at our first entry again. And thread two then pushes one more item.
34:44
So it's actually trying to push our buffer size plus one item onto the queue, but the first thread hasn't actually put an item in its slot yet. So thread two gets the same position
35:00
and checks that the cell is empty. And it is. It is empty because the first thread hasn't done anything yet. It was allocated the cell, but it got suspended before it could put anything in the cell. So thread two checks that the cell is empty, puts a value in the cell, and it's done.
35:20
But thread one is then woken up, and it says, well, I just checked this cell. This cell is empty, so I'm going to put my value in the cell. And now we have unsynchronized access to a value which is being accessed by multiple threads, and we've got a data race, and the queue is broken.
35:42
And even if that was an atomic value that you were using, then you would have overwritten a value that hadn't yet been removed. So the naive attempt of just making the push position atomic doesn't work. So you might think, well, that only happens
36:01
if we were trying to push buffer-size-plus-one elements because the buffer was already full. Can we prevent this by checking the size for a full buffer? Can we add an atomic size? We load the size. If the size is the buffer size, then we keep looping. We put some sort of busy wait in there;
36:22
otherwise we're going to increment it, because we're going to push, we're adding a new value. But now we're not even obstruction-free, because in our example, if thread one was suspended while it's waiting to put its value in its cell, the buffer is now full,
36:41
and no thread would now be allowed to populate that cell because the retrieving thread will get blocked because the cell's empty, and the other pushing threads are blocked because the buffer's full. So if you suspend all but one thread,
37:01
unless the one thread that's awake is the one that's got a cell that it's trying to populate that's holding everybody up, unless that is the one that's awake, then nobody makes progress. This is not even obstruction free. Okay, so can we fix it? And to do that, we need to think about what the problem is
37:23
that we're actually trying to address. And the thing is that pushing a value has three steps. We need to find a free slot, we need to construct the pushed value in that slot, and we need to mark that value as available to whichever thread's going to consume the thing.
37:44
And if we get those in the wrong order, then we don't even get obstruction free or we just end up with data races. So what we need to do is we need to publish the availability of the data
38:03
at the end of step three, rather than acquiring our cell at step one when that cell is now blocked from all the other threads in the system. We need to... Now, if we do that at step three at the end, then we can deal with that.
38:21
So is there a way that we can manage that? And to do that, we need to sort of separate the buffer ordering from the queue ordering, so we need to redo the steps. And so you might think, okay, can we hunt the buffer for a free slot and then construct the pushed value into the slot and then link that entry into the queue? Is there a way of doing that, you might think?
38:43
And fundamentally, the answer is, well, you're going to have to end up using some form of linked list. It makes it the... If you stick with an array and you hunt through for free slots, then you end up with ever such a lot of complexity and then you still now got, okay, well, that was the slot that I added, but which order was it in the queue?
39:01
And now I need to put it on a queue to show which order my buffers were in. No, so let's just use a linked list. So, linked lists are easy, no? You just stick a node on the tail and pop them off the head. But we've still got two locations to update: the next pointer and the tail.
39:21
So we've got linked list, which has a node with a pointer to the next node. So we've got the tail pointer that says, this is the last node in the list. We want to add a new node. We need to change the tail pointer to point to our new node, and we need to change the next pointer from the existing one to point to our new node.
39:41
There's two things to update. And whichever order you try and do them, then you can end up with a race. So it's still not straightforward. So one possible option is actually to say,
40:02
I'm not going to update the next pointers. I'm going to update the next pointers from the one and only pop thread. And instead I'm going to make it a doubly linked list with a previous pointer because that we can safely do. So we have our new node, and we say, well, the previous node was the one that's currently last.
40:25
And if we then successfully change tail to point to our new node, then we know that that was the case because we're changing tail from what we think was the previous node to our node. And if the compare exchange fails, then we've got the new tail that we can try again.
40:46
So we allocate our entry by some means. We put our object in our storage. We clear out the next pointer because we don't know what it's going to be.
41:00
We say the previous one is the current tail, and we repeatedly try and update the tail to point to our new node. Now if this fails, then the compare exchange actually reloads the expected value, so we don't need to repeatedly call tail.load. So then when we're popping,
41:24
then if we find that there isn't a value at the head of the list, then that might be because, well, we haven't set up all the next pointers yet. So let's go and chase, start at the tail and chase down, filling in the next pointers.
41:40
We've only got one consumer thread, so we don't have to worry about races on that. So we'll look at how you do that in a second, but once we have chased through the tail, then we can then load the next pointer, get the value, destroy the old,
42:01
and then recycle our node, ready for reallocating for the next push. So chasing the tail is quite straightforward. So we take the current tail, and we're going to exchange it. We're going to swap it for null. There's no end of the list any more, because we've just taken ownership of that.
42:24
And then if there really wasn't anything, we're going to return a null and make sure we deal with that in the pop code. But then we can just go through from our tail and fix up the next pointers. We've got a node which is pointing back to a previous node,
42:42
and then we just say, well, take the next pointer of this one and make it point that way. And now we've re-linked them back in the correct order, and so then the pop can now proceed with just a simple linked list that way.
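Here is a simplified sketch of that multi-producer, single-consumer design as described: pushers only compare-exchange the tail and set their node's previous pointer, and the single popper takes the whole chain and re-links the next pointers. The names are mine, and I've used plain new/delete where the talk recycles nodes from a fixed pool, so this sketch is not allocation-free.

```cpp
#include <atomic>
#include <utility>

template <typename T>
class mpsc_queue {
    struct node {
        T value;
        node* prev = nullptr;          // written by the pushing thread
        node* next = nullptr;          // written only by the popping thread
        explicit node(T v) : value(std::move(v)) {}
    };

    std::atomic<node*> tail{nullptr};  // most recently pushed node
    node* head = nullptr;              // owned by the single consumer

public:
    void push_back(T value) {          // any number of threads
        node* n = new node(std::move(value));
        n->prev = tail.load();
        // Swing tail from what we saw to our node; on failure the current
        // tail is reloaded into n->prev and we simply try again.
        while (!tail.compare_exchange_weak(n->prev, n)) {
        }
    }

    bool pop_front(T& out) {           // only ONE thread may call this
        if (!head) {
            // Nothing ready: take the whole published chain and re-link it
            // front-to-back by chasing the prev pointers from the tail.
            node* t = tail.exchange(nullptr);
            if (!t) return false;      // queue really is empty
            while (t->prev) {
                t->prev->next = t;
                t = t->prev;
            }
            head = t;                  // oldest node of the batch
        }
        node* old = head;
        out = std::move(old->value);
        head = old->next;
        delete old;                    // the talk recycles instead of deleting
        return true;
    }

    ~mpsc_queue() {
        // By destruction time there is no concurrency; free whatever is left.
        node* t = tail.exchange(nullptr);
        while (t) { node* p = t->prev; delete t; t = p; }
        while (head) { node* n = head->next; delete head; head = n; }
    }
};
```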
43:03
So this queue is obstruction-free because there's no one thread that will hold up all the others. Now, if any of the pushing threads gets interrupted, then it's not going to cause a global stop.
43:23
And if the one popping thread gets interrupted, it's not going to prevent any of the pushing threads operating. So it's definitely obstruction-free. But is it lock-free? Now, if it's full, then we're going to have to wait.
43:46
So one thing is that then you might want to use a lock-free allocator instead of a fixed buffer. And if you've got a fixed buffer size, then we've got issues on fullness. And likewise, you've got issues on emptiness. But generally, if you're thinking about lock-free versus wait-free, then you can disregard those
44:01
if you've just got a test. So if it's full, then I throw: I have returned in a bounded number of steps. At a higher level up, I might have to put some high-level waiting in, but that's separate. But we do have waiting in the compare-exchange loops, and there's no upper limit, so it cannot be wait-free.
44:21
And then just so... But we also are using compare-exchange-weak. If you're using any compare-exchange, then if you are on an implementation which doesn't do it as a single instruction,
44:42
then that compare-exchange might be an unbounded series of steps in itself. And so then, actually, it's not technically lock-free, if that's the case, and we can only be obstruction-free, because there's no longer an upper limit on the number of steps before some thread makes progress.
45:03
So whether things can ever be lock-free really does depend on our implementation of compare-exchange-weak. But as an algorithm, then it's a lock-free algorithm, assuming that the steps are individually bounded.
45:21
Okay, so... One thing to think about, then, is how does it perform. And one of the big issues with lock-free stuff is cache ping-pong, where you've got... One processor is accessing a shared value
45:42
on some cache line, and then the other processor wants to modify either the same atomic variable or one right next to it that's on the same cache line. So it needs to grab ownership of the cache line over to the other processor. And then the first one tries to do it back again, and so it has to take the cache line ownership
46:02
back over here. If you've got different processors in the same system with different cache, or even if they're on the same core but they've got a different level-2 cache, then you can find that that can be a real issue, and the cache line is continuously shuttled back and forth between your two processors,
46:24
and you've got cache ping-pong, it's called, and this can have a big effect on your performance. Now this can happen if you've got the same atomic variable that you're doing, but alternatively, others on the same cache line. And that can be a big issue,
46:41
because cache lines might be 64 or 128 bytes. So if you've got a load of atomic ints right next to each other, then actually they're all on the same cache line and you might end up having performance issues with that.
47:03
Okay, so this queue that we were looking at has got many threads that can call pushback, and one more that can be calling popfront. So if we look at the data that we've got, we've got a push position, we've got our head pointer,
47:24
we've got our atomic pointer to the tail, which is the tail of our linked list, and then we've got our buffer that we were using. Now, the head and tail are right next to each other here,
47:41
so they're probably on the same cache line, but they're accessed by different threads. So even though accesses to head and tail are not going to race in any way because they're accessed by separate threads, it's going to shuttle the cache line back and forth. The pushing threads are going to be updating tail
48:02
and the popping threads are going to be updating head, but the cache line is shuttling back and forth. And there's quite a few potential places in this data structure that that could be. If your entries are small enough, then each of the individual entries might end up being on the same cache line, and then the entry accesses can also cause cache ping pong and so forth.
48:29
And the way that we deal with cache ping pong, if we don't want to change our structure, is just to stick some padding in, great big wodges of padding, char arrays that we're not going to do anything with,
48:42
but that pad out your data structure so now there is a whole cache line between each of the values, and you're not going to get cache ping pong due to accesses to separate values. Obviously, the threads pushing values are all going to access tail, so that's going to have to be shuttled back and forth for that,
49:02
but that's a core part of the algorithm, whereas the ping pong due to the popping from the head, then we can avoid that by sticking the padding in. And this trades memory space for performance. Yes.
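The padding itself is nothing clever; a sketch of the idea (64 is my assumed cache line size, and the standard constants mentioned in a moment became std::hardware_destructive_interference_size and std::hardware_constructive_interference_size in C++17):

```cpp
#include <atomic>
#include <cstddef>

struct node;                             // stand-in for the queue's node type

struct queue_data {
    std::atomic<std::size_t> push_pos;   // written by pushing threads
    char pad1[64];                       // keep head off the same cache line
    node* head;                          // touched only by the one popping thread
    char pad2[64];
    std::atomic<node*> tail;             // written by pushing threads
    char pad3[64];                       // keep the buffer that follows separate
};
```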
49:22
Not currently, but there are proposals for it. I think there's going to be two values, which if I remember rightly is constructive interference and destructive.
49:44
So it's like the minimum value of spacing that you have to put things to guarantee they won't touch, and the maximum value size that you can put together that guarantee will be on the same cache line, because sometimes that's important that you want things on the same cache line, so there's two values for the two things.
50:00
Yeah, so if you want things on cache line boundaries, the individual values will themselves be nicely aligned
50:23
so the accesses work, so it doesn't matter what the padding size is for that, but if you've got things like this next to some other value in memory, then you want to make sure that it starts on its own cache line and doesn't share with something that's previous in memory.
50:41
So you might want to take the whole structure and make sure that it's aligned to a cache line size. But yeah, so if you don't have the nice new constants, then picking a value like 128 for your padding size is pretty likely to be good enough for most systems. If you know where you're targeting, then you could look it up, but cache line sizes do vary,
51:02
because the processors and the memory architectures change all the time, and they say, oh, well, actually, no, 64 bytes works really well. No, 256, that's what we need. And depending on the details of the processor, then it might vary. So if you know what you're targeting, you can look it up. If you've got a compiler that provides the new constants, you can use that.
51:24
Otherwise, take a punt on 64 or 128 or something like that. So if we stick those in, does it make a difference? Yes. I ran 10,000 times, 10 million times,
51:45
pushing values on three threads and popping value on a fourth thread because I've got a four-core system I was testing it on. We timed it against the lock variant that we had right at the beginning. So with no padding, we had comparable times to the lock-based one,
52:03
but if you stuck the padding in, then it was really quite a significant difference. So that cache ping pong is a real effect. It's not just marginal. That's half the time. So far, we've all been using memory auto-sequentially consistent,
52:28
and I'm just going to say you really, really ought to unless you have a strong reason not to. It's possible to edge out some corner cases of performance by relaxing your constraints, but no, A, don't.
52:43
B, if you really, really want to, profile and profile some more, and review your code for like three weeks solid, even if you've only made one change. Convince yourself that it's correct and then profile some more and get somebody else to review it. If you're going to relax constraints, you need to be sure.
53:02
And also, test on a weakly ordered architecture. If you're testing it on x86, the memory order underneath is really strong. Test it on ARM or power, or if you've got access to an alpha, then use that. But if you're relaxing in memory ordering, then it's very easy for things to work in test and then fail in production.
53:26
So yeah, on x86, only the store really is affected, but power and ARM then actually can affect all the operations. So yeah, make sure you test on a weakly ordered system if you're going to try and relax the ordering constraints.
53:43
Okay, quick dive through stacks. Only got a few minutes left, so it's going to be speedy. A stack is a simpler data structure than a queue. It's great for examples, but mostly you're not going to use it in multi-threaded code, because if you've got multiple threads accessing,
54:02
everyone's piling on the head. There's only the one point of access, accessing the top of stack. But we're going to use it for a specific demonstration of a specific problem, the ABA problem. So assume we've got a stack which has got nodes with next pointers.
54:22
We've got atomic value to hold the head of our stack. Create a new node, compare exchange to stick it on. We can just loop around. It's nice. If we've only got one consumer popping, it's all nice. We can just take the head and do a compare exchange.
54:42
If it works, then we take the value and delete the old head. The new values are pushed on. There's only one thread popping it off. The pushers are not accessing the contents of the nodes, so we can safely delete them.
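A sketch of that stack (multi-producer push, one consumer pop; the names are mine). The delete in pop is only safe because there is a single popping thread, which is exactly the restriction about to be explained:

```cpp
#include <atomic>
#include <utility>

template <typename T>
class lockfree_stack {
    struct node {
        T value;
        node* next;
    };
    std::atomic<node*> head{nullptr};

public:
    void push(T value) {                        // any number of threads
        node* n = new node{std::move(value), head.load()};
        // Swing head from what we saw to our new node; on failure n->next is
        // reloaded with the current head and we try again.
        while (!head.compare_exchange_weak(n->next, n)) {
        }
    }

    bool pop(T& out) {                          // only ONE thread may call this
        node* old = head.load();
        while (old && !head.compare_exchange_weak(old, old->next)) {
        }
        if (!old) return false;
        out = std::move(old->value);
        delete old;                             // safe only because nobody else pops
        return true;
    }
};
```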
55:02
But it's a single consumer for a reason, and the answer is this ABA problem. The ABA is you've got a value in your data structure that holds value A. You read it, it's A. Some other thread, then your thread gets suspended, and other threads do stuff. It gets changed to some other value B
55:23
and then changed back to A again, but now that A has a different meaning from your data structure. So when you look at it, it's A still, but it now has a different meaning, and when you then make your operations based on that, you do something wrong. So in this particular case,
55:41
if we then said multiple threads can pop, what's going to happen? Thread one calls pop. It reads the old head and gets a value A. It reads the next pointer from its old head and is then suspended. Thread two pops two items. The head now has a new value, which you're going to call B,
56:05
and then we push two items, but because we just freed some values, then the allocator says, well, I'm going to give you the same object that you just freed, because often it does that. If you call new and then call delete and then call new again with an object of the same size,
56:21
you might well get exactly the same pointer value. When we stick the items on the stack, the new one that we just allocated has the same address as the old one we just freed, but it's a new node. So head now has the original value of A,
56:42
but it's a new node, A, and not the original. So when our first thread resumes, its compare exchange will succeed because it's got the same value that it's expecting, but it's now referring to a new node.
57:01
So the next pointer that it loaded up here is now pointing off into a deleted node, and when we change the data structure to point to our deleted node here, which is just a broken data structure now. So rather than having the correct setting that we wanted,
57:25
we've just messed everything up. And that's the general setup. We have value changes from A to B and back to A, but the global state means that that A now is based on a wrong view of the global data structure,
57:45
and so everything goes wrong. Quick solution is to use a struct. So we stick in a count value. You increment your count every time, and then you know that it might have the same pointer value, but it's got a different count value, so that's okay.
58:02
We can spot that the difference is there. Alternatively, you can ensure that things are not recycled whilst they're still accessible. If you've got an atomic shared pointer implementation that you can use, then use that. If you can use hazard pointers, do that. It's possible to do other things,
58:22
but it can be quite tricky. Things like this, it can still be lock-free, so like I said at the beginning, on 64-bit x86 platforms, this is a 16-byte structure. It may or may not be lock-free, depending on precisely whether your library supports it
58:41
and whether or not the processor that you've got supports it. On different platforms, then whether or not that's lock-free will vary. Okay, so what guidelines can we pull from all this? First one I'd say is don't use atomics unless you have to. Profile your code before and after.
59:02
Make sure that this really is a beneficial change that you're making. Firstly, is it the bottleneck of your code that you're actually wanting to update? If the data structure you're trying to update isn't actually the bottleneck, don't do that, do something else. But if it is, then profile it before,
59:20
profile it afterwards, make sure you've actually improved things. Make sure you test on a weakly ordered architecture, such as Power and ARM. And again, don't use atomics unless you have to. And as a final one, which isn't on the slide, just a reminder that if you pad things out where there's contention, then that can actually have a dramatic performance improvement.
59:45
So, more general guidelines. If you think in transactions, then that can improve things. If you do your work off to the side and commit with one atomic operation, there's now only the one operation to think about.
01:00:00
So that can have a big benefit in terms of making sure that things are working correctly. If your operation is big, then if you can split it into smaller steps which can in themselves be done as an atomic operation, possibly off to the side and then committed, then that can make things work.
01:00:24
It can help if you can manage to retain the data structure invariants between each step; then that can be good. So there's various things you can do with inserting marker values that say, I'm not a real value yet, but I'm going to be here as a placeholder. So if you're a reader thread, just skip over me for now.
01:00:43
And then the value really is set as a separate operation. Then obviously, as we saw, limit your use cases. If it only works with a single consumer thread, say it only works with a single consumer thread and just declare that as part of your documentation.
01:01:01
And then if people try and use multiple consumers, that's their tough luck. But if multiple consumers are something you need, then you can't limit that and you've got to think of a new approach. And watch out for the ABA problems. The chances are that it won't happen in testing, as ever, because the things
01:01:22
have to align just right for you to actually get the same value back again. But if this is a heavily used thing, then it will happen at some point. And undoubtedly with any of these things, it will happen at the worst possible point. You're trying to demonstrate to a crucial customer and it will all go wrong.
01:01:43
Avoid your cache ping pong with padding. Use sequentially consistent memory ordering unless you really, really, really have a strong idea of what it is you're trying to achieve. And you really want the performance gain. And then package up your whole lock-free structure and then use it from the outside through an API.
01:02:06
So it's just a queue as far as anybody else is concerned. And then you can fiddle with the implementation, make the implementation thoroughly tested and thoroughly reviewed. And then switch, but from the outside world, you just use it as a queue and it might as well be a lock-based one.
01:02:26
Aim for lock-free code rather than obstruction-free. If you need it, then you go for wait-free code, but it's really, really complex. Because trying to avoid any unbounded loops is actually really hard. And don't use busy waits unless you have to.
01:02:43
So, have we got questions? We're over time as it is, but if anybody does have any questions, then feel free to talk to me. So, I'll stop there.