
The Computer Science behind a modern distributed data store


Formal Metadata

Title
The Computer Science behind a modern distributed data store
Subtitle
The science it takes to build a multi-threaded, clustered and highly performant database
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
What we see in the modern data store world is a race between different approaches to achieve distributed and resilient storage of data. Most applications need a stateful layer which holds the data. There are at least three necessary ingredients which are anything but trivial to combine, and of course even more challenging when heading for acceptable performance. Over the past years there has been significant progress in both the science and the practical implementation of such data stores. In this talk the audience is introduced to some of the needed ingredients; we address the difficulties of their interplay and show four modern approaches to distributed open-source data stores. Topics are:
– Challenges in developing a distributed, resilient data store
– Consensus, distributed transactions, distributed query optimization and execution
– The inner workings of ArangoDB, Cassandra, CockroachDB and RethinkDB
The talk will touch on complex and difficult computer science, but will at the same time be accessible to and enjoyable by a wide range of developers.
Transcript: English (auto-generated)
Good afternoon, everybody. This is quite what I would call enthusiasm at this time on a Sunday afternoon. Also with this title, which can only be meant as a joke, because it's impossible to cover what is necessary to do that.
But anyway, looking at the name dropping here, it becomes really hilarious, because all of these subjects are obviously by themselves
filling entire workshops and conferences and so on. But we will do some scratching of the surface of all of these. I'm going to take a couple of minutes more than usual for this talk, which has already been given a couple of times at conferences by my colleague Max Neunhöfer,
to dive a little deeper into resilience and consensus. But anyway, these are the subjects; we're talking distributed databases. And what I'm not going to touch upon is what it takes to create a good database in the first place.
So when I joined ArangoDB two and a half years ago, my background is theoretical physics, and then PhD in medical imaging and so on, I did really fancy stuff, as I thought. Doing computations on high performance cluster environments, the reconstruction of medical images that were completely lost for diagnosis,
and we made them shiny and nice and diagnostic again. And then I joined ArangoDB and I thought, these people, I can teach something. And that was a very, very hard landing, as you can imagine. Databases are probably, after operating systems,
the toughest business in the trade. The amount of work and the concentration that you have to put into a single-instance database is already insane. And we are now going onto the network and doing all the fun stuff that is really going to fail at some point.
All right. Again, consensus, really fun, really interesting subject. And then we're going to visit one of the oldest businesses of the business, the sorting. Then I'm going to do some name dropping, the last three subjects,
we just look at what they mean, what they achieve inside ArangoDB, what they can do, and why they are necessary if you want to do things like cluster-wide transactions and so on. Okay. Yes, the bottom line here is that you actually do computer science.
None of the stuff that I'm going to show you is made by us or published initially by us. We have implementations, maybe better ones, maybe worse ones, but we have implementations of these and we really had to go through the hard business of making them work.
Lots of cases you will see algorithms just very broadly, scientifically explaining things, but then at some day you will end up on a main board with a CPU and RAM connected with copper to some switch and so on. And then the whole thing becomes an entirely different ballgame
and time suddenly matters and is not linear, it seems, at times and so on. This is from a physicist. Okay. A modern data store is distributed. Nobody wants to run a database anymore on a single engine, knowing that the day will come when it cannot scale anymore.
The application will suffer, get slower, maybe even crash. I should maybe have said it the other way around. In most of the cases you will probably have a crash rather than just becoming slow. When you merely become slow, you have done a really good job coding already.
Okay. Why is it important that different parts of the software agree on things? Let's say you are starting a database and you are creating within your database software a dedicated database for a specific task, and then you have tables like in an SQL database, or collections like in ArangoDB.
And this whole knowledge about the structure of stuff inside the database, we always assume as a given, right? We just create a database and assume it's going to be there the next day and we create a collection or a table, we assume the table is going to be there when we come to the next select statement
or in our case an AQL statement. The thing is that this is not trivial at all on a network. So if you have eight computers or 1,000 computers that need to agree on a certain configuration of the entire thing, this is already anything but trivial. So on a single instance you would have a configuration file
or you have some sort of a file where this kind of information is stored, and as long as nobody really breaks the file system, nobody deliberately deletes it with malicious intent, if you will, you can depend on that. On a network, there is no such thing as a ground truth unless you create one.
Technically you would think to yourself, we are on a network, we have a lot of computers, we can just pass the stuff around. So in this case we don't only create a ground truth, we have even replicated it. So in case somebody goes away, we still know the ground truth.
Well, the thing is that it turns out that it's not so easy after all, because on a network a couple of things can happen. You can have outages at any time. How often do you want to send something that is supposed to go to everybody? Is something that is on the network actually current, or is it something that was sitting on the TCP/IP stack
of a computer that was detached from the switch because the port was flapping or whatever? Is this thing coming in five minutes later? Does this still apply? Let's go to a bank account. Let's say you have three operations: add 100 euros, check if the 100 euros are there,
and then deduct 50 of them or something. What is obvious is that you cannot take the money off before the 100 went in. So these kinds of things really matter. The order in which you create your truth is also important. And it's not that somewhere you just have a variable A equal to 0 or to 12 or something;
you probably have a state machine of some real complexity. Packets are dropped, are delayed, are duplicated. Disks fail, of course. I'm sure everybody has had all of these so far. Machines fail, racks fail, entire data centers fail.
Yes, and we haven't even talked about things that just happen maliciously. By the way, not that we could solve that problem. That's on a different page, I assume. There is one protocol that has been discussed for ages. 1989, that's almost...
it's almost the beginning of computer science, you would say. Leslie Lamport, who is now with Microsoft Research, published a paper that is really fun to read, at least a couple of pages into it. It paints the picture of a monastery
somewhere on the Greek island of Paxos, where a legislature of monks is convening at all times, really, during the day to make decisions. And those decisions are then written down in the notebooks of the monks, and now, after a couple of days, you want to know, well, what is the current law, really?
And that's the way the paper is written. I think initially, when he was giving the first talks, he was dressed up like a Greek monk. People didn't understand the humor, so he stopped doing that. But it's a really fun paper to read. And then it took ages, until 2013, when
Diego Ongaro wrote his PhD thesis. And the funny story behind that is that his professor tells him, um, look, there is this consensus algorithm out there, it's called Paxos, why don't you go and look at it, there have been a couple of publications.
By the way, who does consensus for a daily business? Anybody here? You? Okay, at least one person. This is more than usual. People just take it for granted, seriously. I will come to that, you will be surprised how much brain you've got to invest in, this tiny little detail.
Anyway, so the student goes off, they meet a couple of days later, he tries to explain the paper. And like three minutes into the discussion, the professor says, stop this, I don't get what you just said. How does this work? And he said, I think you have to pull this first and then press that button and whatever.
He explained it, and the professor said, I don't get this. And the answer was, okay, now that I think about it, I don't understand it either. They tried multiple iterations and came up with the idea, we've got to solve this problem for good. And there are nice talks out there; go on YouTube, you will find the original talks that were given on the subject.
The only goal was to make Raft simple. Very seriously, that's what they say at least today, the entire design goal was: anything that we decide about how the protocol is supposed to work must be easy to understand, or we don't do it, period. Then we just don't invent a new protocol.
Okay, let's go through that in a moment, after I have maybe also explained that the original Paxos publication is technically not an implementation paper. It's just discussing the matter and explaining how it should work and how it could work. There have been a lot of publications after that,
and so Paxos has effectively become a family of protocols, really. You see things like Multi-Paxos, EPaxos, S-Paxos, I don't know, whatever combination of characters and Paxos is out there. They all try to solve the issue that the original paper, as good as it is and as well as it proves the fundamentals,
is not really an implementation paper. Raft is, again, completely different, supposed to be simple. The protocol is explained in complete detail in the original publication and in a lot more detail in the PhD thesis,
which is also available online to read. And also, before anybody starts running out the door implementing Raft for something that they do on the network: this is a bottleneck by design. There is a leader to the whole outfit, we will see that in a moment, and this doesn't scale. So you can't take Raft and throw a hundred computers at it
and make it, say, 50 times faster. Yeah, that's not going to happen. The hundred computers will become as slow as one computer. But I will explain in a moment why that is and how it is still useful. Yeah, so Max's idea is, and I left this in there,
although I do disagree a little bit, and I will say in a moment why. And he always says, I'm going to give you advice although nobody likes advice. But seriously, it's really fun: read the Paxos paper in bed or wherever you have some time, and seriously, don't implement it, that's true. Okay, because it's just old, it's interesting
because it's part of computer science; for a significant decade, two decades, it was out there alone. And then take Raft and, if you like, actually implement it. I'm not so sure you shouldn't do it,
but I'm going to tell you something. The moment you want to really operate a service on top of it, it's going to become really not so much fun anymore, because you will find out that you need a lot more than what is actually in the original paper. You need to take care of things like compaction; you need to take care of the fact that we don't have real-time systems,
but time-shared systems with a lot of processors, there's a lot going on, there's maybe a log rotation going on somewhere else, there's a real network attached to your computer and so on. So you're trying to take a proven protocol (there is a formal proof to Raft in the paper) and actually implement it in real life.
It will remain tricky. It took us way longer than we thought. Originally we thought four weeks; it ended up being something like 12 months or so. And then the bug fixing that went on and on. And we just recently, as Ewa mentioned in this talk, by the grace of the Kubernetes operator,
found significant bugs still in the agency, which is the Raft implementation in ArangoDB. Okay, how does this work? I'm going to flip through all these. We have a bunch of servers.
And what is really important to us is not the state machine by itself, but it's more the order of instructions that came in to create that state machine. So if we take my bank account, we really are interested in the order of everything that happened to it during the last year, in sequential order.
There is no event on my bank account that withdraws money before the company's payment that put it there actually happened, and so on and so on. This is really important, because you might just go ahead and change your state machine value
by an increment of one. So if the value before the increment was wrong, your result after it will also be wrong. Just by doing an increment, in other words, you're not accidentally fixing it down the road. Whereas an assignment would do that, for example, right?
If you set X to 100, because it doesn't assume anything about what came before, it's still going to be correct. But you really can't work like this, right? Just imagine for a moment I set all configuration like that; you would just drop a database because it's no longer in that state machine. So the order of the events is important.
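A minimal illustrative sketch, not taken from any actual database code, of why the order of replicated-log entries matters: an increment depends on all prior state, an assignment does not, so every replica has to apply the same entries in the same order.

```cpp
// Sketch: applying replicated-log entries to a state machine, in order.
#include <cstdint>
#include <iostream>
#include <vector>

struct Entry {
  enum class Op { Set, Increment } op;
  int64_t value;
};

int64_t apply(int64_t state, const std::vector<Entry>& log) {
  for (const Entry& e : log) {
    state = (e.op == Entry::Op::Set) ? e.value : state + e.value;
  }
  return state;
}

int main() {
  std::vector<Entry> log = {{Entry::Op::Set, 100},         // deposit: x = 100
                            {Entry::Op::Increment, -50}};  // withdraw 50
  std::cout << apply(0, log) << "\n";  // 50 on every replica applying this order

  std::vector<Entry> reordered = {{Entry::Op::Increment, -50},
                                  {Entry::Op::Set, 100}};
  std::cout << apply(0, reordered) << "\n";  // 100: same entries, different order,
                                             // different state machine
}
```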
Everything is replicated to everybody in the same order. That is, if you go on every one of the computers that are part of this consensus, they will all have all the instructions with the same index, by the way, in the same order, in their RAM and persistent somewhere to disk so that after a disaster or after an outage or whatever,
you can restart your cluster again, obviously. The first thing that really happens in the system, in a Raft network, is that you need to find a leader. As long as you don't have a leader, nothing is going to happen. This is the same story over again after every restart, and the same story over again after every failure of a leader.
The way you would usually do that is to find a way of not having everybody rush for the leadership at once, but to bring in some order, some element of randomness, to it. In this case, what happens is,
once you haven't heard from a leader for a little while, or you just started and there was no leader, you will reset a local clock to some random value between a minimum and a maximum, let's say between one tenth of a second and a second. You will wait that random time. Once your time is over, you start a leadership election.
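A minimal sketch of that randomized election timeout (illustrative only, not the ArangoDB agency's code; the 150 to 300 ms range is just a typical Raft choice):

```cpp
// Sketch: a follower picks a random election timeout; whoever times out first
// without hearing from a leader asks the others for their vote.
#include <chrono>
#include <iostream>
#include <random>

using namespace std::chrono;

milliseconds random_election_timeout(milliseconds min_t, milliseconds max_t) {
  static thread_local std::mt19937_64 rng{std::random_device{}()};
  std::uniform_int_distribution<int64_t> dist(min_t.count(), max_t.count());
  return milliseconds{dist(rng)};
}

int main() {
  auto last_heartbeat = steady_clock::now();
  auto timeout = random_election_timeout(150ms, 300ms);

  // In a real server this check runs in the main loop; here we just illustrate it.
  if (steady_clock::now() - last_heartbeat > timeout) {
    std::cout << "no leader heard from, starting an election\n";
    // become candidate: increment term, vote for self, ask peers for votes
  } else {
    std::cout << "leader still alive, waiting " << timeout.count() << " ms\n";
  }
}
```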
There is a very nice toy to play with and get a grasp of how it works. We will play with that in a moment. I'm not going to give you too many details on that because it's much easier to see. Once you have a leader, the leader is ready to append instructions to its replicated log.
You can give it all sorts of information. We are creating a database in this huge cluster of ours. Somebody is writing all the necessary skeleton information of that database into our agency, into the replicated log. The DB servers out there will see that, do some stuff, write their own stuff into that, and so on.
That's evolving. This is the state machine; this is the replicated log that's going on. Something becomes the truth, and you can respond to a client who wanted to write to it, once you have spread that specific information in that specific order
to a majority of the other servers. Which means that if you have three servers, then that majority is fixed at two. If you lose one server, it will remain two; you're still able to operate. But if another one fails, you're out. You can't answer, because you can't know whether you're not just a split brain:
are you just segmented away from the network because your switch broke, or is it because the others went down? You just can't tell. You can't even tell what the value of X is, for example, in your key-value store. You have to stop operation. And you also have to make sure that there's only one leader out there.
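A tiny sketch of that majority rule, purely illustrative:

```cpp
// Sketch: a log entry only counts as committed once a strict majority of the
// servers have acknowledged it. With 3 servers the quorum is 2 and stays 2 even
// if one server is lost; lose a second one and you can no longer commit.
#include <cstddef>
#include <iostream>

constexpr std::size_t quorum(std::size_t cluster_size) {
  return cluster_size / 2 + 1;  // strict majority
}

bool committed(std::size_t acks, std::size_t cluster_size) {
  return acks >= quorum(cluster_size);
}

int main() {
  std::cout << quorum(3) << "\n";        // 2
  std::cout << committed(2, 3) << "\n";  // 1: two of three acked, safe to answer
  std::cout << committed(1, 3) << "\n";  // 0: only ourselves, maybe a split brain,
                                         // so we must not answer the client
}
```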
The unique leader: if there wasn't a unique leader at all times, regardless of whether you have a split network or not, that would just be a recipe for disaster, obviously. It's not easy to get it right. I'm not sure if it's not fun.
Some parts were actually fun. It was very, very enlightening in many ways. But again, getting it to work is not easy at the end of the day. And then it turns out that these things like compaction and so on really make things a lot more complicated than the original protocol itself. Okay, let's actually play with Raft a little bit.
Here it is, still open. I'm going to rewind this to the beginning because it's like four seconds into its life. All right. Five computers, part of this Raft network.
And the first thing they do, as I said, is try to elect a leader. You see already these grey partial circles around the orange circles. And these are the time presets, what I told you about this random thing that we are going to set at the beginning.
And I'm going to start this whole thing. And it looks like S1 is going to win the race. When time's up, it sends to everybody a command, vote for me. And in this case, it was successful and everybody just agreed to that.
So anybody who would have come first would have gotten that positive answer. You can imagine that two of these times end up being the same, right? This is random, but it's not a big deal. Then maybe three vote for one, two vote for the other. The two votes are not enough, but the three votes are enough to become a leader.
And the next round, the new established leader, because it knows that it's got a majority vote, will suppress the other two also. So it will tell them, vote for me, and this time they will vote for the new leader. This is fairly robust, it works very easily. So even if you have a leadership change and you have massive load on your network
and massive load on your computers and so on, you do end up electing a leader fairly quickly. And it's also clear that as long as the Raft thing is not working, we don't know what the ground truth of our cluster is. And no, we are not talking about the data that you're actually putting into the database.
We're talking about the structural data: we're talking about what databases are out there, which collections are there, which indexes are there, and vice versa; which DB server is hosting which shard of which collection and so on. This kind of information is in there. When the Raft network goes down, the entire cluster will stop responding.
Everybody starts screaming in their log messages, something about a shaky network and so on. And really, there is no other way to work here. This is the CAP theorem, right? You can have consistency, availability, or what is the other one?
Usually somebody knows it and helps me out. Thanks: partition tolerance. So you can only have two of these at the same time. And we at ArangoDB, as a database, value consistency in this case more than, let's say, availability.
At that point we just have to say, okay, if we go on from here and lie to a request, that's probably worse than just saying, we don't know what's going on right now, give me a second. Okay, now we can also, for example, stop this leader and see what happens. Hey, why don't you just continue?
Okay, time is going to run out, as you can imagine, on S3, it seems. It's going to... unfortunately S2 is not... wait, yeah, see, so S2 was also quick enough, but it did not convince enough of the other computers to become the new leader.
Read it, it's really fun, play with it. Also, maybe do something like send a request on the network, like add 100 euros to my bank account and see how this information propagates through the replicated logs. These are the replicated logs on all the servers and so on.
Unfortunately, I don't have more time for this. I could spend two days explaining what is going on here. But again, you will find nice talks online, play with it. I am amazed by this tool because this is one great opportunity to understand the protocol really nicely when people do this kind of effort to make it easy for you to read a paper.
And as playful as this is, there is a real mathematical proof in there, actually. Okay, sorting. Sorting is a big business. It's been forever. As long as there's been computers, there's been sorting.
We are unfortunately at a place where modern hardware has made most of the algorithms that we know that are taught in universities and so on pretty much rubbish. Really, just because computers have changed a lot, right? The problem used to be comparison computations.
What has happened since back then is that CPUs have become 20,000 times faster, but memory access has only been accelerated by a factor of 40. And then you also have 32 cores, maybe, on your CPU,
which means that the balance has shifted by three orders of magnitude to the wrong side for all these nice papers. And, okay, you have to deal with this because it's important. Actually, one of these algorithms is not rubbish.
I was lying just earlier. There's an algorithm called merge sort. And the merge sort effectively is explaining how you try to create sorted data sets first
and then merge them somehow together. As a matter of fact, if you go on Stack Overflow and look for how do I get two C++ vectors sorted into one big one, you will see pretty much people putting out this merge sort protocol. The problem is that you're still hitting the same kind of issue, that every time you go to grab something out of these vectors here,
your sorted vectors, you're hitting the memory wall. You're going to have a cache miss, and this is going to cost you dearly. The performance is going to end up being dominated by memory bandwidth rather than by the CPU, just because you keep taking these cache misses.
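Roughly the kind of naive two-way merge such answers suggest (an illustrative sketch, not a quote from Stack Overflow): correct, but every element it takes comes straight from main memory.

```cpp
// Sketch: walk both sorted vectors front to back and always take the smaller head.
#include <iostream>
#include <vector>

std::vector<int> merge_two(const std::vector<int>& a, const std::vector<int>& b) {
  std::vector<int> out;
  out.reserve(a.size() + b.size());
  std::size_t i = 0, j = 0;
  while (i < a.size() && j < b.size())
    out.push_back(a[i] <= b[j] ? a[i++] : b[j++]);
  out.insert(out.end(), a.begin() + i, a.end());  // append whatever is left over
  out.insert(out.end(), b.begin() + j, b.end());
  return out;
}

int main() {
  for (int x : merge_two({1, 4, 9}, {2, 3, 10})) std::cout << x << ' ';  // 1 2 3 4 9 10
}
```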
If you appreciate the inner workings of a CPU, however, you know that there are CPU registers right at every core and then there are multiple layers of caches that go all the way to the memory bank out there. Every time you need something, you would hope that you can get somewhere
where you would get as much from cache as possible. Of course, you have to go to main memory to get stuff to come back and pretty much feed your algorithm, but the thing is that as long as you do that in a sequential manner, you only hit one cache miss and then grabbed a bunch of data and brought them in.
The idea here is that we use in ArangoDB a data structure called a min heap. A min heap is a balanced binary tree. Balanced means that it's just not very, very deep on one side and shallow in other places.
And with a sinking algorithm we populate this binary tree with values from these sorted vectors. And the idea is that the only rule in there is that any node that is above others is going to have a smaller value.
This means that, with a little luck, if you do that sinking properly, you end up doing the comparisons only in the cache now. This whole min heap you obviously design on your specific system to fit in the cache. And everything that is coming in there is also coming through local memory.
So if you're thinking of a NUMA architecture, you're effectively feeding 32 processors from local memory and end up doing the comparisons 95% of the time in the cache. If you then go on your computer and run top,
you see 32 processors actually maxed out at 100% doing sorting only. I haven't given you any details on the sinking algorithm. The code is online; it's not that much. You have to do it carefully, but it can be done. Of course, ArangoDB's code is online,
because this is an open source conference, right? So you can actually go and check the sinking mechanism that we built for that, and effectively the gain is ginormous: you can do the sorting at insane speed. Now, why would you sort stuff if you have a database? Well, usually people don't just throw stuff into a database
and then never care for it. They come back and will ask you to give them some of the data. And they will probably have a discriminatory filter of some sort on it. Hopefully there is an index sitting on that discriminatory field and that's why we have to do the sorting.
An index is nothing else but an inverted key-value store based on that specific field, or maybe a combined index over multiple fields, that you can use to quickly grab something that is being looked for. So much for the merge sort.
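To make the min-heap idea concrete, here is a minimal sketch of a min-heap based k-way merge (illustrative only, not the actual ArangoDB implementation): the heap holds one cursor per sorted run, ordered by the current value, so each comparison only touches the small heap that fits in cache.

```cpp
#include <iostream>
#include <queue>
#include <vector>

struct Cursor {
  const std::vector<int>* run;
  std::size_t pos;
  int value() const { return (*run)[pos]; }
};

struct ByValue {  // min-heap: smallest current value on top
  bool operator()(const Cursor& a, const Cursor& b) const { return a.value() > b.value(); }
};

std::vector<int> kway_merge(const std::vector<std::vector<int>>& runs) {
  std::priority_queue<Cursor, std::vector<Cursor>, ByValue> heap;
  for (const auto& r : runs)
    if (!r.empty()) heap.push({&r, 0});

  std::vector<int> out;
  while (!heap.empty()) {
    Cursor c = heap.top();
    heap.pop();
    out.push_back(c.value());
    if (++c.pos < c.run->size()) heap.push(c);  // sink the cursor back in
  }
  return out;
}

int main() {
  for (int x : kway_merge({{1, 5, 9}, {2, 6}, {3, 4, 8}})) std::cout << x << ' ';
  // 1 2 3 4 5 6 8 9
}
```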
Right. The next problem is, let's say we have a computer that has 16 gigabytes of RAM and we have one terabyte of ultra-fast SSD attached to it. And people would assume that when they start writing data in that the data really gets written quickly,
that you're not waiting for a long time to get some data through. And this is super crucial, okay? This is not just some nice gimmick that you give to the customer. If you cannot sustain the speeds that are necessary to deal with today's data, the problems appear somewhere entirely different: then maybe your entire threading mechanism starts suffering badly
from this congestion and so on. So, it's not just that you're slower, you're creating a huge set of problems for the customer as well as for your own processes. It's absolutely crucial that we can do bulk inserts very, very quickly.
But at the same time, some software or person would also like to come along while you're doing these bulk inserts and very quickly get stuff out again. And in most of the cases you will see that the classical B-tree-structured algorithms and so on
will fail on one of those two requirements, one way or the other. Log-structured merge trees are a nice idea here. You start writing data, bulk inserts, from the top into this picture that you see here.
So, everything that you're writing, you're writing into the level zero. You have as a discriminatory value some key of some sort and with the value of that key, everything is sorted in different bunches. Everything that you get, you just sort directly and keep in memory
and at the same time make a write-ahead log on disk and keep a copy on disk for a power outage and so on. While you cannot say anything about this... Well, they are all sorted, right? But you don't know which key ended up going into which chunk here.
You don't have to, because the next level the data trickles down to is a level where, while everything is still sorted, you make sure that the key ranges don't overlap anymore. And then this goes on, down, further and further, until on every machine all the shards of your data will look something like the bottom.
The problem is that, okay, this is nice, you can do a lot of bulk inserts. Ah, something that I forgot to mention: every time that you pack up such a bar at the top, you don't touch it anymore. It's immutable. It just gets written once to disk and to RAM, and depending on parameters that you can configure in ArangoDB or other databases,
you will either wait for the sync to disk first before you answer, or not. That depends on how important the data is. Session data might not be as important as customer data or payment data and vice versa. But at the end of the day, you don't touch these bars anymore.
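A minimal sketch of that write path, illustrative only and not RocksDB or ArangoDB code: write-ahead log first, a sorted in-memory buffer second, and a full buffer is frozen into one immutable, sorted level-0 file.

```cpp
#include <fstream>
#include <map>
#include <string>

class MemTable {
 public:
  explicit MemTable(std::size_t max_entries)
      : max_entries_(max_entries), wal_("wal.log", std::ios::app) {}

  void put(const std::string& key, const std::string& value) {
    wal_ << key << '\t' << value << '\n';   // 1. durability: write-ahead log first
    wal_.flush();                           //    (real systems let you tune the sync)
    table_[key] = value;                    // 2. sorted in-memory buffer (std::map keeps key order)
    if (table_.size() >= max_entries_) flush();
  }

 private:
  void flush() {                            // 3. freeze and write one immutable, sorted file
    static int file_no = 0;
    std::ofstream sst("level0_" + std::to_string(file_no++) + ".sst");
    for (const auto& [key, value] : table_) sst << key << '\t' << value << '\n';
    table_.clear();                         // the file on disk is never touched again;
  }                                         // cleanup happens later, via compaction

  std::size_t max_entries_;
  std::ofstream wal_;
  std::map<std::string, std::string> table_;
};

int main() {
  MemTable mem(3);
  mem.put("user:1", "alice");
  mem.put("user:2", "bob");
  mem.put("user:3", "carol");  // triggers a flush to level0_0.sst
}
```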
Every time you pack something like this, the only way of getting rid of stuff is repacking, what we call compaction. So what is compaction? Does everybody know what compaction is? Compaction effectively means that, okay, there has been a lot of stuff coming in. Somebody said x is equal to one, somebody else said raise x by one, and then some time later somebody came in and said x is equal to 12.
The x equal to one and the raising by one are of no importance anymore, right? I can just throw that information away, but I would never open a file like this and try to restructure stuff in place. I will just take whatever is still of importance and create a new file and sort stuff and so on and so on.
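A minimal sketch of that compaction idea, illustrative only: merge the immutable runs and keep only the newest version of every key.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Each run maps key -> value; runs[0] is the oldest, runs.back() the newest.
using Run = std::map<std::string, std::string>;

Run compact(const std::vector<Run>& runs) {
  Run merged;
  for (const Run& r : runs)  // later runs overwrite earlier entries
    for (const auto& [key, value] : r) merged[key] = value;
  // A real engine would also drop tombstones and write the result to a new file;
  // the old files are deleted afterwards and are never modified in place.
  return merged;
}

int main() {
  std::vector<Run> runs = {{{"x", "1"}},     // x = 1        (oldest)
                           {{"x", "2"}},     // x raised to 2
                           {{"x", "12"}}};   // x = 12       (newest)
  for (const auto& [key, value] : compact(runs))
    std::cout << key << " = " << value << "\n";  // only "x = 12" survives
}
```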
Okay, this still doesn't help with reading quickly out of this structure, because, yeah, where would you find a specific key? At what stage of trickling down is the specific piece of information that you want? If you're lucky, the stuff is still in the hot set,
and you have a nice clever filter algorithm like the Bloom filter, which has, for the size that it takes, a lot of very nice properties. What it does is it is truthful whenever it says no. When you ask, is this key in that bunch here,
and it says no, you know for sure, okay, it's not in there. If it says yes, maybe it's lying, you don't know, you have to look. Okay, but the good thing is that you hopefully don't touch 90% of these packages. Right? That's the whole idea. So, there is a slightly fancier algorithm called the Cuckoo filter
that needs about the same amount of space and gives you a better probability, as far as I remember. Unfortunately, I don't remember anymore whether we went down that Cuckoo path, because I remember there were some frustrations with the results that it ended up delivering in ArangoDB.
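A minimal Bloom filter sketch, just to show the yes/no asymmetry (illustrative only, not the filter actually used in ArangoDB or RocksDB):

```cpp
#include <array>
#include <bitset>
#include <functional>
#include <iostream>
#include <string>

class BloomFilter {
 public:
  void insert(const std::string& key) {
    for (std::size_t h : hashes(key)) bits_.set(h);
  }
  bool maybe_contains(const std::string& key) const {
    for (std::size_t h : hashes(key))
      if (!bits_.test(h)) return false;  // definitely not in this file
    return true;                         // maybe; we have to go and look
  }

 private:
  static constexpr std::size_t kBits = 8192;
  std::array<std::size_t, 3> hashes(const std::string& key) const {
    std::size_t h1 = std::hash<std::string>{}(key);
    std::size_t h2 = std::hash<std::string>{}(key + "#salt");  // cheap second hash
    return {h1 % kBits, (h1 ^ h2) % kBits, (h1 + 3 * h2) % kBits};
  }
  std::bitset<kBits> bits_;
};

int main() {
  BloomFilter f;
  f.insert("user:1");
  f.insert("user:2");
  std::cout << f.maybe_contains("user:1") << "\n";   // 1 (really present)
  std::cout << f.maybe_contains("user:42") << "\n";  // almost certainly 0: skip the file
}
```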
But we definitely use these kinds of tricks to be able to find the data in this LSM structure that I just showed you in the other picture. So, very, very important: we first write just this bulk-sorted data,
we go on keeping the stuff immutable, and out of those immutable bunches we trickle information down into this chunked, sorted stuff, and then end up having one big bunch sitting there that is really nicely sorted, with everything that you need quickly found and delivered, even if it's on storage somewhere.
But again, together with the merge sort algorithm on the other page, you end up creating a locality of performance that is even extensible now beyond the single machine on the network. As long as you keep the data local, you can do the sorting,
you can do the answering of all of that effectively almost as quickly as if you were on a single server, with the exception that now you can scale on writes like crazy, hopefully, and don't forget resilience. That is, when you lose a server, your data is still there.
This is used all over the place. I just did some name dropping, every big name in the field does that already. This I will just touch upon very quickly, problem of synchronicity. Why is that a problem? Well, if you have to debug something, if you have to find out what was first,
you have to be able to establish causality. If you cannot do that, you're going to be screwed. If you try to depend on local times, local clocks, nice, but it's not going to give you everything. Max, out of some presumption of grandiosity,
also brought up general relativity, which creates a... but trust me, it's not an issue. With computers we are here in the order of milliseconds; we are far away from what general relativity, with the speed of light on copper wires and so on, messes up within those 20 milliseconds, which is probably somewhere in the attosecond order or something.
Whatever. What I wanted to mention is that there are tricks of lying about the time. I put a link here. I can't really go into much detail because I was shown that I don't have so much more time.
Did I see I still have 15 left or 15 past? Thanks. Very good. I was surprised. I'm not going to rush that much anymore. Right. What is the trick here? The trick is that with every message that you send across the network,
this is really with every message. Anything that gets triggered down in your cluster, every replication message, whatever it is, every heartbeat, everything gets a timestamp attached to it. Not surprising, right, if you want to establish somehow an order.
But the thing is that the clocks are going to be skewed, and you're not Google and can't buy an atomic clock for every rack, something that costs a million just because you're going to make 10 million out of it. You have to somehow work with NTP, and NTP is going to have these kinds of skews, and that's fine. It's not going to be a huge problem most of the time.
So what do you do? Every time you send any message, you send a timestamp. Every time you receive a message, it's going to have a timestamp. If you see that timestamp to be larger than any timestamp you've ever had, or the time on your own box to be larger than the time that came in, you will establish a new time that is the maximum of these two times.
Now, this can be kind of odd. If you imagine your own daily business working like this, you're going to screw up most of the second part of the day. Because adding a second on top every time, if you're busy, is maybe going to push you at some point two hours off from the real time and so on.
And this is not going to work, particularly if the rest of the people in the office are agreeing on the same time protocol. The good thing is that even if you have a busy cluster going at full speed, there are a couple of cycles to take a breath. And the time to catch up with this fake time
that you're administrating locally. Again, you get the idea, it's a little lying about something that is supposed to be a ground truth, but surprisingly, it helps getting things at least in order. So the time is not going to be so much of a real time, but it's going to at least help you get things in order in a cluster.
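A minimal sketch of that take-the-maximum rule, a simplified Lamport/hybrid-logical-clock idea and not ArangoDB's exact implementation: every outgoing message carries a timestamp, and on receive the local clock jumps to the maximum of its own time and the received one, so causality is preserved even with skewed physical clocks.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

class HybridClock {
 public:
  // Called for local events and before sending a message.
  uint64_t now(uint64_t physical_ms) {
    last_ = std::max(last_ + 1, physical_ms);
    return last_;
  }
  // Called when a message with a remote timestamp arrives.
  uint64_t observe(uint64_t physical_ms, uint64_t remote_ts) {
    last_ = std::max({last_ + 1, physical_ms, remote_ts + 1});
    return last_;
  }

 private:
  uint64_t last_ = 0;
};

int main() {
  HybridClock clock;
  std::cout << clock.now(1000) << "\n";            // 1000: local clock says 1000 ms
  std::cout << clock.observe(1001, 1020) << "\n";  // 1021: a server 20 ms ahead wrote to us,
                                                   // so we jump past its timestamp
  std::cout << clock.now(1002) << "\n";            // 1022: still ahead of the wall clock
  std::cout << clock.now(1050) << "\n";            // 1050: a few ms of breath and the real
                                                   // clock has caught up again
}
```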
And trust me, if you're going back to Monday, Monday morning to a desk where you have to debug some really nasty crap, if it's involving five computers, it's not five times worse, but maybe something like 25 times worse. Actually, you can't put a figure on that.
It can be a nightmare. Without the, how do you say, RR. We use RR a lot in the office too. It's a nice tool that I think comes out of the Mozilla project. It's a deterministic debugger. You attach it at runtime to a cluster and later when everything is broken,
you are able to analyse a single machine of the cluster because RR collects all the system calls, all the network transfer and so on, allows you to find out what actually went wrong with the one server. If you attach a debugger to that server, what happens is everything stops, right? So if you think of the Raft algorithm that I showed you,
what happened to the one server that stopped sending stuff? It will immediately lose its leadership. So if you're trying to find a bug on the leader, yeah, right, you get the idea. So it's super important for us at least to know the order in which things went down. Hey, I have actually a slide with all the information that I threw at you.
Okay, here's the maximum that we take. And you always just create this logical clock; that's why it's called a logical clock. Whenever you have to call something logical, a logical volume, a logical clock, it's probably not going to be real, right?
Again, at least causality is preserved and a couple of milliseconds of breath are enough to catch up with reality. So that your own clock exceeds that fake number that you've been keeping all along, just because some server on the network is 20 milliseconds ahead. Okay, it's not going to work if it's one hour ahead.
Okay, we can agree on that. Your system is screwed then, right? Okay. Now, this is getting really funny because this is just name dropping now. How do you go about selling a database to anybody who's supposed to use it if you can't really deliver transactions?
And I think for a moment you would have to create transactions on the cluster. What would you actually need? Well, the same thing that you need is when you expect a transaction to work on a single server. There's this nice... How do you... Yeah, whatever. It's probably not right.
You should stay with computer science and physics. These nice names where you just combine stuff to make... whatever, ACID. ACID stands for atomicity, consistency, isolation and durability. What does it mean? Well, atomicity means that you can actually have such a thing as an entire transaction.
That is, you have brackets around a set of commands and they are atomic. When something is committed, it's committing the entire bracket or none of it, right? Not part of it, not something in between, not the end of it.
Consistency says that every time that you actually do a commit, as a result, you don't allow somebody to get a read out of the database where only part of that atomic operation is actually visible to that read.
This means that it's nice and great that you could commit the entire thing. Oh, OK. Out of power? Whatever. I'm just going to shout. Is that... OK, there, I am back. Probably somebody hacked my...
It's a hacker conference, probably. Anyway, so consistency means that it's nice that you had an atomic write, but do I actually see that entire transaction on the read that I'm getting? Isolation means that if you have multiple transactions going on, they don't affect each other.
So it's not very helpful if a transaction is trying to put some money in my bank account and another transaction is putting some money in some other person's bank account and these two transactions somehow influence each other. It's hard to sell that to a bank, right? And so that's what isolation is all about and durability is obviously
whatever you store in the database, you probably want to get back someday. So you have to look that it is persistent and that you have some kind of a failover in case that it breaks. This is already hard enough to get right on a single server.
You go into a network, as you would imagine, everything gets really complicated. So the atomicity now needs an agreement of some sort. Now, I told you, if you take Raft, Raft is going to be a designed bottleneck.
It's impossible to take your transactions and send them all into a Raft system. Well, it's not impossible, you can do that. Actually, that's what we do in the ArangoDB agency. But then you have to live with... it's hard to explain why you take all these computers just to make things slower, right? So what you can do,
and this is something that we're just discussing actively in the company because we're not there yet, is using the Raft system for storing transaction information, effectively the brackets, so the brackets that create these atomicity. How do you want to create a consistent snapshot? Take three DB servers that are replicated another time,
so altogether six servers, and you're trying to put your transaction somewhere into the system, onto disk. What if one of the replicas of one of the DB servers wasn't able to write the transaction, because somebody yanked a power cable, or an administrator
threw a cup of coffee, outside the million-dollar machine room, onto the cheap three-socket power strip from the hardware store? I actually saw that happen. This is not... And so it's really, really tough to roll back a transaction
in a database cluster. It's insanely hard, and we actually had a shot at it for a couple of months last year and realized that we have some insufficiencies in the class design of the abstraction layer where we are detaching practically storage engines
and the locking mechanisms that were involved, and we had to accept, okay, we have to do this properly again over, and it's going to come, I'm sure, but I can't tell you when. Okay, and then the isolation part
is going to be an entirely different ballgame because you have to be able to store something, the effects of a transaction effectively isolated somewhere until such time that you can flip a switch and say, this is the law of the land now. Whatever happens in that transaction is now everywhere ground truth,
okay, while your other transactions might be going on. This doesn't mean that the other transactions have to be successful, right? If there is a conflict, you will detect that conflict and roll it back, right? But that's exactly what isolation means. And then the durability, how do you handle a lost node? And trust me, that's even the smallest of these problems.
Nearly all the databases that are out there that you know by name, don't give you that in the network. Nearly all of them cover that on a single machine. And there are two notable mentions
because they've been... actually, one that has been open source all along, I think, is CockroachDB, and Spanner, which is a very big one, is closed source. FoundationDB, which Apple bought, as most of you may know,
and then just a couple of months ago, put it online and made it open source. And we are doing our best to get there. Again, you see the parentheses around are not a typo. Because it's really freaking hard. Okay, one idea of how to accomplish what I just said,
ACID in a cluster, is what is called MVCC, Multiversion Concurrency Control. Multiversion: well, multiple transactions will have multiple versions, and these are going to be concurrently somewhere in your cluster, up until such time when you say the control is going to switch
to the correct transaction and make them visible. So, writes and replication are done decentrally and they are not visible from the other transactions. But then you have some place where you have a switch that you can flip and then everybody at the same time knows all over the cluster,
at least until such time that somebody comes and does a read request or another transaction is going to be written to disk, what is the ground truth currently. Yeah, this is just repeating what I said over and over again. Timestamps obviously play a massive role
because you have to know what happened, what started when and ended when. And with that, I'm going to put up links to all the clever people that did the work that we implemented in ArangoDB. And let me see if I have... no, I don't have any more links. I would just invite you to try ArangoDB.
When I first used it, I thought, what the hell? A database with a UI? I thought they usually just have a prompt, and if there is a UI, it's going to be some admin front end for MySQL or something like that. But there's a serious database behind it. Don't get shocked by a nice UI.
And, yeah, please go on GitHub. And if you like it, star us. And the slides are going to be on my speaker page and I'm sure on the conference page. With that, thank you very much for listening.
You talked about scaling out,
but you mentioned that you completely replicate the data in the cluster. So, scaling out is just scaling out for performance or also for data, because that would require a partitioning mechanism. You didn't talk about any of that. No, as I said, so I had actually a nice joke to start with
and I forgot that, which is not something that I usually do. So, on the page where I had all these nice names that I was going to talk about, I was going to tell you that this is going to be a very, very incomplete picture of what's seriously going on inside ArangoDB. And so that page was pretty much like trying to explain Europe to people
by just mentioning London, Schweinfurt and whatever, Sweden and some lake in Switzerland or so. These are really all just patches. Yes, so in ArangoDB both ideas are accomplished. You can scale data by sharding collections effectively without limit,
practically probably with limits and also replicating data. So, it's a totally normal kind of cluster when we have a customer run 15 DB servers
where you have shards replicated three times. So, you have sharded your collections over a lot of DB servers, and all the DB servers are somehow running replicas of shards from some other DB servers... And then in the agency, in the Raft thing, there is an absolutely vital part of ArangoDB sitting
that is looking at all times, monitoring the entire cluster. This is a nice thing about Raft that I would like to just mention as a side note: if you really want to squeeze the last bit out of Raft, you probably put some program alongside the store on Raft.
That is, you don't just have a state machine with plain values, but you also have some piece of code that you can just plug into that, in this case a supervision for the cluster that at all times knows which DB servers have dropped out and where shards need to be... Some DB server is supposed to become the new leader of some shard,
and then some other replica is assigned, and vice versa. So, yes, it becomes very, very complicated in the back end, but ArangoDB doesn't only replicate but also shards.
Okay, one question. Ah, there's another one. I'm not very familiar with this topic, but a lot of this has some parallels with blockchain technology, I think.
So, how would you see, what similarity could you see in there and why not using something like blockchain in this topic? Yeah, so I'm not an expert in the blockchain, so my answer might not be completely correct. Okay, that's the disclaimer. But yes, blockchain is a consensus protocol,
but it has also got some limitations that are not so great, which is, for example, a blockchain, as we can see with Bitcoin just now, at some point might hit some kind of an artificial limit and needs to be made larger or something.
And also the way a blockchain trickles down information is compared to what you can do with Raft, by its nature slower, and a blockchain will also live with some inconsistent state at some places. So, you can have a transaction that is in your own blockchain,
not in the majority of other blockchains, and you would actually report on that if somebody would ask you. So, there are some fundamental comparable parts, but I think in general it's not really meant for this kind of ground truth.
It's more... So, I would maybe compare that to the difference between having your revision control completely spread over notebooks all over the world, versus having a central server. I think that maybe best describes the comparison.
I hope that was true first and helped second. All right. Thanks, everybody. This is a Sunday afternoon. I'm flattered. Thank you very much.