
Galera Replication Demystified


Formal Metadata

Title
Galera Replication Demystified
Part Number
50
Number of Parts
110
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Transcript: English (auto-generated)
Yes, can you hear me? So I wasn't expecting that many people for the last talk. Yeah, I can do it with the community here.
Okay, that doesn't work here. So, let's start very quickly. My name is Frédéric Descamps, and as you can hear, I usually speak French. I'm tired, so if you don't understand some of my bad English, just let me know.
I'm working for Percona as a senior consultant. I've been doing MySQL for a long time. And what is fun is that I started to use Galera in February 2010, when I met Seppo here at FOSDEM. And that's how I started with it.
So today the talk won't be too technical, but who here is using Galera? With MariaDB? With MySQL Community? With Percona?
So what I will show you now is valid for all of them. It's about Galera and how Galera works. Who thinks it's a bit magic and doesn't know how it works? So I will try to explain the magic today. I won't explain everything, only what's easy to cover, I would say.
We'll really go into the tricky parts, the certification for example. This is where it's a bit more complicated, and I won't have time to explain everything. But for the people who don't know Galera at all: usually when we talk about MySQL replication, we have a master and a slave, right?
So there is a hierarchy between the master and the slave, and this is what we call server-centric replication. We have a master and a slave, they have a role, right? With Galera it's different, we are data-centric: we don't care about the role of the server, we care about the data.
So don't try to use replication filters like, oh, I don't want to replicate this table or this database. Don't play with that, it won't work. You have the data everywhere and you see it as one dataset, right? You don't care whether the nodes are out of sync or not, Galera will worry about that for you, and you can write everywhere.
Well, it depends on the workload, but it works. Something which is nice also is that on a database you write with multiple threads, so you don't have only one client that writes, and you can also apply the changes with multiple threads on the other nodes.
That's not the default, you have to configure it, right? But you can do it; by default there is one applier thread, but you can add more. So what is replication, in fact?
Replication is to deliver the write sets that arrive on a node to all the nodes in the cluster. So you have a write set for a transaction and you need to deliver it to the other nodes, and all the other nodes acknowledge that they got the write set. Then the cluster generates a GTID that is global and that will be used to order this transaction,
and the cost is the round-trip latency to the furthest node. So if you have nodes, let's say in Europe, and you add one in Japan, the speed of your cluster will be the speed to reach that node in Japan and come back. Okay, so watch out. You say, oh, it was fast, we were happy with this, we just added one node in Japan, and now it's very slow. Our application is slow. It's normal, because the latency there is much higher, right?
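To make that cost concrete, here is a rough back-of-the-envelope example (the numbers are illustrative, not from the talk):

    round trip Europe <-> Japan : ~250 ms
    cost per commit             : at least one round trip, so ~250 ms
    => one connection can commit at most ~1000/250 = 4 transactions per second,
       no matter how fast each individual server is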
So the GTID is serialized, but you can still write anywhere and replicate in parallel, because the nodes communicate. So let's talk about these GTIDs. They are not the same as the 5.6 asynchronous GTIDs, but they look the same. This is one GTID, nothing special about it. And these GTIDs will ensure that all the members are consistent
with each other, right? So when a node joins, it says: oh, this is the GTID I had, or this is my current GTID, what's the state of the cluster? And then we know where we are in terms of data, right?
So when a node joins the cluster, we check the GTID. And we say, oh, it is missing some transactions or not, and if it is missing some transactions, we're going to give the data to that node. And when we stop all the nodes, we can use the GTID to compare them: where is my most up-to-date node, right?
And we'll use the GTID to see that. So the highest GTID means the node with the most recent data, and it's better to always restart from there. So if you stop all your nodes, check first which GTID is the highest, and then restart and bootstrap your cluster from that node.
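A minimal sketch of that procedure (the path and the recovered position are illustrative; the commands are the standard Galera ones):

    # on each stopped node, check the last recorded GTID:
    cat /var/lib/mysql/grastate.dat        # look at the uuid and seqno fields
    # or let mysqld recover it and print it to the error log:
    mysqld_safe --wsrep-recover
    #   WSREP: Recovered position: ac7e3a6e-93a7-11e5-b277-1e3f46f4f1de:12345
    # then bootstrap the cluster from the node with the highest seqno:
    mysqld --wsrep-new-cluster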
Depending on what version of Galera you are using, this is automated; in the newer versions it is. So, some more about GTIDs: this is what one looks like. And the GTID, in fact, contains the dataset ID, right?
So we have here the dataset ID, or the cluster ID, the same everywhere on all the nodes, like they said this morning in the MySQL 5.7 Group Replication talk, and then we have this transaction number, right? So we know, okay, this is the transaction we have done, and every time we do a transaction, we increment this counter.
So if we compare them, they look the same, right? But in Galera this first part is the dataset or cluster ID, while in 5.6 it is the server ID. So in 5.6 every server will have its own GTID, with its own server ID in this part, and the counter is the transactions processed by that server, right? So in 5.6, we write data, so the counter increases, and then we promote a new master, so it's another node.
So it's another ID here, and that node starts processing transactions, so the GTIDs change, and we can see it from there. In Galera, let's say we always write on the same node, and then another node gets the writes: nothing changes, we just keep increasing, okay?
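As an illustration, with made-up UUIDs:

    Galera GTID : <cluster UUID, identical on all nodes>:<seqno>
                  ac7e3a6e-93a7-11e5-b277-1e3f46f4f1de:1723
    5.6 GTID    : <server UUID, different on every server>:<transaction id>
                  7f33bd32-9c41-11e5-8994-feff819cdc9f:1723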
So this is the big difference between the two. So how do we assign this GTID? The UUID part is 128 bits, and let's say it's completely random; in theory it should use the MAC address and such,
which is not really the case. You can find more information related to the code, where it's generated, if you want to look. But let's say it's random, so there is nothing you control there, I'd say. The sequence number is 64 bits,
and it's generated during the replication itself, by the group. So when you do something and you write, the nodes communicate together, and they decide what the next sequence number will be. And for doing that, they use an algorithm called the Totem single-ring ordering protocol.
I cannot explain everything that's inside of it in the time I have, but this is how they decide what the next GTID will be. So just Google for it, or look on Wikipedia, and you will find white papers on how that works.
So, replication roles. Remember, I said, okay, we don't care about who is the master and who is the slave, because we can write everywhere, and every node gets the data. Within the cluster, all the nodes are equal. But we also sometimes talk, if you check the forums or the docs,
about the master node. The master node means the node where the application was connected and wrote the data. So it can switch, but for one given transaction, we're going to write somewhere, and that is the master node. This is where we're going to commit the transaction. And the slave nodes are the other nodes that will receive the transaction.
So even in Galera we also say master and slave, but this is just for a given transaction, not for the server itself. This has to be very clear, because sometimes people open a ticket,
and then we discuss: yeah, which node was the master? None, because I'm doing synchronous replication, I don't have a master. Yeah, but for your transaction, there was one master, so which node did you write to? And what's a write set? Because I always say a transaction's write set.
It's just the Galera term for a transaction. It's one or more row-based replication events. The binlog format has to be ROW in Galera, so these are that kind of event. So the write set is this RBR payload. It's a black box for Galera.
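As a minimal sketch, this is roughly what the relevant my.cnf settings look like (the provider path and node addresses are illustrative; the option names are the real ones):

    [mysqld]
    binlog_format            = ROW              # mandatory for Galera
    default_storage_engine   = InnoDB           # Galera only replicates InnoDB
    innodb_autoinc_lock_mode = 2                # required for parallel apply
    wsrep_provider           = /usr/lib/galera/libgalera_smm.so
    wsrep_cluster_address    = gcomm://node1,node2,node3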
And we also need replication keys; the master node will generate those. For these replication keys, we use the primary keys. If you remember, or if you were already using Galera a long time ago, at the beginning, if you didn't have a primary key, it was very bad.
You should have primary keys. A table without a primary key, at the very early age of Galera, could not even work. So primary keys, unique keys, foreign keys, the table name and the schema name: all of these together will be part of the write set, to be able to do the certification,
to see if there is a conflict in the data we are writing, right? In the newer versions, if you don't have these keys, they can be generated for you, but it's always preferable to have primary keys. All InnoDB best practices are still valid,
and even more so in Galera than in InnoDB itself. So the keys are shared with the write set, and this is what we use to make the certification possible.
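For example, a hypothetical Galera-friendly table, with an explicit primary key on InnoDB:

    -- explicit primary key: Galera derives its replication keys from it
    CREATE TABLE orders (
        id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        customer INT NOT NULL,
        amount   DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;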
So we say replication. For us, what's replication? I write data somewhere and it gets committed somewhere else: oh, I'm happy, this is replication. But when we talk about this kind of synchronous replication, it's a bit more, because we can split this replication in several steps or phases. The first one is apply: we need to apply the change, the data itself;
then we need to replicate it, we need to certify, and we need to commit. And the order can be different depending on the role for that transaction. For example, on the master, the machine where the server gets the transaction to be done,
it applies first: you modify the rows; then it sends the replication to all the other nodes; then certification happens; and then, if certification passes, it commits. On the slave, it gets the data first, then it certifies, and if certification passes, it makes the change and commits. So this is different.
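To summarize the two orderings just described:

    master (originating node) : apply -> replicate -> certify -> commit
    slave  (receiving nodes)  : replicate -> certify -> apply -> commit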
And what, then, is the certification? I go a bit fast, we can discuss later, as there is nobody after me, but I would like at least to go through the slides. So what's the certification?
The certification just answers this main question: can this write set be applied on the node that is doing the certification? If I can apply it, so there is no conflict, then yes, certification passes. If there is something that blocks me from writing that data on the node,
then certification fails. You're going to see that after. And what do I need to check? I need to check the data, and also the earlier transactions that are not yet applied. So it could happen, when I do the certification, that there are still transactions
in the queue that are not yet applied, and I need to verify against those too, because depending on the speed of the node, I need to verify all of that to be able to make this certification happen. When there is a conflict, the conflict always comes from another node. So that's why, depending on your workload,
if you have one table or one record that your application always changes, always the same record, then I will tell you that Galera is maybe not good for you, or at least don't write on all the machines at the same time. You need to write on only one node, and if that node crashes,
write on another node. Otherwise there will be too many conflicts, because you will write on all the nodes and touch the same data everywhere. Certification happens on every node, and is deterministic on every node. And here, the results are not reported back to the cluster.
Before, I said we write and we wait until the others get the write set before we can commit. But after that, there is no communication about the result. So here is the thing: we send the data,
and when we know the data has arrived on the slaves, or let's say on the other nodes, then we don't care about them anymore. It's enough. Whether they certify it or not, we don't care, it's up to them. I will explain that. If it works, so if certification passes,
the write set goes in a queue, right? And the commit will be done on the master. If certification fails, we have to drop the transaction, and on the master, we return a deadlock error. It's not a real deadlock; it's a deadlock error that mostly shows up when we commit. And that's where the application
needs to be able to deal with it: when it commits, it gets a deadlock error, and it has to handle that. It's the only way we can tell the client that certification failed without changing the whole MySQL protocol.
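Concretely, what the client on the originating node sees is the standard MySQL deadlock error; a sketch (the statement itself is just an example):

    -- inside a transaction that lost certification:
    COMMIT;
    -- ERROR 1213 (40001): Deadlock found when trying to get lock;
    --                     try restarting transaction
    -- the application should catch error 1213 and retry the whole transaction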
And the certification is serialized by GTID. The certification has a cost, and the cost depends on the number of keys and the number of rows. By default you're limited to one gigabyte, and you can go up to two gigabytes of changes, but you can understand that if you always have to pass around and certify two gigabytes of data,
it can be a lot of work. The apply is done on the slave nodes after certification. Write sets are applied in parallel if wsrep_slave_threads is bigger than one, which is not the default, and if there are no other write sets being applied that could conflict with them.
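A minimal sketch of enabling parallel apply (the value 8 is only an example; tune it to your workload):

    -- check the current number of applier threads
    SHOW GLOBAL VARIABLES LIKE 'wsrep_slave_threads';
    -- raise it so write sets can be applied in parallel
    SET GLOBAL wsrep_slave_threads = 8;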
The cost is also the size of the transaction. That's why, if you have large transactions, it's very bad for Galera when you write everywhere, because then you will have more conflicts happening, and this is what you should try to avoid. On the local node, if the apply doesn't work, we will have a brute force abort,
and I will explain what that is just after. Then we commit: this action is just writing the data, committing like a normal InnoDB commit, following innodb_flush_log_at_trx_commit, and it's done. We also serialize that by the GTID.
And this is done by the applier threads on the slave and by the client thread on the master. So now I will show you some graphs of how it works. With autocommit, what happens is we make a change on one node, and it replicates.
Then it's certified. And what you can already see here is that this is the only synchronous part. When we say synchronous replication: yes, the replication is synchronous, but the certification and the commit are not synchronous. They happen at a different time on each node. We try to make it as fast as possible,
but there is no guarantee; it could take time in between. We don't know the load of the other nodes, and we don't care, in fact. Of course, if you have a node that is lagging, you will see later that at a certain point it will cause your whole cluster to lag. But you can see, applying and finalizing the commit
never happen at the same time everywhere. It could be a long time after. And of course, the longer the delay, the worse it can be, and the more conflicts you can have. So you should be very, very careful with that. And it's the same when we do a full transaction,
right? We do all the statements, it gets replicated, and each node commits at a different time. So when we say synchronous replication: yes, the replication is synchronous, but that's the only thing that is synchronous, not everything.
It's not like what we usually call synchronous MySQL replication, where I write data, it's replicated, and only when the data is written on the other side do we say the replication is synchronous. Here that is not the case, right? Just to show you how traditional locking works:
when we have multiple transactions on the same machine, one has to wait here, because we have a lock, right? But on a distributed cluster we don't distribute the locks, or at least not yet, with the Galera versions we have until now.
So the transactions don't know about each other: they are two different transactions on two different systems, and it's only at commit that the nodes start talking together. At that time the second one, the latest one,
will get an error because it has a conflict, but only after the commit, not before. So here is an illustration of a certification failure, right? We change the same data on two nodes, we commit here, we synchronously replicate the data, certification happens on all the nodes,
and certification succeeds on the nodes. Where it succeeds, the write set goes in the queue, right? Everywhere it succeeds, we commit; the commit succeeded here, the data is there, and on the others it's still in the queue to be processed. When we commit here on the other node, certification will check against the queue
whether there is anything pending that could conflict, and here, when it commits, the certification will fail. The certification for this one will fail because there is a conflict: we keep the information about what we have changed,
and so here certification fails, we're going to return an error, and everywhere certification fails, we will not commit the data. Here we're going to do a rollback, and we tell the application: okay, we have an error. So these are just some slides to explain how that works,
because we have two types of certification issues in Galera, brute force abort and local certification failure, and this is an illustration of how they work. So here, you see we have a transaction that's open and we modify some data, and here we have the same data modified on another node.
When that one is committed, it gets replicated and certified everywhere, and when that works, it's applied everywhere. On this one, the next thing we do here, whatever it is, we will get a deadlock error. The deadlock already happened because of that other one,
because Galera said: oh, I have to brute force abort that transaction. But you will only get the answer at the next statement, not before, and mostly it's at the commit, and then you get the error at that time. Okay, but the other write set has already been applied.
LCF, local certification failure, is the same kind of thing: you see, we do a transaction, the write sets are in the queue, they are committed, then on this one we replicate, and here there is a conflict with something that is in the queue, and then we have, again,
a deadlock error, and this is the local certification failure. Time's up, I know. There's dinner. Sure, but you know, I give two or three minutes to everybody, I need to recover that. But there is dinner, that's true. Just one thing on flow control,
because you have seen the nodes are not committing at the same time, and they can have delay. We need to be able to control that delay, and this is where flow control comes in. It says: okay, we allow some delay, but not too much either, because if we have a too long delay, we could have more and more conflicts,
right? And this is what you need to monitor, in fact, on a Galera node: that you don't have this flow control kicking in, right? It happens only when the receive queue is larger than the limit that we set for the flow control.
And when flow control happens, what it really means is that the node says to the other ones: stop sending anything, I need to flush my queue. I'm too slow, I need to do some work, there is too much in my queue, I reached the limit. So everybody else will not do anything, your whole cluster is stalled,
you cannot write on any server anymore, right? So if the limit is too low, you will have flow control kicking in too many times. And you think, oh, let's make it very high; but if you go very high, you will increase the conflicts if you write on multiple nodes. So watch out,
and a big queue will also increase the work, because there is more to verify when doing the certification. And if it's always the same node triggering flow control, then check what is going on, because it may have bad hardware, or its configuration is not the same as the other ones.
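A minimal sketch of what to watch and tune for flow control (the fc_limit value is only an example):

    -- how many write sets are waiting to be applied on this node
    SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
    -- fraction of time this node was paused by flow control
    SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';
    -- raise the receive-queue limit that triggers flow control
    SET GLOBAL wsrep_provider_options = 'gcs.fc_limit=256';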
I will let you look at this graph, but you can see this node is not applying anything, it has the limit here, and it has the queue growing. At a certain point, when the queue reaches the limit, it says stop to all the other ones, and everybody waits. Nothing happens,
because they need this node to process its backlog, okay? That's it. Do you have a question? Yes? No, it's deterministic, so each node decides by itself,
it does the certification itself. You see, with this queue, for example, it checks whether there is anything in the queue that it has not yet processed that could conflict, and if the write set passes there, it will pass everywhere, because of the GTID ordering, because we do optimistic replication, let's say. So if it passes once,
we know it will pass everywhere. And if it fails somewhere, for example it passes everywhere but on one node, that node says: oh, for me, I could not do it, so I have an issue, my data may be corrupted, and it stops working. It drops out of the group by itself.
No, it goes out, and then you have to restart it, or something like that, and then, yes, it will rejoin and get the data. Yes?
Yes, yes, by default, yes, and then you have some settings you can use, separately for inserts, deletes, and so on. You can say: when I read data, I want the queue to be processed first.
It's me, I'm the devroom manager, but I'm also the last speaker. My t-shirt is over there. So we have the room until seven. So yeah, you can play with that and say: if you want to read
and be sure to have the most recent data, you only read when the queue has been processed. Another question? Okay. You can change it? In the first version, no.
It was just one setting for everything, and now you can set it for reads, for updates, for deletes, for inserts, for whatever you want. And it's a bitmask, so the values combine: one means reads, two means updates and deletes, four means inserts and replaces, and so you can play with whatever you want there.
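A minimal sketch of that setting (the table and column names are hypothetical):

    -- wait until the receive queue is applied before every read
    SET SESSION wsrep_sync_wait = 1;   -- bitmask: 1=READ, 2=UPDATE/DELETE, 4=INSERT/REPLACE
    -- this read is now guaranteed to see the cluster's most recent committed data
    SELECT balance FROM accounts WHERE id = 42;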
Okay, see you at the community dinner. I hope you enjoyed it.