The consequences of sync_binlog != 1
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 490 | |
Author | ||
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/46886 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
00:00
Chemical equationWebsiteSynchronizationEqualiser (mathematics)Computer animation
00:28
SynchronizationEqualiser (mathematics)Commitment schemeComputer animation
01:04
Replication (computing)Link (knot theory)Slide ruleMereologyMultiplication signSlide ruleSynchronizationParameter (computer programming)Different (Kate Ryan album)Replication (computing)Source codeComputer animation
01:51
MiniDiscComputing platformGoogolPoint cloudSynchronizationThread (computing)Price indexBenchmarkSynchronizationConfiguration spaceReplication (computing)BenchmarkSubject indexingClient (computing)MereologyMultiplication signInsertion loss2 (number)Different (Kate Ryan album)BlogDatabase transactionMiniDiscDatabaseSource codeComputer animation
03:16
Vertex (graph theory)Database transactionBlogBinary fileThread (computing)DatabaseLocal ringMiniDiscUniform resource locatorData bufferOperating systemDatabase transactionLoginMiniDisc2 (number)SynchronizationBinary codeSystem callBuffer solutionSet (mathematics)MereologyRotationMetrePoint (geometry)Operator (mathematics)Commitment schemeComputer animation
05:51
MiniDiscComputer networkPoint cloudIntegrated development environmentGradientEnterprise architectureService (economics)Local ringSoftwareMiniDisc2 (number)LaptopPoint cloudRight angleFiber (mathematics)Database transactionData storage deviceRAIDMultiplication signEnterprise architectureType theoryCache (computing)Source codeComputer animation
07:20
MiniDiscACIDData bufferBlogSynchronizationBinary fileDatabase transactionData recoveryLoginConsistencyBuffer solutionSynchronizationDatabase transactionOperating systemBinary fileCommitment schemeArithmetic meanSource codeComputer animation
08:33
MiniDiscBinary fileRotationModemLocal GroupMathematical optimizationDatabase transactionParallel portDefault (computer science)Revision controlParameter (computer programming)Default (computer science)Axiom of choiceSource codeComputer animation
08:55
Database transactionLimit (category theory)MiniDiscReplication (computing)Software testingBinary fileLocal GroupDifferent (Kate Ryan album)Database transactionMultiplication signCommitment schemeMiniDiscSynchronizationComputer animationSource code
09:34
Database transactionLocal GroupBinary fileParallel portThread (computing)BenchmarkServer (computing)Price indexMiniDiscDatabase transactionEqualiser (mathematics)Roundness (object)Scaling (geometry)GoogolClient (computing)Thread (computing)CurveComputer animation
10:33
Local GroupBefehlsprozessorParallel portDatabase transactionReplication (computing)Thread (computing)BenchmarkThread (computing)Link (knot theory)Database transactionGroup actionContent (media)Replication (computing)Insertion lossAxiom of choiceEqualiser (mathematics)BenchmarkParallel port2 (number)Computer animation
11:18
MiniDiscData bufferDatabase transactionBlogBinary fileSynchronizationData recoveryGraph (mathematics)Web pageMultiplication signSlide ruleDatabase transactionComputer animationDiagramSource code
11:54
Thread (computing)BlogSynchronizationBuffer solutionComputer fileCrash (computing)Physical systemPosition operatorLoginSynchronizationReplication (computing)Control flowDatabase transactionConsistencyComputer animation
13:27
Replication (computing)Replication (computing)State of matterDatabase transactionMoment (mathematics)Computer animation
14:45
Replication (computing)Configuration spaceCrash (computing)SynchronizationCASE <Informatik>InformationReplication (computing)Operating systemLoginBinary codePosition operatorComputer fileCrash (computing)Source codeComputer animation
15:44
Replication (computing)BlogSynchronizationRotationData storage deviceCloningOracleTelecommunicationReplication (computing)LoginBinary codeSynchronizationPosition operatorParameter (computer programming)Reading (process)Source codeComputer animation
16:38
BlogSlide ruleMoment (mathematics)LoginOperating systemSynchronizationCASE <Informatik>Client (computing)Asynchronous Transfer ModeCrash (computing)Equaliser (mathematics)Source codeComputer animation
18:54
BackupReplication (computing)Data recoveryDatabase transactionReplication (computing)Position operatorComputer fileCrash (computing)Moment (mathematics)LoginTable (information)BackupSynchronizationSource codeComputer animation
20:45
Point cloudMiniDiscComputer networkSynchronizationReplication (computing)Sieve of EratosthenesImplementationBlogConsistencyMachine visionRoundness (object)State of matterCrash (computing)Asynchronous Transfer ModeSynchronizationPoint cloudOperating systemSource codeComputer animation
22:05
CloningRWE DeaOracleMessage passingBit rateLink (knot theory)Bit rateLine (geometry)Multiplication signGame theoryOperating systemSynchronization2 (number)Server (computing)View (database)Replication (computing)ResultantConsistencyCrash (computing)Context awarenessCASE <Informatik>Software testingRevision controlDatabase transactionMereologySource codeComputer animation
26:35
Point cloudOpen source
Transcript: English(auto-generated)
00:05
I'm Jean-Francois Guernier. You can read my bio on the FOSDEM website. I want to take as much time as possible to talk about this subject and not me, so I'll start right away. So the title of the talk is The Consequences of Sync Binlog Not Equal to 1.
00:28
Actually, the full title of the talk should be The Consequences of Sync Binlog Not Equal to 1 and of ENO-DB flush log at TRX commit equal and not equal to 2, sorry.
00:40
Because like one without the other doesn't really make sense. Like if you still sync ENO-DB and you don't sync the binlogs, like you're still adding the cost of a sync. There might be edge case, but like that's what we're talking about. And I will use in the talk TRX commit for short because like this is just too long to write in slides.
01:03
So in the abstract of the talk I say that it's faster, but we'll see by how much. Then we'll talk about what is replication and what are those two parameters. I'll talk about avoiding setting Sync Binlog to a value different than 1.
01:23
Then the consequence mitigating and closing comments. So this talk is mostly about MySQL 5.6 and 5.7. Most of it also applies to ADO. Some things about MariaDB, and I will explicitly mention it, and I don't waste time taking pictures of the slide.
01:43
They're online and you can follow the slides right now on your phone if you're not able to see the lowest part of the slides. So faster by how much? So like we're in benchmark land here. So I did, I used sysbench single-threaded insert benchmark without secondary index
02:02
because I really want to test like the syncing part of MySQL on Google Cloud with SSD persistent disk MySQL 5.7.26. So Sync Binlog equals 1 and TRX commit equals 1. 200 transactions per second single-threaded. This is not a lot.
02:24
On the slave it's a little more because here my client is remote from the master. So there's a round-trip time between the client and the database. On the slave there's no such thing. If I reduce durability,
02:40
so Sync Binlog equals 0 to X commit equals 2, 3.7 thousand transactions on the master 18 times faster and 7,000 transactions per second on the slave 30 times faster. So obviously people adding a replication lag problem. This is a very tempting configuration to do to solve a replication lag problem, but it has consequences.
03:04
We'll see about this. The whole benchmark setup is discussed on one of my blog. So this is like, this is what we're talking about. This is the speed difference we're talking about. Quick overview of replication, a transaction commit on the master.
03:21
It goes in InnoDB and in the binary logs. Those binary logs are downloaded by the IO Tread. Sorry, it's the lowest part of the slide. Are downloaded by the IO Tread that connects to the master, download the binary logs, put them in the relay log, the SQL Tread executes the relay log, put data in InnoDB and eventually the slave
03:42
produces binary log if log-slave update is enabled. So Sync Binlog and TRX commit they're here. Those are per meter controlling what happens on InnoDB on commit and what happens on the binary logs on commit, and they're also here.
04:02
So Sync Binlog equals something, zero to any number. Well, I guess it's limited by a size of an int. So Sync Binlog equals something. The binlogs are flushed to every, to disk every something transaction. So if Sync Binlog equals one, the binary log will be flushed to disk.
04:23
So that means the data is actually on persistent disk, not in an operating system buffer after each transaction. If you set this to zero, the binlogs will be in the operating system, but you don't know if they're on disk. They will actually be flushed at every binlog rotation.
04:42
There are some problems about this, but I won't go into the details. So this controls if our binlogs are persistent on disk or not. TRX commit can be equal to one, two, or zero. One, the redo log, so the log of InnoDB, is flushed to disk after each transaction.
05:03
Two, the redo log is written to the operating system. So the data sits in an operating system buffer, but is not flushed on disk at this point. It will be flushed on disk every second for InnoDB housekeeping. And zero, we're not really talking about this.
05:23
It's like we avoid the system call at each transaction. The data is in a MySQL buffer, in an InnoDB buffer, and it will be sent to the operating system once in a while, every second. So that's what those two settings are controlling.
05:42
So I mentioned flushed. Flushed to disk. So this means that the data in an operating system buffer is actually persistent to disk. So this is not fast. So in an old spinning-rust disk, a flush was taking 18 to 25 milliseconds.
06:01
So that means we can do only 40 flushes per second. So like 40 transactions per second. That's not a lot. Consumer-grade SSDs, so what you probably have on your laptop right now, it's very variable. A flush can take up to 10 milliseconds, and on
06:21
some of the SSDs, it's 0.5 milliseconds. Just do an IOPing on your laptop. You'll know how much time it takes. Like enterprise SSDs, so all those are local disks, like one tenth of a millisecond. If you have a RAID cache with a RAM battery backup RAM, it's 0.04 seconds.
06:45
But we don't have only local disks. We have network disks, fiber channel, iSCSI, like any type of SAN. That's the network latency, 0.5 milliseconds to 1 millisecond, depending on the efficiency of your network. And in cloud environments, disks are remote.
07:03
So GCP or AWS, like between 0.5 milliseconds to 1 millisecond to go to the storage. In AWS, you can have local SSDs which are efficient, and that's fast. So like, flushing to disk is not fast.
07:21
And what does this mean? If So what does this mean is that every transaction is durable, like SYN BIN log equal 1, TRX commit equal 1, after each commit transactions are on disk, so we don't lose anything. But if it's equal 0 or equal 2, if MySQL D crash,
07:40
then the data is not lost because the operating system is still alive, and it has the data in the buffer. But if the operating system crashes, we lose transaction. And also, maybe InnoDB and the binary logs are not in sync. So here, if we do transaction A, B, C, D, and so on,
08:02
maybe the binary logs are synced up to E, and InnoDB is synced up to K. And then we do other transactions, and we crash. At recovery, the binary logs will contain up to E, and InnoDB will contain up to K. So we have a problem of consistency here.
08:21
And there's also the other scenario where the BIN logs are ahead of InnoDB. It's less, there's a less probability, but it can also happen. Skip this. So MySQL in 5.7 or
08:41
8.0, by default, the defaults are not now safe. MariaDB defaults are not safe. I'm not, it's not the best, like, the choice here is theirs. I'm not sure I agree with this. So, we would like to avoid
09:02
setting sendBinLog to something different than zero. So the solutions here are either to get faster disks or to run things in parallel, and so a single sync will persist many transactions at the same time. So this is a binary log group commit on the master, or if you use parallel replication,
09:21
so if you use, if you run more than one transactions at the same time on slaves, maybe you'll be able to commit many things at the same time. So we'll explore this a little. So we remember the numbers. Very low transaction throughput with syncBinLog equal one.
09:41
And as we use more and more thread, we're able to get a better transaction throughput. With syncBinLog equal zero and single-threaded, we have a more decent transaction throughput. This is not with replication. This is with a client doing a round trip, so around 3,000. But this curve grows much faster than the other.
10:03
So even if we're able to get a better transaction throughput with multi-threaded on the master, like, we need 100, 200 threads to get a better transaction throughput. And syncBinLog equal zero, like, we have better. So when you're looking for the best transaction throughput,
10:22
like, syncBinLog equal zero is still better. This is with four vCPUs in Google Cloud, and this is with 16. So it scales a little. With replication, MariaDB has something called slave group commit, which is pretty exciting.
10:41
It basically gets the transaction throughput of syncBinLog equal zero with many threads without the consequence of it. You can look it up. I have a link down there. And parallel replication on the benchmark I'm doing, on the insert benchmark, is like, this is the worst benchmark for
11:00
parallel replication because there's a lot of contention. SyncBinLog equal one, we scale. But with syncBinLog equal zero, it's still better. So, like, we're still, like, tortured here if we want the best transaction per second. SyncBinLog equal zero is still our best choice. Those are all the graphs on the same page.
11:22
So now about the consequence. So now we get into the real subject of the talk. This is a copy from the previous slide. So just remembering, transaction A, B, C, D. Binlogs are synced up to E. InnoDB syncs up to K, and then we crash.
11:43
Maybe it's time for one question here before I move forward. Everything is clear so far? Okay, so let's see about the consequences of this. So if the master crashed, so we're only interested in operating system crashes. Like, if MySQL D crashed,
12:01
we don't lose anything. Everything is in a buffer. So if a master crashed and we're using legacy replication, so file in position. So the slave is pointing in the binary log that will vanish. So after the crash, MySQL will create a new binary log. And if we keep writing,
12:25
data will be appended to the new binary logs. But the slaves are pointing to something that is now gone. So during the crash, the slaves are not able to connect to the master. But after the recovery, the slaves connect to the master, and they connect in something that doesn't exist, and then replication breaks. So that is obviously not cool, because if, like, now you lose all
12:48
your slaves. You can still write to your master, which is kind of inconsistent. InnoDB and the binary logs are not in sync. You lost some data. The slaves have actually more data than the master, because they didn't lose the last transaction. The master did. So you are not in
13:05
sync, and with broken replication, this is really, really not cool. And if you have lagging slaves that are pointing in an old binary log, like, they will keep going, but they will be, like, they will have corrupted data. So everything is out of sync. Pretty bad. If you're running with GTID,
13:30
so let's say at the moment of the crash, we have GTID up to 60 on the master, and the slave adds up to 58. And then we crash. Now the master goes back up, and because the GTID state is
13:47
stored in the master, the slaves start back in the binary log, the master starts at GTID 50. And so it will write 50, 51, 52. But the slaves has already won to 58. But those are the old
14:05
transactions, and now we have new transactions that have the same ID. So two things will happen here. Either the master will be able to write up to GTID 58 fast enough before the slaves reconnect, and then everything will work but will have corruption, or the slave will be faster,
14:24
it will reconnect while GTID 58 doesn't exist on the master, and then replication will break, because the slave has something that doesn't exist on the master, and that will break. So that's another scenario where things break here, and that's what I just explained. So that's in the
14:47
case of an operating system crash. If the slave crashed and we are using file and position replication, if we are using crash safe replication, basically the replication position is in ENO-DB. So ENO-DB will do crash recovery, that information will be consistent, and then the
15:06
slave is able to restart, to reconnect to the master. It still has binary log that we cannot trust, because now they're out of sync. But there's no data corruption in ENO-DB in this case.
15:21
So if you're running with GTID with binary logs disabled on the slave, which is a feature of 5.7, the position, the GTID position, is also stored in ENO-DB, so there's no problem here. The data is safe, the binary logs on the slave obviously is out of sync with the data, but the slave will reconnect to the master, and the data of the slave will be consistent.
15:45
But if you're running with binary logs enabled, because the position of replication is stored, if you're running a slave with GTID and binary logs enabled, 5.6 or 5.7, the GTID position of the slave is in the binary log. We crash, we lose binary logs, we're out of sync between
16:05
ENO-DB and the binary logs, and so we start replicating at the wrong place. So again, data corruption, either replication will break or we will have silent data corruption. So those are the consequences of running with this parameter. Something in MySQL 8.07,
16:29
GTID position in ENO-DB, unclear. There's hardcore reading about this if you want more details. And now we are at mitigating this. So if you're running with sync binlog not equal to zero,
16:47
basically there's one thing that you need to remember, is that either on the master or on the slave, after an operating system crash, you cannot trust the binlogs. Like obviously,
17:00
you didn't sync them, you cannot trust the binlogs. So what can you do in a situation where you cannot trust your binlogs? So on a master, after restarting MySQL after an operating system crash, what I do is I make sure the master will restart in offline mode equal on. So no clients
17:24
connect to it, no slaves, no clients, and then I decide as a DBA what to do. The best thing is to failover to a slave, but if I do not automate failover, if this is manual, I do not want clients to write to my master and I do not want slaves to connect to it.
17:47
And here I arrive as the DBA, I check if this is an operating system crash, in which case I need to failover, or maybe it's just a MySQL D crash and then I set offline mode equal off, and then I continue, because at that moment I didn't lose anything.
18:08
On slaves, if MySQL restart after such a crash, if you have binary logs on the slave, you need to purge them. You need to completely delete
18:22
the binary logs, because now potentially you have a hole in there. If the slave will eventually be a candidate master, you cannot trust its binary logs, so wipe the binary logs on a slave. If you're running with GTID, a reset master will erase your GTID position, so you need
18:43
to restore this. Intermediate master are both master and slave, so you have to do the two things. When you're running with GTID on a slave with syncbinlog not equal to one,
19:03
so GTID replication is not crash safe. So normally what you would have to do is you would have to restore a backup, and this is very annoying for a DBA to restore a backup. So there's a way to avoid this. It's because file and position in a table can be trusted
19:25
at that moment. So if you're running crash safe replication, not with GTID, with file and position, there's some voodoo you can do here, either if you're running single-threaded or if you're running multi-threaded with slave preserve commit order, there's some voodoo you can do here
19:45
to avoid restoring the backup. And so the idea here is when the slave restarts, make sure that replication doesn't start. You cannot start at this moment. The GTID position cannot be trusted, but the file and position can be trusted. So what you can do here
20:05
is you note the GTID executed of that slave, and then you wipe the binary logs. Reset master, the binary logs on that slave cannot be trusted. Then you restart replication with file and position, which is trusted,
20:21
and then you need to restore the GTID position. And this you'll have to figure out by yourself. It's left as an exercise. If you're not able to figure this out by yourself, you shouldn't do this voodoo. So that's how I run my slaves with sync binlog not equal to one.
20:46
So in conclusion, we saw the consequence, we understood it, we understood how to avoid it, how to mitigate it. So I have a guess, an educated guess, that more and more people
21:02
run with sync binlog not equal to one because of the cloud. Syncing in the cloud is very slow. So you need to run with sync binlog not equal to one if you want decent TPS in the cloud. And so to be fully cloud ready, MySQL should make it easier for us DBAs to run in the cloud.
21:25
So basically I would like auto offline mode after an operating system crash, and auto skip slave start after an operating system crash. And I have three rounds on my favorite feature. So GTID makes things very complicated. So there's a GTID state in the table,
21:46
there's a GTID state in the binary logs, and now there's a GTID state not in the redo log, it's an ENO DB. So there's cleanup to do here. And one last thing about GTID, somebody left but this GTID state is just horrible. So we need a way to clean that.
22:07
Sorry? Just split into multiple lines. Yes. So I have links in my slides. Classic thing. My employer is hiring. Please rate
22:21
my talk. And I have time for questions. Three minutes for questions. Yes, Marco.
22:42
Yes. So the question is, did I do tests about not syncing ENO DB? And the context of that question is that in some next versions of MySQL, MariaDB, there will be features of
23:06
keeping consistency without syncing ENO DB. So like removing a sync instead of having two, you'll double my results. 200 transactions per second will end up being 400, which is very far from 3,000. I haven't formally tested that, but I can do an
23:24
educated dig-est here. So it's a nice improvement to remove a sync in ENO DB. Personally, I think it's not a game changer. We're just changing by a constant. We're
23:43
if I would have to choose between this feature, not syncing ENO DB and keeping things consistent, and making MySQL or MariaDB easier to run with sync binlog equals zero, I would choose this one. But different people work on different parts of the server. So it's still an improvement,
24:06
but I don't think it's a game changer. Other questions? Yes. So the question is about having synth binlog equal to one and TRX commit equal to two.
24:27
So currently there will be inconsistencies between the binlogs and ENO DB. It will remove one sync, like doing one sync on transaction commit instead of two,
24:41
which in the end is not a game changer because we're still waiting for this. So I haven't done tests and thought much about it. My intuition here is it's not in the current context, it's not super useful.
25:00
Any other questions if I have time? There's one last question. Yes. Would it make sense or what's your view on instead of trying to maybe handle maximum
25:24
persistence every single one, or if you don't have the throughput, okay, it's crashed, let's failover. Yes. So the question here is instead of taking each server and being very careful about the consistency of each server, thinking about MySQL as the full distributed system,
25:47
and if the master crashed, failover to the slave, this is the solution in the case where you have an operating system crash on a master if you're running with some binlog equals zero. You need to failover to a slave,
26:03
because you potentially lost data and your slaves are corrupted. Yes, that's the solution. The DBA needs to think about not one server, but the whole distributed system, which is like replication is a distributed system, so we have to failover. Yes. I'm not sure I have time for another question,
26:21
but I will be out of the room if you have questions for me. I'll be just out of the room and ready to answer your questions. Thank you very much.