Tapping Into the Binary Log Change Stream
Formal Metadata
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/44568
FOSDEM 2019, part 526 of 561
Transcript: English (auto-generated)
00:06
Alright, hello everyone, my name is Luis and I've been working on MySQL replication for a little over 10 years now. For those who don't know me, I'm Luis, I lead the replication team, I've been around
00:25
for 10 years, I'm based out of Portugal, I've spoken here at FOSDEM a few times already, it's one of the places I like to come and speak about much more technical, deeper kind of
00:41
things. And I could talk about replication for hours, I think that some of my colleagues are well aware of that. But today I'll just be talking about a very specific thing, which is the binary log and
01:01
more specifically how the binary log is created inside the server, how it is utilized by different components inside the server including replication and group replication and how we can look into this framework that exists inside the server and maybe end up this presentation with a little bit of boilerplate code if you will to create maybe some plugin that
01:28
can actually tap into the server and extract this. So I'll start just, of course, I'll talk a little bit of MySQL replication but not that much, I promise.
01:40
And then I'll describe the binary log, I'm pretty sure that everybody here knows what it is, the bin log for short. Actually, a raise of hands: who here uses statement-based replication only? Who here uses statement-based replication?
02:01
And who here uses row-based replication? The rest, right? So we will also be focusing on row-based replication for obvious reasons when it comes to change data capture using the bin log. And then I'll try to go into this looking into a couple of use cases, the MySQL group
02:23
replication plugin for instance, which extracts the changes from the server and pushes it out to this Paxos communication framework and does all this coordination and then decides something and tells the server to commit or roll back transactions and so on. And then I'll conclude the session, this is basically the plan for this 20-minute session.
02:45
So MySQL replication, this is a very common slide that I usually pull up on our sessions. Replication is very simple, master-slave and then there's a log that is shipped around. This log records the changes that happened in the primary or the master and
03:04
these changes are then propagated to the slave and the slave applies them and if it generates its own change log, its own binary log, then you can build all these different topologies that for instance Peter was talking about earlier. For better or for worse, MySQL replication is almost like a Swiss army knife or a loaded
03:25
gun as Peter said as well. So you can hurt yourself plenty with it, but it can really get you out of some nasty problem situations that you may run into. The interesting thing that we're talking about here today is the binary log and
03:41
the master generates a binary log, the one that is shipped around and the slave also generates the binary log. There's the persistent buffer which we typically call the relay log on the slave which is basically get the changes, store them on disk while I'm trying to apply the log and
04:04
just persist them on disk while this is going on, so that we don't saturate memory, for instance. This was very important a long time ago, but not so much nowadays, because nowadays we have parallel slaves and different parallelization schemes; with write sets we can come to an almost optimal execution history in terms of parallelization.
04:25
So this persistent buffer is not so much needed anymore as it used to be, but it's still there. And we can build multiple replication topologies, you know, simple primary secondary or master slaves, a chain or a tree or a star or a circle or a group now with group replication
04:48
where we can have external secondaries from the group so you can mix and match group replication with regular asynchronous replication or you can have an external primary and then replicate into a group.
05:02
And why is this important in this session? It's important because we can have all these combinations of topologies together and therefore we can have dedicated instances of MySQL where we can actually play around to do some interesting things related to change data capture where we don't disturb the rest
05:24
of the topology. Over the years I've seen a lot of people do these kind of things like having a dedicated slave with row-based replication enabled to be able to mine the binary log, transform it, load it into some other system like Hadoop or something else.
05:43
Okay, so the binary log. The binary log is used for replication but not only replication: point-in-time recovery, integrating with other technologies, rolling upgrades. It is very important when it comes to rolling upgrades because it makes it so much easier to just do a rolling upgrade
06:00
on your topology and more. And as I said, we're focusing on row-based replication format because it's a feature-rich format to be able to load data, capture the changes that happen inside the server as they are recorded in the bin log.
06:20
Over the years, the metadata that we put into the bin log in row format has been extended so that, for those of you implementing these special cases of capturing and transforming the bin log and loading it
06:44
into something else becomes easier. For instance in 8.0 we have more metadata when it comes to the table definitions so the table map log event for those that know the contents of the bin log has additional information such as the name of the columns, such as which fields are the primary key,
07:04
well, whether the column is signed or unsigned and so on, the character set and so on and so forth. And at its core it's really just a sequence of events, we call them events; there are some types of events that are control events, like rotate, format description,
07:24
GTIDs and so on and then there's the data itself which is carried over as, you know, a query log event or row-based row events. But in the end it's a sequential history of the execution that happened on the master with the changes that were produced by then.
07:43
And how is this change log formed? First of all, and mind you, I reiterate that we're focusing on row-based replication, the way it is formed is that when a transaction begins and a statement is executed at the
08:02
handler level between the SQL layer of MySQL and the storage engine, MySQL, as many of you probably know, is a layer that is built on a layered architecture where you have this SQL layer which has the runtime environment and so on and so forth and then there's
08:21
the handler that separates the SQL layer from the storage engine itself. So when the data is propagated, you know, from the SQL layer to the storage engine, we intercept the changes, the row changes, the record changes, we capture those, we put those changes in the memory cache, in the memory buffer and we buffer this until the
08:45
commit comes. When the commit comes, what happens is that we get this prepared in the storage engine, so internally we run a two-phase commit for instance between the bin log and the storage engine, but right before we flush these changes to disk, to the file, to the
09:03
bin log file, before we persist these changes, this cache in the bin log file, we notify a plugin or plugins that could be listening to these changes. Think about group replication for instance. At this point in time, before we flush the transaction to disk, we notify a listener for this change, group replication for instance,
09:24
and group replication takes this data and pushes it out into Paxos. By the time it gets back from Paxos, okay, the transaction is actually committed, so we flush it to the bin log and then we notify plugins that might be listening that, you know, I want to be notified once the transaction
09:43
commits. But the important thing to note here is that the capture is all done in memory and by the time a transaction commits, we share these captured changes with plugins or other listeners inside the server. And in this way, you can think of it as a,
10:00
you know, a change, an event bus for data changes that is running inside the server. And this stream of changes materializes as a set of files on disk, which we typically know as, you know, when we talk about, the bin log is really very much overloaded nowadays,
10:22
so it means many things. And usually people, when they talk about the bin log, it's a set of files that you can go in, inspect what their content is with the mysqlbinlog tool, like Shlomi was talking earlier here today as well. But at the end of the day, the change stream materializes as a set of files on disk. Yeah, so there's the mysqlbinlog tool to actually
10:47
go in and open these files and see what's in there. So use cases for change capture using the bin log. I'm going to talk about, or show some examples of some projects that build on the bin log to do some kind of change propagation
11:05
or some kind of procedure automation that really benefits from this log. I'm not going to talk about things like flashback because Shlomi already did and I don't have it in the slides, but that's another use case that you can use this change log
11:24
to actually replay it backwards and do these things like recovering, if possible, recovering an instance to a certain point in time by replaying the change log in reverse order. So it's actually like a compensation action rather than reversing.
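The compensation idea described above can be illustrated with a toy sketch: walk the row-event history backwards and invert each operation, so an insert becomes a delete, a delete becomes an insert, and an update swaps its before and after images. This is only an illustration of the principle, with made-up types, not the flashback tool's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class RowOp { INSERT, UPDATE, DELETE };

struct RowEvent {
  RowOp op;
  int before;  // stand-in for the before-image of the row
  int after;   // stand-in for the after-image of the row
};

// Invert one row operation to produce its compensating action.
RowEvent invert(const RowEvent &e) {
  switch (e.op) {
    case RowOp::INSERT: return {RowOp::DELETE, e.after, 0};
    case RowOp::UPDATE: return {RowOp::UPDATE, e.after, e.before};
    case RowOp::DELETE: return {RowOp::INSERT, 0, e.before};
  }
  return e;
}

// Replay the history in reverse order, inverting every event:
// applying the result undoes the original changes.
std::vector<RowEvent> compensate(std::vector<RowEvent> history) {
  std::reverse(history.begin(), history.end());
  for (auto &e : history) e = invert(e);
  return history;
}
```

Note this works only with full row images in the log, which is one reason row-based format matters for this use case.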
11:44
Okay, change tracking. I think René is somewhere here. So René in proxy SQL, for instance, taps into this change stream and in this proxy SQL case, the process that mines the stream is actually ignoring the changes but capturing
12:06
the metadata that is in the change stream. It's extracting the GTIDs that run through this change stream and collecting a global knowledge of the system to know and understand in which state is each of the replicas in my system so that I can make decisions
12:24
about the routing of queries to the proper replica. This is, very roughly, how ProxySQL implements consistent reads across a set of replicas.
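The routing idea just described can be sketched roughly as follows. All names here are hypothetical, and GTIDs are reduced to plain sequence numbers for brevity; this is not ProxySQL's real implementation, only the shape of it: track a per-replica watermark from the change stream, and route a read that must observe a given position to a replica that has reached it.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Tracks, per replica, the highest position seen in its change stream,
// and routes consistency-sensitive reads accordingly.
class GtidTracker {
  std::map<std::string, uint64_t> watermark_;  // replica -> last applied seqno
 public:
  // Called by the miner for each GTID observed in a replica's stream.
  void observe(const std::string &replica, uint64_t seqno) {
    watermark_[replica] = std::max(watermark_[replica], seqno);
  }
  // Pick any replica known to have applied at least `needed`;
  // fall back to the primary when no replica is caught up yet.
  std::string route(uint64_t needed) const {
    for (const auto &[replica, seen] : watermark_)
      if (seen >= needed) return replica;
    return "primary";
  }
};
```

The key observation from the talk holds here: the miner never looks at the row data itself, only at the GTID metadata flowing through the stream.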
12:43
Shlomi also has this tool called gh-ost, GitHub's online schema-change tool, which basically is a tool that tries to do a user-land online schema change by copying data from one version of the table to the newer version of the table
13:05
and in the meantime build on, tap into this change stream that is being executed while this process is running and trying to collect these changes to later on apply them to the schema transformation as well and do all this in an automated fashion and online.
13:25
I mentioned already the rolling upgrade process is a really powerful use case or a really good fit for logical replication or change-stream-based replication where you can have
13:40
servers in different versions. They probably have, for instance, different on-disk layouts, different physical layouts, so you can rely on the logical stream to actually make sure that you have a sound way forward with respect to rolling out different versions of the server. Data integration. This is, again, a very common use case. I put here Oracle GoldenGate,
14:06
but a lot of people do it with, for instance, Maxwell and other kinds of framework where they tap into the bin log stream, connect, for instance, as a fake slave to a master. They bring the bin log, they process it, transform it maybe into some more language neutral
14:28
format, JSON or re-encoded in protocol buffers or something like that, and push it out into some other message bus, in this case, for instance, Kafka or something else. Some people actually
14:41
use that as well to move data back and forth from different databases, from MySQL into Oracle or from Oracle into MySQL or from MySQL into some other database and back and forth. And for point-in-time recovery as well, right? We can take a snapshot, we can rely on the bin log to make it so that we can take that snapshot and roll forward some
15:05
of the change log that happened after this snapshot up to a certain point in time. So we can roll forward the changes just like Shlomi did on his flashback presentation. He said, I want to roll back, compensate or revert the state of my server up until this point.
15:24
We can do that, of course, in the opposite way as we all know. So, advanced change data capture. So coming back to this interesting diagram here, you know, we capture the changes while the transaction is executing. Once the transaction
15:43
is about to commit, we notify that it's going to be committed and then we flush these changes to the bin log and then we notify that these changes have been flushed and committed in the bin log in the storage engine locally. So let's have a look at the group replication use case. Group replication is a plug-in,
16:02
it taps into the server, it has all these, you know, fancy diagrams and stack, very nicely layered architecture, different modules for capture, apply, recovery, conflict handling and so on. And it has this very interesting module here, which is the capture
16:22
part. And the capture part is really the guy that implements these hooks to tap into the server and say, when you load the plug-in, it says, okay, I'm going to register myself as a listener for the events that are propagated. And then at commit time, I want to be informed that these
16:40
events have happened. So I want to take part in the decision of what's going on in this transaction execution timeline. And I want to know what was actually changed. So it intercepts the change buffer and sends it out to the group in the case of group replication. When the execution takes place, it captures the changes, then it pushes them further down in the stack
17:05
to the group communication engine. And at the very low end, a Paxos consensus round is performed, you know, a majority acknowledges, we all make the same decisions everywhere. We have total order delivery in the system and then everybody advances in the same order.
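The property just described is the classic replicated state machine: once consensus fixes one global sequence of transactions, every member applies that same sequence and therefore computes the same state. A toy illustration, with the consensus step itself elided and all names invented for the sketch:

```cpp
#include <cassert>
#include <string>
#include <vector>

// One group member: its state is simply the history it has applied.
struct Member {
  std::vector<std::string> state;
  void apply(const std::string &txn) { state.push_back(txn); }
};

// Deliver the agreed-upon sequence to every member in the same total
// order; deterministic apply then guarantees identical states.
void deliver_in_total_order(const std::vector<std::string> &agreed,
                            std::vector<Member> &members) {
  for (const auto &txn : agreed)
    for (auto &m : members) m.apply(txn);
}
```

In the real plugin, of course, it is Paxos that produces the agreed sequence, and the apply step is the transaction commit itself.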
17:24
This is, at the end, what results in this distributed replicated state machine. But at the end of the day, what I'm really interested to show here is how group replication intercepts these changes.
17:40
So the interception takes place here, right before we flush to the bin log. At this point in time, the changes are still in memory, in this buffer, in a memory buffer. It's an IO cache, for those that have actually looked into the code, it's an IO cache. And this means that the cache is mostly in memory, so it has a fixed size
18:07
memory buffer. If the transaction changes are larger than this buffer, it spills to disk, so it swaps. But at the end of the day, we can consider it just as a memory buffer. So we take this buffer and we share it with group replication.
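That spill-to-disk behaviour can be modelled very roughly like this. The class below is a made-up stand-in for illustration, not the server's IO cache: appends fill a fixed-size in-memory area first, and anything beyond that overflows to a disk-backed area, while consumers still see one contiguous buffer at commit time.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified model of a transaction cache with a fixed memory budget.
class TxnCache {
  size_t mem_limit_;
  std::string mem_;    // fixed-size in-memory part
  std::string disk_;   // stand-in for the temp-file spill area
 public:
  explicit TxnCache(size_t mem_limit) : mem_limit_(mem_limit) {}

  // Append captured changes; overflow past the budget spills "to disk".
  void append(const std::string &chunk) {
    size_t room = mem_limit_ > mem_.size() ? mem_limit_ - mem_.size() : 0;
    mem_ += chunk.substr(0, room);
    if (chunk.size() > room) disk_ += chunk.substr(room);
  }

  bool spilled() const { return !disk_.empty(); }

  // At commit time the whole buffer is handed to listeners as one blob,
  // regardless of whether part of it lived on disk.
  std::string drain() const { return mem_ + disk_; }
};
```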
18:26
This is the same diagram with slightly more detail. We execute, we commit, we prepare, we start the two-phase commit protocol inside the server. Then we take this, we replicate in
18:41
the before commit hook, we give it to group replication. Once group replication says yes, okay, you're good to go, we commit the transaction. Otherwise, we roll back the transaction. Maybe there was a conflict, maybe this guy was sent out of the group, maybe something wrong happened. But if the group replication says yes, the commit procedure continues, we flush
19:04
and sync to the bin log, we commit to the storage engine. Committing into the storage engine means we externalize to the SQL layer what has been changed. So all locks are released and now you can see you have a different view of the data. When we flush and sync to the bin
19:25
log, it actually means that we externalize for consumers of the bin log. So slaves will already see it and other components in the system that are listening to this replication stream will already see it. When we propagate to group replication, it means that we externalize
19:41
already there to the members of the group. So that's one minute, okay. So that's when the actual commit takes place. So it's externalized to the group here as well. And what are the internal APIs for this? If you go into the source code and you look,
20:02
you'll notice that group replication implements these APIs. The before commit hook and then the parameter for this callback which contains these caches that contain the changes.
20:20
So it gets this notification with the caches and the group replication plugin can go in, open these caches, learn the changes and so on and so forth. This is the oversimplified version of the before commit hook. It just basically says, well, I got the caches, I did whatever I needed to do with them and at the end of the day,
20:45
I send it out to the group. I issue a send message to the group communication system. This, at its core, is that simple. So the key point is that the capture and extraction are the same as for regular replication except that it happens earlier in the commit phase.
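The hook mechanism just described can be sketched as below. The names (`TrxParam`, `register_observer`, and so on) are hypothetical simplifications, not the server's real API: the server keeps a list of registered observers and, right before flushing a transaction's caches to the binlog, calls each one with the captured changes; a non-zero return vetoes the commit, which is how a plugin like group replication can turn a conflicting transaction into a rollback.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Stand-in for the callback parameter carrying the captured caches.
struct TrxParam {
  std::string stmt_cache;  // captured non-transactional changes
  std::string trx_cache;   // captured transactional changes
};

using BeforeCommitHook = std::function<int(const TrxParam &)>;

class Server {
  std::vector<BeforeCommitHook> observers_;
 public:
  // A plugin registers its hook when it is loaded.
  void register_observer(BeforeCommitHook h) {
    observers_.push_back(std::move(h));
  }
  // Called right before flush/sync to the binlog; returns true if the
  // commit may proceed, false if any observer vetoed it (roll back).
  bool before_commit(const TrxParam &p) {
    for (auto &h : observers_)
      if (h(p) != 0) return false;
    return true;
  }
};
```

A plugin's hook would, in the group replication case, ship the caches to the group communication system and block until the group decides.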
21:07
And just to finalize this session, there is this observer plugin, which is kind of like boilerplate code that is also shipped in the source code of the server, which just implements a very dummy
21:21
capture plugin. This is kind of like a summary of the source code of this plugin. It just implements the before commit hook and the only thing it does is increment a counter. So this is actually just the before
21:42
commit hook so it implements a counter and the way it works is that when this plugin is loaded, it registers this hook in the server so that it intercepts this diagram that I showed earlier so that it intercepts this notification and then do something. In this case,
22:01
it just does before commit counter plus plus so increments this counter. And that's what I said basically. And if you're really interested into going to the, looking to the source code and whatnot, of course you can. It's, you just go to GitHub's MySQL server repository and
22:21
look at the replication_observers_example.cc file. So to conclude, the binary log is a very useful building block. We see it every day; even today we had a presentation explaining it, for instance. It is not just for replication and
22:48
since a very long time ago, since early 5.7, we have had this framework that plugins can tap into to listen to notifications. One comment before I close,
23:02
we didn't have back then but now we have in MySQL 8, we have this beta infrastructure which we call the service registry, where it's so much easier to register, you know, services inside the server so that we have another different set of consumers that, you know,
23:20
just go to the service registry, look up the service that they want to listen to, register as listeners and so on. And we are thinking about moving these hooks that I was explaining here into the service registry to make it even more general for you, me, anyone write a plugin to actually make use of this infrastructure in a much easier way going forward as well.
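The registry idea described above can be sketched loosely like this. All names are hypothetical and the real MySQL component services infrastructure is considerably richer; the sketch only shows the lookup-by-name decoupling the speaker is pointing at: a component registers a named service once, and any consumer finds it by name instead of being hard-wired to a specific hook.

```cpp
#include <cassert>
#include <map>
#include <string>

// Minimal name-to-implementation registry.
class ServiceRegistry {
  std::map<std::string, void *> services_;
 public:
  // Register an implementation under a name; fails if the name is taken.
  bool register_service(const std::string &name, void *impl) {
    return services_.emplace(name, impl).second;
  }
  // Consumers look services up by name; nullptr means not registered.
  void *lookup(const std::string &name) const {
    auto it = services_.find(name);
    return it == services_.end() ? nullptr : it->second;
  }
};
```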
23:45
So that's that. Thanks for, you know, listening to me and this is really cool stuff. I really hope you enjoyed it as much as I do. Okay, thank you. Any questions? Okay.
24:11
The question is, maybe you just answered it in this last sentence. I think the problem we see today is like you said, there's many drivers or like many processes that will tap onto the
24:21
binary log, in different programming languages, and anytime the binary log format changes or there's a new data type or whatever, everybody needs to update their code in 12 different programming languages. So the question is, will there be a mechanism where we will just tap onto the plan protocol or some API and just get the changes as JSON or whatever, something that
24:44
it's easier to consume? Let me repeat the question. So the question is really, will there be a less opaque format to the bin log so that we can make full sense out of it? That's the question. We are really looking into that because the bin log is actually an API to
25:03
integrate MySQL into other systems. Someone earlier today was on this stage saying, we have this, this is very good, we have that, that is very good. The real power comes when we combine these two things together. So any two components, MySQL, some other framework,
25:21
they need to speak to each other to provide combined value and we see that many times. So we are looking at making the bin log format much more interoperable, if you will. All right, so if you have more questions, just grab me outside and I'll be happy to talk to you. Thank you.