MySQL 8.0 Performance: InnoDB Re-Design
Formal Metadata

Title: MySQL 8.0 Performance: InnoDB Re-Design
Number of Parts: 644
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/41495 (DOI)
FOSDEM 2018 (556 / 644)
Transcript: English (auto-generated)
00:06
Hello, everybody. So let's start. I have only 20 minutes to tell you all the fun stuff. This is about me, very short.
00:21
I'm working full time on MySQL performance. Extremely fun stuff, and I could tell you much more later; yes, I have only 20 minutes, so I'll try to do it in 15. So, what are my main problems?
00:41
So in fact, if your problem is something like how to tune your query or how to use MySQL in the right way, everything you can read in the manual, it's not my problem. My problem is about what you cannot fix, what is broken by design or does not work by design
01:01
and so on. Many problems are known, many problems we just discovered, and, well, it's work in progress; it's a complex process here. Historically, from 5.5... who has already tested MySQL 8.0? Fine, guys.
01:24
Who has already moved to 5.7 at least? So in 5.5, you know, we delivered just some fixes which were already known. In 5.6 we started to do some deep changes, which brought huge pain because there were many regressions,
01:43
many differences and so on. The most painful point in 5.6 was that writes were faster than reads. In fact, if you wanted to read faster, you needed to send some writes, and they would unlock your reads and you would read faster. It was completely odd, not what you want.
02:00
In 5.7, finally, we fixed reads. Reads became faster, but we started to lose in efficiency. There is always a cost for something; we are always balancing the code to get the most optimal result. So in 8.0 we are fighting, and I think it will be the main fight for many years now,
02:21
for efficiency, to get the best possible performance on the same hardware. We are not running for bigger hardware; you should be more efficient on the same hardware. You will see some flame graphs. There was huge progress in 5.7: you can see here, once we delivered, on the read-only workload, a
02:42
monstrous result, we were so happy. Since then this big machine was upgraded from 72 cores to 96, so we are happy to get more than 2 million queries per second now. So okay, these are real numbers, it's possible. But the difference between 8.0 and 5.7 is not really big.
03:01
And in fact, well, we don't care about this anymore, because we already saw that we can do 2.5 million. So for read-only it's fine, we don't touch this. There will probably be some regressions because we added new stuff. Okay, but our main problems are everything which was remaining from 5.7.
03:24
So I will just add the potential fixes we have here and speak about the problems. On read-only, there are still remaining block locks, for example: if you constantly read the same rows, you will have contention.
03:42
You will just go slower. The workaround here is to use a query cache, the ProxySQL query cache: there is no query cache anymore in 8.0, so ProxySQL is the best solution. A lookup on a secondary index in InnoDB can be many times slower than on the primary key.
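The ProxySQL query-cache workaround he mentions might look like this (a sketch against the ProxySQL admin interface; the rule_id, digest pattern and TTL are made-up example values):

```sql
-- Run on the ProxySQL admin interface (typically port 6032), not on MySQL.
-- Cache result sets of matching SELECTs for 5 seconds (cache_ttl is in ms).
INSERT INTO mysql_query_rules (rule_id, active, match_digest, cache_ttl, apply)
VALUES (10, 1, '^SELECT .* FROM hot_table', 5000, 1);

LOAD MYSQL QUERY RULES TO RUNTIME;  -- activate the rule
SAVE MYSQL QUERY RULES TO DISK;     -- persist it
```

Repeated identical reads then get served from ProxySQL's cache instead of hammering the same rows in InnoDB.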
04:01
So the main workaround is just to use the primary key, while we are working on another solution to speed all this up. There is also the adaptive hash index: you can enable it, but the problem is, as soon as you have writes, it can slow you down. So these are hot topics for us here. And UTF-8 is extremely fast now in 8.0.
04:23
Before, for example in 5.7, you could be 10 times slower if you used UTF-8. In 8.0 it's only 10 to 20% slower than Latin-1, huge progress here. Then double write; this is about the read-write problems.
04:41
Double write is expected to be fixed in 8.0, so I will not speak about it here. I will tell you about the redo log changes we are doing, transaction logging and log management. I will also speak about CATS, which is coming with 8.0. Transaction isolation is still work in progress. There is a huge potential fix here
05:01
on update performance, directly related to the redo log, and we are working on insert performance too. As long as you keep the B-tree cached, for example if you can use partitions or different tables, you can go very fast on insert. Otherwise, well, the B-tree is impacting you.
05:20
And for purge: if purge is lagging in your production, at least with 8.0 you can truncate undo, so you can reclaim the space used by your garbage collection. So, to be short, we touched it,
05:40
we tried to select the most killing problems that we have. The number one killer is the redo log: as soon as you have optimized everything else and you still cannot write the redo log as fast as you want, it blocks you. This is the final bottleneck that we hit, except if you hit some other problems first. If you don't hit them, this is the one.
06:02
For this one, we attacked it and fixed it in 8.0. Another one is every IO-bound workload. In InnoDB we currently have a global lock: as soon as you start to read something from disk, every IO operation takes that global lock.
06:20
So of course it cannot scale, and even as you get faster and faster storage, you cannot go any faster with it. This is related to file-system locking, and also to row locking, where we took a contribution from the University of Michigan;
06:41
I will speak about this later. So, now about the redo log: what happens inside? Well, I suppose you know about transaction commit and how to use the settings: you flush on every commit or not, or once per second. Well, three years ago we discovered that we can go faster
07:01
even if we flush on every commit. In fact, the storage is not the problem; the whole problem is the locking inside. All user threads are fighting to write to the redo log. This was the old model: you see the global lock, which is blocking everybody.
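The per-commit flush behavior mentioned above is the classic innodb_flush_log_at_trx_commit setting; a minimal sketch of the documented values:

```sql
-- 1: write and fsync the redo log at every commit (full durability)
-- 2: write at commit, fsync roughly once per second
-- 0: write and fsync roughly once per second (fastest, least durable)
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
```

The point of the talk is that with the 8.0 redesign, even the strict value 1 no longer makes the redo log a locking bottleneck.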
07:20
In the new model we have a dedicated log writer thread, a log flusher thread, and notification threads around them, a simplified structure. Users are not blocked anymore: they write directly to the log buffer, and in parallel we flush all this data to disk. So in fact we are blocked only
07:41
by your storage performance. The faster you can write, the faster your redo log will work. There is no more explicit grouping, we don't wait; there is natural grouping by your flush speed on the disk. That's all. You will go as fast as your storage can go. This code is extremely well instrumented,
08:03
so you can know exactly how many waits you have, what your threads are doing, what happens inside, and all configuration is dynamic. You can change whatever you want inside and see the effect, even resize your buffers, the log buffer for example, live, or even stop the redo log entirely if you want.
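For example, the redo log buffer became resizable at runtime in 8.0 (a sketch; the 64 MB value is only an example, and disabling redo logging arrived later, in 8.0.21):

```sql
-- Resize the redo log buffer live, without a server restart.
SET GLOBAL innodb_log_buffer_size = 67108864;  -- 64 MB, example value

-- In 8.0.21+ redo logging can even be stopped entirely (e.g. for bulk
-- loading); the server is NOT crash-safe while it is disabled:
-- ALTER INSTANCE DISABLE INNODB REDO_LOG;
```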
08:23
This is a multi-threaded model now, but you have a trade-off: with multiple threads you can never be faster than a single thread just doing it alone. A single thread will do write, fsync, write, fsync, without any waits, without any synchronization. Of course that goes faster
08:41
than the same writes handed between threads. So there is a trade-off, and we were disappointed that an event-driven system cannot be as fast as spinning. Spinning is the most efficient way, anyway. This is painful; probably later
09:00
we will reinvent something more efficient, but currently only spinning lets us go as fast as a single user doing the same actions. So you can imagine what happens here. Well, just to give you an example: if we do nothing, if we don't change the redo log,
09:21
you can see where it's going: the red line is the current 8.0. If we don't do these changes to the redo log, you see, we don't reach these levels. Let me see... this is the highest level that we get with spinning,
09:43
but with spinning, as soon as you eat a lot of CPU, you go down. So in fact you need a balance between spinning and event-driven waits, and what we decided for now is adaptive spinning.
10:01
So in fact, okay, you can at least tune it to say: as soon as I reach this level of CPU usage, I don't want to spin anymore; or as soon as my disk becomes slower than a given response time, I don't spin anymore. In the end it will be just auto-tuned: we will auto-discover what happens on your system, and you won't touch anything.
10:21
But for the first release, well, we preferred to give you some tuning points here. And what we have is this: for the first time we are presenting results with flush-at-transaction-commit set to 1, because before it was never possible; as soon as you used 1, everything was slower. Now it becomes faster.
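The tuning points he refers to are exposed as dynamic system variables in 8.0 (a sketch; the values shown are the documented defaults):

```sql
-- Spin on redo-log events only while server CPU usage stays below these
-- thresholds (absolute number of busy cores, and percent of capacity).
SET GLOBAL innodb_log_spin_cpu_abs_lwm = 80;
SET GLOBAL innodb_log_spin_cpu_pct_hwm = 50;

-- Stop spinning on flush waits once the average wait exceeds this value
-- (in microseconds) and fall back to event-driven waiting.
SET GLOBAL innodb_log_wait_for_flush_spin_hwm = 400;
```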
10:41
So 8.0 becomes faster than 5.7 and 5.6 in the pure OLTP read-write test, and much faster yet in pure update, when you bombard it with a heavy update workload; you will see the difference. What is amazing here is that in 5.7
11:02
we got a huge regression, and 5.7 is slower than 5.6. So in fact we are two times faster than 5.7, fine, but we are also winning back the regression we picked up since 5.6. That regression was related to all the work on read-only improvements and so on.
11:22
So, we are still not scaling. Why? Because the redo log is just the first step, and we have the next layer: all the transaction logging, log management and so on. Just to give you an example of how it should be, this is our prototype
11:42
in the current development version. If 8.0 is not scaling on one CPU socket, on the left side, this is one CPU socket, and here in fact there are two CPU sockets. We can reach 400,000 updates per second,
12:04
which is an enormous number, never seen until now, but well, this work is in progress. I hope it will be delivered soon after 8.0. Now, what about IO workloads? Well, any IO-bound workload
12:23
was blocked by storage until now, but now you have a huge game changer with flash storage, which has become faster and faster. In reality you have a maximum throughput on your flash storage, but your real performance today is driven
12:43
by IO operations per second. This throughput is limited, but you can divide it up, almost like memory access. It's not like before, when we tried to do bigger IOs to read more at once; now you can do small IOs and still reach the same throughput,
13:01
but you will have more operations per second. Especially if you only need a few records from a big block, smaller blocks will give you better performance. But what happens in InnoDB? By default we have a 16K page, okay? So compression is a very popular topic,
13:24
so we say, okay, let's compress. Imagine we can compress four times: we will be able to read four times more, because a 16K page compressed to 4K means that with the same throughput we can read four more pages. Great. The problem is we have exactly the same buffer pool,
13:44
so your memory is not four times bigger. Once you uncompress your data, it's still the same amount of useful data. So you can read faster, but you cannot use your data faster, because before reading more, you need to process what you already have; otherwise why did you read this data, right?
14:02
And what happens if, instead of a 16K page, you just use a 4K page? In reality, for the same memory, you will have four times more useful data, and here it really goes fast. Of course, this works if you need only a few rows from each page, not the whole page;
14:21
otherwise, okay, it should be okay as well. But this whole story was not possible just because we have this global lock, a mutex taken for every IO operation. As soon as you start to read faster, this global lock is blocking you, so you cannot read any faster. The good news is that in 8.0 we fixed this.
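The page size itself cannot be changed at runtime; it is fixed when the data directory is initialized, so it belongs in the server configuration (a sketch; 4k is one of the supported values alongside 8k, 16k, 32k and 64k):

```ini
# my.cnf -- innodb_page_size must be set BEFORE initializing the data
# directory; an existing instance cannot switch page size in place.
[mysqld]
innodb_page_size = 4k
```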
14:46
So to validate these changes, we got the chance to use the latest Intel Optane drive, which alone is able to deliver, you can imagine, with just one single thread
15:00
doing pure IO, reads of one gigabyte per second, so 1,000 megabytes per second, from a single thread. So using two drives like this, in theory, with a 4K page, we are able to do one million reads per second. But is it true in practice?
15:21
And in fact, yes: we are doing more than one million real IO-bound selects per second. It's huge, and it's pure IO-bound point selects here. And there is the same huge jump in updates, because an update is much more expensive:
15:42
we may have to read the page, update the page, write the page, and so on; we are constantly doing IO operations. Well, this is the progress we have expected for a long time, and finally everything is coming together at the same time: we have storage solutions coming with flash from many vendors,
16:03
and we have code which will work with them. And for the last point, I wanted to tell you about CATS. In MySQL it's called CATS; initially it was called VATS. CATS is contention-aware transaction scheduling,
16:21
invented by the University of Michigan and now adopted in MySQL, available since MySQL 8.0. The idea looks pretty simple: in fact, not all transactions are equal. Some transactions will lock more data, some less;
16:42
some objects are more important, some less important. This is a simple schema; I also put links here to read more. It's a very long scientific paper explaining all the logic with many examples and so on. So how to decide? Traditionally
17:01
we just use FIFO: first come, first unlocked. But if you do it in a smarter way, unlocking not the transaction which came first but the transaction which is blocking the most others, you get better performance. But in fact the whole story was much more fun than that,
17:23
because it turned into a real detective story. There was a claim about a huge performance improvement, but as soon as I started to test it, no test showed any difference: no gain, zero, or it just went slower.
17:41
So we started a long investigation and discussion with the guys from Michigan, because the first impression was that they were just kidding, you know, there is nothing real. But we knew MariaDB had already applied the patch, so everything should be working, and still we did not see any results, okay.
18:01
So we started to understand what they wanted to solve, and finally I found a way to build a scenario which shows the difference. Then it was a big win: finally we could see what could be improved. And as soon as we started to study the workload, we started discovering bug after bug in the patch,
18:21
and then it turned into a loop: remastering, fixing, retesting again; do we gain again, did we lose again, another bug is opened, and so on and so on. It took probably some nine months of this, many loops, before everything was fixed.
18:41
And at the end it was still unclear in which case you should use which algorithm. Again, a DBA cannot sit down, look at the workload and say: okay, let's run five minutes with this one, now I switch to the other model again. So finally we came up with a solution
19:00
which detects the problem and then switches between FIFO and CATS according to what is better for your current workload. We just discover how many locks you have and decide what is better. It helps everywhere you have row lock contention.
19:20
How can you recognize it? Just follow your SHOW ENGINE INNODB MUTEX output, and you will see whether you have lock waits on your current workload or not. Well, does somebody never monitor production? Bad question, everybody monitors production, right? So, you monitor this all the time,
19:42
and of course you will see your spikes, and you will see: okay, this is your case, you will be happy with 8.0. Here is an example with a sysbench Pareto distribution, which means artificial contention on the data: many threads, many users will fight
20:02
for the same data. With growing load you can see it: without this algorithm you lose performance as the load grows; with it, at least you keep something, okay? But you should also be realistic: if you write your application so that you lock everything all the time,
20:21
well, you're just a bad programmer, right? You try to avoid locking; by designing your application properly you don't create problems. And the main problems come mostly because in InnoDB we have repeatable-read transaction isolation, okay? We want to move to read committed.
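Switching to read committed is a one-line change per session or server-wide (a minimal sketch):

```sql
-- For the current session:
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Or as the server-wide default (8.0 system variable name):
SET GLOBAL transaction_isolation = 'READ-COMMITTED';
```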
20:42
The problem is that currently this creates even more problems because of transaction locking, but this is the direction we will take. And in fact, even today, in the cases where you can use read committed, you don't see any difference: you use it or you don't, whatever,
21:03
everything is fine. I think once we deliver all the fixes that we want, everything will be transparent and equal; you'll be even happier than before, with this solution as well. So, go ahead, download the stuff,
21:22
and the most important point: have fun, because otherwise what we are doing, our job, is stupid. If you don't have fun, you've missed the point. Thank you, I'm on time, I see.
21:41
There's no time for questions... yes, while I set up Giuseppe, you can ask a question, just go ahead. So, any questions? Yep, yep. Well, in fact, the main problem with double write, you see, is that
22:08
currently you just have a small doublewrite buffer, which is very small, and on top of that there is locking inside. So what is the problem with double write?
22:21
This is the only protection we have today against corrupted pages, and what's amazing is that no storage vendor on Linux can guarantee you that you will not have corrupted pages. There is no such guarantee. So you need to write every page twice, and the only problem is that
22:42
you need to write them sequentially, so your write time will be two times bigger. As soon as you can write two times more in parallel, everything is fine. The new solution just allows you to write faster, in parallel, and because you will have many writes
23:01
concurrently in flight, going together, you will hide this latency, which is increased by two. It's not bigger; it becomes unblocked, in fact. You're then mostly blocked only by storage; currently you're blocked by design. This is the problem. But as soon as NVRAM comes,
23:22
you know, you won't need this on storage anymore, because the problem is that writing twice will wear out your flash device, for example, two times faster. As soon as you have NVRAM, which you can rewrite as many times as you want, and it's battery protected, you don't care anymore: it will just go to memory and it's fine.
23:47
Yeah, but this is the double write: first you write to memory, where it's protected, and then you write to disk. So at least one copy will be safe, so it's fine. Another question, yeah?
24:09
Well, we are working on a full redesign of transaction management. All this locking which happens today, which may kill you... I have a detailed blog post, for example, explaining the story around
24:23
why it can be dangerous. As soon as this locking goes away, you will be just happy with read committed. So, you know, just be patient, it's coming. No question? Okay. Thank you.