MyRocks in the Wild Wild West!
Formal Metadata

Title: MyRocks in the Wild Wild West!
Number of Parts: 490
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46947 (DOI)
Transcript: English (auto-generated)
00:08
So please wait until the end of the questions before leaving. Thank you, everyone, for coming today. We're going to talk about MyRocks in the wild, wild west. And to start the talk, if we haven't met, please add me on Twitter or follow me.
00:27
My name is Alkin and I work at Percona, in the managed services department, as do a few other people over here. On today's agenda, we're going to introduce MyRocks and talk about some of the background and some notes.
00:42
We won't get into too much detail. The slides will be available, and I will provide a lot of links, blog posts and information. If you want to ask questions later, you can reach out to me or to some of the people in the references. So what is MyRocks? MyRocks is a storage engine, a hidden gem in MySQL as I call it, because it has existed for a while.
01:02
And not many people know about it or use it. It's basically a key-value store that has been added to MySQL. It's based off of RocksDB, which was itself a fork of LevelDB. So these technologies have existed for a while; RocksDB was forked, found its own use cases and started being used.
01:24
Of course, this was needed for some large operations, and we know it has been deployed at Facebook. Others also use it, either as RocksDB or as a MyRocks distribution. There's a link provided to a new blog post that identifies a few areas of interest that actually utilize RocksDB.
01:44
Initially it was just source code. Then it picked up some attention and was brought to the attention of the MySQL product providers other than Oracle, like Percona and MariaDB, and released.
02:01
So how do we get MyRocks? We get it with Percona Server or MariaDB. Percona Server introduced it in 2017, and it is fully supported in the 5.7 and 8.0 releases. It has also been available for a while in MariaDB, from the 10.2 and 10.3 and later distributions.
02:24
Basically, it comes built in. You don't need to do an extra download or anything like that. If you download Percona Server 8.0 (I think 8.0.18 is the latest version), it will come along with that. There are some differences between the distributions.
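As a rough sketch of what getting started looks like (the schema and table names here are made-up examples, not from the talk): on Percona Server the documented path is the ps-admin helper script, while on MariaDB the plugin can be loaded from SQL.

```sql
-- On Percona Server, MyRocks ships with the server but is enabled with a
-- helper script run from the shell:  ps-admin --enable-rocksdb -u root -p
-- On MariaDB 10.2+/10.3+, the plugin can be loaded directly:
INSTALL SONAME 'ha_rocksdb';

-- Confirm the engine is registered.
SHOW ENGINES;

-- Create a first MyRocks table ('demo.events' is a placeholder).
CREATE TABLE demo.events (
  id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  payload VARCHAR(255) NOT NULL,
  PRIMARY KEY (id)
) ENGINE = ROCKSDB;
```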
02:42
One main difference is the Facebook distribution: Facebook doesn't use the mainstream version of the RocksDB/MyRocks implementation. Percona Server has some different implementations for the compression algorithms, and there are some minor differences in the data file locations.
03:04
The other thing that I wanted to highlight here is gap lock detection. This is something major in the way the engine works. Percona Server and the Facebook build support it — Facebook doesn't actually use it, and Percona Server errors out — while in MariaDB there is no gap lock detection in the MyRocks engine.
03:23
If you want to read about the details, there will be some links, which we won't get into now. Okay, going back to the key-value store. So we have the B-tree engine versus the LSM-tree style, which MyRocks utilizes. I actually had to put up this famous picture and the link over here.
03:44
Please have a read. The way that it works is that data streams of key-value pairs are written into the memtables, which are in memory. Then they're flushed to disk in a leveled format, and compaction happens at each level to create a smaller number of larger files.
04:06
That's the main difference between the B-tree and the LSM tree: the B-tree handles the data stream slightly differently than the LSM. The picture on the right is what MyRocks actually utilizes.
04:20
So in short, the B-tree is mostly read-optimized: as we know, it utilizes a buffer pool and writes happen in place. In the LSM tree, the writes go into sorted string tables that are written sequentially, and then compaction happens in the background using the algorithms we mentioned above.
04:44
You can choose your algorithm, and then you have fast access to the data. Continuing on InnoDB — not to undermine what InnoDB can do: InnoDB is a very powerful storage engine and can be used in other areas such as the Galera library, right?
05:14
So if we go back to MyRocks and its components, I'm going to skip this and show you how a write request is handled.
05:33
Basically we have the memtables and the write-ahead log, which go along together, and then the leveled LSM tree structure that actually does the data writing, and
05:46
the compaction happens in the background. For logical partitioning we also have column families. When a write request comes in, the active memtables are written immediately, along with the write-ahead log.
06:03
Then they're flushed to the sorted string (SST) files, and the compaction happens at that level. This allows fast writes and keeps sorted files available on disk. So we have a concept of logically partitioning the data with the column families.
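As an illustration of column families (the family and table names here are invented for the example): in MyRocks every index lives in a RocksDB column family, and you pick one per index through the index comment, with the default column family used when you don't.

```sql
-- Each index is placed in the column family named in its index comment;
-- the 'rev:' prefix asks for a reverse-ordered column family.
CREATE TABLE demo.user_actions (
  user_id   BIGINT UNSIGNED NOT NULL,
  action_at DATETIME        NOT NULL,
  action    VARCHAR(64)     NOT NULL,
  PRIMARY KEY (user_id, action_at) COMMENT 'cf_user_actions',
  KEY idx_action_at (action_at)    COMMENT 'rev:cf_recent_first'
) ENGINE = ROCKSDB;

-- See which column family each index ended up in
-- (ROCKSDB_DDL is the mapping table MyRocks exposes in information_schema).
SELECT * FROM INFORMATION_SCHEMA.ROCKSDB_DDL
 WHERE TABLE_NAME = 'user_actions';
```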
06:26
The memtables allow the writes to happen fast, while we also have a log of them written to storage, and then they get flushed to the table. They're limited to 64 megabytes and are optimized for fast reads and writes.
06:44
The WAL is the write-ahead log, which we can compare to InnoDB's redo log. When a write happens, it is also written into the log, and this helps the crash recovery of the engine.
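The WAL sync behaviour is tunable, much like InnoDB's redo log flushing. A minimal sketch, assuming the stock MyRocks variables and that the value semantics roughly mirror innodb_flush_log_at_trx_commit (check your distribution's docs for the exact meaning of each value):

```sql
-- How aggressively the write-ahead log is synced on commit:
-- 1 syncs the WAL on every transaction commit (safest),
-- 2 and 0 relax that in exchange for throughput.
SHOW GLOBAL VARIABLES LIKE 'rocksdb_flush_log_at_trx_commit';
SET GLOBAL rocksdb_flush_log_at_trx_commit = 1;

-- Where the WAL files live (empty means the data directory).
SHOW GLOBAL VARIABLES LIKE 'rocksdb_wal_dir';
```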
07:06
The read request works the opposite way: MyRocks first tries to retrieve the information from the memtables, which are similar to the buffer pool in InnoDB, and then, with the indexes and the Bloom filters, it can access those leveled and compacted data files and retrieve them quickly.
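Bloom filters are part of the per-column-family table options, normally set at startup. A sketch, assuming the usual RocksDB option-string syntax and the ROCKSDB_CF_OPTIONS information_schema table (with its OPTION_TYPE column) that MyRocks exposes:

```sql
-- In my.cnf (shown as a comment; quoting may be needed):
--   rocksdb_default_cf_options=block_based_table_factory={filter_policy=bloomfilter:10:false;whole_key_filtering=1}
-- 10 is bits per key; 'false' requests full (not per-block) filters.

-- Inspect the filter settings each column family is actually running with.
SELECT * FROM INFORMATION_SCHEMA.ROCKSDB_CF_OPTIONS
 WHERE OPTION_TYPE LIKE '%FILTER%';
```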
07:24
The level compaction is also targeted at different data sizes: as compaction happens, each level gets larger and larger. Fewer levels is better, but if the data size is big, you actually have fewer files that are already sorted and merged into different files.
07:49
There are more details in Yoshinori's tutorial over here; there's a link. It's a very detailed explanation that he made. And there have been some improvements around compaction.
08:03
The files are aligned with the operating system units, and in Percona Server there have been some tests and benchmarks made on different algorithms; LZ4 was selected as the medium-level algorithm.
08:21
And then other compression methods are also available. Among the other components over here, the one I wanted to highlight is the column families. They allow the partitioning of the data, and the column family values also allow us to configure MyRocks properly, depending on the data set.
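Compression is one of those per-column-family settings. A sketch of a common layout (cheap LZ4 for most levels, heavier ZSTD only for the bottommost one), with the column family name reused from the earlier made-up example:

```sql
-- In my.cnf (comment only), applied to every column family by default:
--   rocksdb_default_cf_options=compression=kLZ4Compression;bottommost_compression=kZSTD
-- Some column-family options can also be changed at runtime, per CF
-- (here the 64 MB memtable size mentioned above, in bytes):
SET GLOBAL rocksdb_update_cf_options =
  'cf_user_actions={write_buffer_size=67108864}';

-- Check what is currently in effect per column family.
SELECT * FROM INFORMATION_SCHEMA.ROCKSDB_CF_OPTIONS
 WHERE OPTION_TYPE LIKE '%COMPRESSION%';
```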
08:44
There is a whole set of configuration parameters, which we will not get into today, but you can have more control over the data and how it's being accessed. So there are some advantages on disk for the LSM side versus InnoDB.
09:06
There's a lower write penalty and reduced fragmentation, and so this is advised as a good fit for write-heavy workloads.
09:23
Also, there are some advantages on flash storage: with the compaction and the compression you actually save space, and there is lower write amplification when writing on flash. We mentioned gap locks, and there is row locking available.
09:43
It's only READ COMMITTED and REPEATABLE READ, and MyRocks doesn't support gap locks the way InnoDB does. For the replication side, it has to be row-level, and then we'll have large binlogs.
10:01
And once again, statement-based replication will cause gap locks, but you can still use MyRocks on the slaves. So this is the high level of the MyRocks engine. And then how do we actually mix and match? It is not very much advised to mix the InnoDB engine and the MyRocks engine on the same server.
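In practice that means a MyRocks server or replica runs with row-based logging and one of the two supported isolation levels; a minimal sketch using stock MySQL settings:

```sql
-- Row-based binary logging, since MyRocks does not provide the gap locks
-- that statement-based replication would rely on.
SET GLOBAL binlog_format = 'ROW';

-- Only READ COMMITTED and REPEATABLE READ are supported; READ COMMITTED
-- is the common choice.
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
```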
10:24
But if you do, XtraBackup still works — only on 8.0, with the later versions of XtraBackup — and it is optimized to take a backup of both InnoDB and MyRocks at the same time. But there are no partial backups.
10:42
So you could be writing to InnoDB and writing to MyRocks, maybe for a different application or a different data model, but you can't say, OK, I just need a backup of MyRocks. If you take the backup, you have to back up the whole thing, so you have to design it that way.
11:00
MariaDB's Mariabackup also supports it from 10.2 and later versions, and it doesn't support MyRocks partial backups either. There's also the myrocks_hotbackup tool, which was originally designed to back up RocksDB using the checkpoint and the log file.
11:23
But that's only for MyRocks. So there is a trade-off between using XtraBackup, Mariabackup or myrocks_hotbackup, and you have to be cautious about what your recovery method is.
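The checkpoint that myrocks_hotbackup is built around can also be triggered by hand from SQL. A sketch, with the directory path being just a placeholder:

```sql
-- Ask MyRocks to create a consistent checkpoint (hard links to the current
-- SST files) in a directory of your choice.
SET GLOBAL rocksdb_create_checkpoint = '/backups/myrocks_checkpoint';

-- The streaming tool wraps the same idea; its shell invocation looks
-- roughly like (comment only):
--   myrocks_hotbackup --user=backup --port=3306 \
--     --checkpoint_dir=/backups/checkpoint --stream=tar > backup.tar
```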
11:45
There is also mysqldump, and you can basically dump the data from MyRocks. And right now, snapshots are kind of difficult when you're mixing the engines.
12:02
On the MyRocks side, you have to have the checkpoint and the log to be able to get to that point of recovery. Going on to other tool compatibility, let's say we have the MyRocks engine in our ecosystem. PMM is the Percona Monitoring and Management utility.
12:21
It has built-in dashboards. What that means is that the system and catalog information — the information_schema data you can gather, the SHOW ENGINE ROCKSDB STATUS output — is all there. So we consider that tool compatible: you can actually monitor, analyze and record the events.
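The raw data behind those dashboards is easy to look at by hand; a few queries, assuming the status counters and information_schema tables that MyRocks ships:

```sql
-- Per-level compaction stats, memtable sizes, cache usage and more.
SHOW ENGINE ROCKSDB STATUS\G

-- The same information as plain counters and tables.
SHOW GLOBAL STATUS LIKE 'rocksdb%';
SELECT * FROM INFORMATION_SCHEMA.ROCKSDB_CFSTATS LIMIT 10;
SELECT * FROM INFORMATION_SCHEMA.ROCKSDB_DBSTATS LIMIT 10;
```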
12:45
XtraBackup is supported; the online schema change is partially supported, because of READ COMMITTED; and pt-table-checksum and pt-table-sync you can't use, because of the binlog format requirements — they rely on statement-based replication, while MyRocks needs row-based.
13:03
So that's the tooling compatibility. You need to plan ahead of time, if you're actually going to test MyRocks, how your tooling, automation and other stuff will work. Okay — benchmarks aren't the subject of this talk,
13:25
but as a reference, we wanted to have an example over here. I will not go into too much detail. When these tests were run, we were on 8.0.16; now we're on 8.0.18 for the engine.
13:41
There have been some optimizer changes lately, and we believe some numbers might be skewed because of that. I am from Percona, and Percona's engineering department is working on new benchmarks, which I will also work on with some other friends over here to analyze.
14:01
But we can see that as the thread counts get higher, InnoDB gets kind of saturated, while on the IO-bound workload RocksDB excels in the sysbench TPC-C test.
14:21
I'm sorry — these are writes. And then we also have read-and-write tests over here. Same thing: you can see that RocksDB, at 64 threads and more, excels over the InnoDB engine.
14:43
Like I said, we will share more benchmarks and details about the latest developments with the latest version we have, and then compare some of the data. But the highlight of the RocksDB and MyRocks engine
15:01
is write optimization for larger data sets. So why would you use MyRocks? Because nowadays most of our clients are in the cloud. They pay per disk and per disk space. So your costs are not going to be just,
15:23
OK, I have these storage units, they're going to run in the background, and I can use them — the costs grow as you provision more. There's a link over here to another blog post with a comparison. The chart is very small, but it shows it's cost-efficient for the cloud,
15:40
as you can utilize different compression algorithms to get more out of it than with InnoDB. I mean, there are some numbers that Yoshinori actually shared earlier comparing InnoDB compression versus the compaction
16:04
that RocksDB provides — it's much higher. So that way, simply, you can save costs in the cloud or on your storage. In conclusion, we think it's more suitable for big data sets.
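To put your own numbers on that, a quick footprint check per engine from information_schema is usually enough (the schema name is a placeholder; the sizes reported are approximations from each engine):

```sql
-- Rough on-disk footprint per storage engine for one schema.
SELECT ENGINE,
       ROUND(SUM(DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024, 1) AS total_mb
  FROM INFORMATION_SCHEMA.TABLES
 WHERE TABLE_SCHEMA = 'demo'
 GROUP BY ENGINE;
```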
16:22
If you have a lot of indexing — and again, repeating over here — we think it's much more suitable for write-intensive workloads. So basically, data with a read-write style of application, where you're
16:41
writing and reading quickly — that might be suitable for it. The other thing I wanted to mention, as I said earlier, is that RocksDB as an engine, as a technology, is used in some other areas. As far as we know, they are not like the MyRocks implementation of the engine.
17:02
But the underlying technology that's in RocksDB is being used in several other implementations. There's a blog post and a link at the end of the slides. And I also wanted to thank Yoshinori, Vadim, Sveta
17:21
and Mark, especially. They've put a lot of work into RocksDB. Yoshinori is over here, if you have any questions. There has been extensive research and development to come to this level, where Facebook is running with the migration to RocksDB. And thank you.
17:47
OK. Are there any questions? Can you repeat the question? You have to repeat. Oh, I have to repeat. OK.
18:00
The mic — oh, because of the mic. Sorry. OK. The question is about the comparison between MyRocks and TokuDB. We initially made some comparisons, and I will have to look those up. TokuDB is also a similar technology.
18:22
There are similar use cases. But all the data that we had, or that I found, was outdated, so there is really no good comparison. And as MyRocks is being accepted,
18:40
with both Percona Server and MariaDB Server having adopted it, and it's in production in one of the largest web-scale MySQL topologies at Facebook, we don't find too much value in going back and doing testing on TokuDB right now. So it's more about InnoDB and MySQL 8 comparisons,
19:02
where you actually have an option: do I actually benefit from the RocksDB/MyRocks implementation? Any other questions? — You said you're using Bloom filters? — Yes. — I mean, most people today are moving to quotient filters, which are much more efficient and faster; Facebook just published an implementation
19:22
in the Folly library. Maybe you could try that. I mean, it's an approximate dictionary, but it's much more efficient than a Bloom filter. — OK, so it's not really a question, it's more of a suggestion and commentary. OK, so yeah, I don't know much of the details on that,
19:40
but we'll have to look at it. What was the question? — The suggestion is to use quotient filters, which are a better approximate dictionary than the filters Facebook uses now. — OK, so it's more of a suggestion to use a different filter than a Bloom filter; that's the question.
20:01
Yes. — How are updates and deletes handled? — OK, so how are updates and deletes handled? Basically, we had a slide, I think, over here. Let me see.
20:24
Let's look. We still have time? Yeah. OK. OK. So basically, the way that it works
20:41
is that the memtables are handling the writes. And with the sorted strings, when there's an update you have the merging. So the writes are coming in, and the updates are coming in for the data, and as you've already compacted and leveled
21:03
the data into the files, those sorted strings, it basically gets flushed and redone in memory. So it's an ongoing operation, as far as I know how it's handled. — So it needs to be re-merged later?
21:21
Yes. Any other questions? (Inaudible audience question.) Right, good question: does Percona have gap locking?
21:40
It doesn't have gap locking; it has gap lock detection. So it actually detects that there's a gap lock — that there could be a gap lock. Yes, yes.
22:02
It errors out, yes. Well, you can't really mix and match — it's not supported — but it detects that there's gap locking.
22:56
It's the detection on range updates.
23:03
OK, thank you. Thank you. Thanks for the question.