
An R-tree index for RocksDB


Formal Metadata

Title
An R-tree index for RocksDB
Series Title
Part
8
Number of Parts
193
Author
License
CC Attribution 3.0 Germany:
You may use, change and reproduce the work or its content, and distribute and make it publicly accessible in unchanged or changed form for any legal purpose, provided that you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
This talk is about implementing an R-tree on top of RocksDB, an LSM-tree (log-structured merge-tree) based key-value store. It gives an introduction to how RocksDB works and why LSM-trees are such a perfect fit to build an R-tree index on top of. Finally, there is a deep dive into the actual R-tree implementation. RocksDB is a popular key-value store by Facebook (based on Google's LevelDB). It supports key and range lookups and basic spatial indexing based on GeoHash, but not an R-tree implementation, which would make multi-dimensional queries possible. Those queries can combine things like location and time, but also any other property that can be represented as a numeric value, such as categories. This makes it possible to query, for example, for all flats of a certain size in a specific area that are not older than a few years and have a balcony.
Transcript: English (automatically generated)
Good afternoon. Welcome to this session. I think it's going to be very interesting. We're going to hear about some developments in spatial databases.
We're going to hear about different NoSQL paradigms and also a little bit of cloud architecture. I'll start with the first speaker, Volker Mische, who doesn't need a lot of introduction, really. He's been contributing to a lot of open source projects, but he's probably best known for his spatial extension for CouchDB.
So, I'm going to hand over to him, and I'll be sitting here, keeping the time. Thank you. Hi, everyone. I'm happy to be here. It's kind of a premiere, because I've spoken at the last five FOSS4Gs about GeoCouch.
This time I won't speak about GeoCouch, but about something pretty similar. The talk will be a bit technical, but still a lot of fun, I think. I already got an introduction, so I'll skip that and go straight to the topic. So, it's about an R-tree implementation for RocksDB.
And I start with RocksDB, because I guess not everyone is aware of what it is. So, RocksDB is a key-value store. And a key-value store is a system where you can store data. What they always support is something called key lookups, which means, for example, you have a billing system
and you store your invoices, and they have a reference ID. And whenever you want to get an invoice out of the system, you just use the reference ID and get it back. Many key-value stores, not all of them, but RocksDB does, support range queries. So, let's say you store weather data and you store the temperature every day.
You can say, give me all the temperatures from the past week. And those key value stores are often a building block for bigger database systems. So, for example, if you have a huge distributed system and the backend is normally a key value store that stores the actual data.
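To make those two lookup types concrete, here is a minimal sketch using RocksDB's public C++ API; the database path, keys, and date range are made-up examples, not something from the talk.

```cpp
#include <cassert>
#include <iostream>
#include <memory>
#include <string>

#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db;
  rocksdb::Options options;
  options.create_if_missing = true;

  // Open (or create) a database; the path is just an example.
  rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/example-db", &db);
  assert(status.ok());

  // Key lookup: store an invoice under its reference ID and read it back.
  db->Put(rocksdb::WriteOptions(), "invoice-0042", "{ ...invoice data... }");
  std::string value;
  db->Get(rocksdb::ReadOptions(), "invoice-0042", &value);

  // Range query: keys are stored in sorted order, so iterating from a start
  // key up to an end key returns e.g. all temperatures of the past week.
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
  for (it->Seek("temp-2016-08-18");
       it->Valid() && it->key().ToString() <= "temp-2016-08-24"; it->Next()) {
    std::cout << it->key().ToString() << " = " << it->value().ToString() << "\n";
  }

  delete db;
  return 0;
}
```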
And of course, the question is: well, there are plenty of key-value stores out there, why is RocksDB so great? It's real open source. By real open source, I mean they don't just publish the source code and ignore the community; they have truly open development. So, you can even see their internal code reviews when they make changes.
You see all the changes they are doing. And you can also contribute yourself. So, you can make a change on GitHub, they will merge it. It can even happen that it breaks the system, which I did in the past week.
But they then just reverted the change and everything was fine again. And they have contributors, both individuals and companies. And what was most surprising to me was that last year they started getting contributions from Microsoft.
And what they do is make sure that RocksDB works well on Windows. And working well is not only about compiling, which is basically all that most projects care about, but also about performance. So, for example, even if there are changes that cause performance regressions on Windows, Microsoft makes sure to fix those issues so it performs well again on Windows.
And as I said, it's a building block for future databases. So, it can already be used as a backend for MySQL, for MongoDB, and Couchbase now has a full text search system, and there it's also pluggable, so there you can also use RocksDB as a backend.
And it's fast. And why is it so fast? The main thing behind this performance is the LSM-tree. LSM-tree stands for Log-Structured Merge-Tree. And it was originally described in a paper in 1996,
and at least as far as I know, it kind of got a bit forgotten for 10 years. One occurrence is then in the Google Bigtable paper, where they describe that they use this data structure for the database, and then Apache Cassandra, which was originally created by people from Facebook, also picked up the idea.
Then in 2011, Google published LevelDB, which is again a key value store, and they use it within Chrome for their offline storage stuff. And then Facebook came and forked LevelDB and created RocksDB out of it.
So, you can see RocksDB as an improved version of LevelDB. And as I said previously, the improvement is not only on the code level, but also on the community level. So, previously, the LevelDB development wasn't that open. It's getting better, but in the beginnings, it wasn't that open.
All right. So, what is a Log-Structured Merge-Tree? It's a data structure, as I mentioned, and the original paper talks about managing tree-like structures. Google did something else. Bigtable, Cassandra, and also RocksDB use a thing called an SSTable. It stands for Sorted Strings Table,
which means you have flat files, and within the file, you always only store the key and the value, and then the next key and the value, prefixed by the sizes so you can easily read it in, and you can also write it very fast. This is how you really store the data on disk, but there's more.
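A rough sketch of that length-prefixed flat layout could look like the following; this is a simplification for illustration only, not RocksDB's actual SSTable format, which additionally has blocks, an index, and checksums.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Write length-prefixed key/value records to a flat file, in the simplified
// spirit of an SSTable: key size, key, value size, value, repeated.
// The caller is expected to pass the entries already sorted by key.
void write_sstable(const std::string& path,
                   const std::vector<std::pair<std::string, std::string>>& sorted_entries) {
  std::ofstream out(path, std::ios::binary);
  for (const auto& entry : sorted_entries) {
    uint32_t key_size = entry.first.size();
    uint32_t value_size = entry.second.size();
    out.write(reinterpret_cast<const char*>(&key_size), sizeof(key_size));
    out.write(entry.first.data(), key_size);
    out.write(reinterpret_cast<const char*>(&value_size), sizeof(value_size));
    out.write(entry.second.data(), value_size);
  }
}
```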
And it's easy to explain with an example. So, in this example, you see potential SSTables, so they are still empty, just small boxes, and we want to add data in there. In this case, to keep it simple, the keys I insert are just numbers, and we don't care about the value.
So, I add the number 8, and let's see if it fits in. Yes, it fits in, so we are done. The 8 is inserted, everything is fine. Now, we add another number, and now we find out, oh, well, the first SSTable is full. So, we just take it out and sort the result,
because as you've heard, SSTable stands for sorted strings, so we sort the result and try to merge it with the next level. The next level is empty, it fits in, we are done, everything is fine. Next, we insert another number, the 5, it fits in. I think we slowly get the idea; we insert another thing,
it doesn't fit in, we sort it. Now it gets interesting, because now we try to merge, and again it doesn't fit in, so we sort the result again, try to merge it with the next level, it fits in, and we are done. So, I think you get the idea; this is how it goes on.
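As a toy model of this cascade (just a sketch to illustrate the idea, not RocksDB's actual compaction code), assume each level holds twice as many keys as the one above, and a full level is sorted and merged into the next one:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Toy model of the example with the numbers: inserts go into level 0; when a
// level is full, it gets sorted and merged down into the next level, which
// may cascade further down.
struct ToyLsm {
  std::vector<std::vector<int>> levels;
  explicit ToyLsm(std::size_t num_levels) : levels(num_levels) {}

  std::size_t capacity(std::size_t level) const { return 2u << level; }  // 2, 4, 8, ...

  void insert(int key) {
    levels[0].push_back(key);
    for (std::size_t l = 0; l + 1 < levels.size() && levels[l].size() >= capacity(l); ++l) {
      std::sort(levels[l].begin(), levels[l].end());
      std::vector<int> merged;
      // Both runs are already sorted, so merging is a single linear pass.
      std::merge(levels[l].begin(), levels[l].end(),
                 levels[l + 1].begin(), levels[l + 1].end(),
                 std::back_inserter(merged));
      levels[l + 1] = std::move(merged);
      levels[l].clear();
    }
  }
};
```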
Now you probably think: that's super complicated just to get things sorted, you could probably do it simpler. So, what are the advantages? The advantage, before coming to the details, is that the data in the intermediate steps
is always sorted, and merging such sorted data is very fast and efficient. I'm going to show an example, and after all those examples, you'll hopefully get the idea why this might be a faster system than others. So, in this case, we again have those numbers, and now we merge two levels from the input: the top part is one level,
the lower part is the other level, and we want to merge them. As you can see, they are already sorted, because, well, they are always sorted. Now, we merge them. So, we compare those numbers, it's one and a four, and we just take the smaller one of those, and this is our result, the smaller value.
Now, we can move on in the input and compare the next two numbers. So, we now compare the three and the four, we use the smaller value as the output, and we move on. So, now we have the four and the five; this time the value from the other input is smaller, we take it into the result,
move on at the bottom, and this is basically how it continues. You always compare the values, use the smaller one, and get a bit further. As you can see, this is kind of a streaming process. So, if you implement it, you don't really need much memory; you can just move on in a streaming fashion and efficiently read all the data and also store it on disk.
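Expressed as code, a streaming merge of two already-sorted runs (here with plain integer keys, as in the example) is just this kind of loop:

```cpp
#include <cstddef>
#include <vector>

// Merge two sorted runs in a single streaming pass: always compare the current
// head of each input and emit the smaller value, exactly as in the example.
// Apart from the output, memory use stays constant.
std::vector<int> merge_runs(const std::vector<int>& a, const std::vector<int>& b) {
  std::vector<int> out;
  out.reserve(a.size() + b.size());
  std::size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    if (a[i] <= b[j]) {
      out.push_back(a[i++]);
    } else {
      out.push_back(b[j++]);
    }
  }
  // One run is exhausted; append whatever is left of the other one.
  out.insert(out.end(), a.begin() + i, a.end());
  out.insert(out.end(), b.begin() + j, b.end());
  return out;
}
```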
All right. Now, this merging is super efficient. What we also get is that those files never change. So, once you've merged the data, we have a static dataset. And a static dataset is always great for building up index structures.
And in our case, we build up an R-tree, and bulk loading such an R-tree from the bottom up is way faster than inserting into a data structure dynamically. We have no in-place updates, which is good, because if there is a bug in the software, you don't overwrite the data;
at worst, some data might be wrong at the end of the file, but not in the middle, because you don't overwrite the file. As I said, the sorting is very fast. And of course, which I haven't shown: if you delete data, you need some cleanup process. But since there is this continuous merging all the time, you can just clean up that stuff during the merging step.
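As a sketch of what such bottom-up bulk loading can look like (an illustration under the assumption that the leaf entries already arrive in a useful order, for example sorted by Z-order; this is not the actual implementation from the talk):

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Rect { double min_x, min_y, max_x, max_y; };

// Grow bounding box `a` so that it also covers `b`.
Rect combine(const Rect& a, const Rect& b) {
  return {std::min(a.min_x, b.min_x), std::min(a.min_y, b.min_y),
          std::max(a.max_x, b.max_x), std::max(a.max_y, b.max_y)};
}

struct Node {
  Rect mbr;                                     // minimum bounding rectangle
  std::vector<std::unique_ptr<Node>> children;  // empty for leaf entries
};

// Bottom-up bulk load: pack consecutive (already sorted) entries into parent
// nodes of fan-out B and repeat level by level until a single root remains.
// Assumes a non-empty input.
std::unique_ptr<Node> bulk_load(std::vector<std::unique_ptr<Node>> level,
                                std::size_t B = 16) {
  while (level.size() > 1) {
    std::vector<std::unique_ptr<Node>> parents;
    for (std::size_t i = 0; i < level.size(); i += B) {
      auto parent = std::make_unique<Node>();
      parent->mbr = level[i]->mbr;
      for (std::size_t j = i; j < std::min(i + B, level.size()); ++j) {
        parent->mbr = combine(parent->mbr, level[j]->mbr);
        parent->children.push_back(std::move(level[j]));
      }
      parents.push_back(std::move(parent));
    }
    level = std::move(parents);
  }
  return std::move(level.front());
}
```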
So, now we finally come to the geo part of it. This was easy to show with numbers, and you can easily imagine how to sort numbers, because you've learned it for ages. But the difficult thing is: how do you sort two-dimensional data?
There's a thing called space-filling curves. There are many different ones, and one of those is the Z-order curve. And I use it in this case because it's very efficient: it's easy to compute, easier than others, but it still gives you quite good results.
So, in this example, we have some data scattered around some space. But for now, we don't really look at the data, we just look at the space. So, the question we now ask is: what's the sort order of this? Is it kind of left to right? Is it top to bottom? Is it somehow different? To answer this question, we first look only at the space.
So, we have just some empty space, and there is some data in there somehow. Now, we divide the space, and you might already get the idea. So, now we have four quadrants, and we draw a line in there. And now it also becomes clear where the name comes from, the Z-order curve.
Because as we now draw this line, we can easily number those quadrants and have our order. So, if we now look at the data in our space, we can see that we can derive an ordering. So, in our case, based on the locations,
A is smaller than C, which is smaller than B. And this way, you can easily sort your spatial dataset. But now, what happens if you add another data point in the same quadrant? You could of course say, well, it's the same quadrant, so D equals B. But this is not very useful.
But what you can do, and this is another nice thing about the Z-order curve, is just split this quadrant again, do the same thing, and now you can see that D is smaller than B along the Z-order curve.
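For coordinates that have already been mapped to unsigned integers, this quadrant numbering is what you get by interleaving the bits of the two coordinates into a single value (a Morton code); sorting by that value gives the Z-order. A common sketch of the computation (an assumption about the encoding, not code shown in the talk):

```cpp
#include <cstdint>

// Interleave the bits of two 32-bit coordinates into one 64-bit Z-order
// (Morton) value. Sorting points by this value yields exactly the quadrant
// ordering from the example: split the space into four quadrants, number them
// along the "Z", and recurse within each quadrant.
uint64_t morton2d(uint32_t x, uint32_t y) {
  auto spread = [](uint64_t v) {
    // Spread the 32 input bits apart so that there is one empty bit slot
    // between each of them.
    v = (v | (v << 16)) & 0x0000FFFF0000FFFFull;
    v = (v | (v << 8))  & 0x00FF00FF00FF00FFull;
    v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0Full;
    v = (v | (v << 2))  & 0x3333333333333333ull;
    v = (v | (v << 1))  & 0x5555555555555555ull;
    return v;
  };
  return (spread(y) << 1) | spread(x);
}
```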
Okay. Now we come to my final example. This is already really some computer science stuff. So, what you've heard so far was more or less an introduction to computer science data structures, and for those who are into the spatial stuff, it probably wasn't that exciting. But now I come to a thing which is interesting because
it is a problem that I hadn't thought of when I wanted to create the LSM R-tree. I was in luck because recently, I think three years ago, there was a paper from a university in California on LSM R-trees. And there, they solved the problem, but it's just mentioned in a small sentence
and in the implementation; it isn't even really explained in the paper. But I think it's a great idea, and I want to present what they're doing. Because the problem is: how do you merge then? So, if you've sorted the data, how do you merge two files? I'll first describe the problem again. So, you have two datasets, and you want to merge them.
You can now, of course, sort them as we did previously. But the problem is that they only have a local sort order. So, within each of those two datasets, you know how the data is sorted. But what is the relation between the datasets?
Is A smaller than B, or is B smaller than A? It's hard to tell. One naive solution would be to just create another space around all your data and sort it again, but that's obviously super inefficient, because you would always need to re-sort the data. This is not what you want to do.
So, yeah, this is basically what I already said. So, it's only sorted locally, and you would need a lot of recomputation. The solution is to just divide all the space you have. Now, you may wonder, well, the space is infinite.
How do we divide an infinite space? The good thing about computers is almost nothing is infinite. And so, there's a solution to this. And I'm pretty sad that I did the presentation already a week ago
because at the code sprint I was talking to someone about floating-point numbers and so on. So, I will probably change it and won't use 64-bit floating-point numbers for my geospatial data structures. But it would work the same if you used integers or fixed-point numbers.
It would work the same because numbers on computers are limited by a certain size. So, in this case, the space is pretty huge, but you just take the whole 64-bit range for floating point numbers and you keep dividing the space.
And so, in this whole space, the full range of numbers the computer can handle with 64 bits, we have the data somewhere. It's so small that I displayed it as a point; there's a lot of data in there, but it's so small relative to the space that you can't really see it, it's just somewhere in there. So, you just divide the space that you currently have and look at it.
So, now you say, okay, it's in the first quadrant, but you can't tell within the data how it is sorted. But what you do is, you just look at this quadrant, divide it. Now it's in the third quadrant, so we keep dividing and keep zooming in. And so, now we zoomed in the third quadrant, we still can't really see the data
because it's so small and the space is still so big. So, we keep doing this, and in the end we have zoomed in so closely that, on the final division, we have a sort order. So, the idea behind it is that you look at the full space and kind of zoom into your data,
but as all your data uses the same Z-order curve, you can now easily merge those two datasets together.
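A sketch of that "keep dividing the full space" comparison for two points, under the assumption that the coordinates have already been mapped to order-preserving 32-bit unsigned integers (how to map 64-bit floats is the separate issue mentioned above):

```cpp
#include <cstdint>

struct Point { uint32_t x, y; };

// Compare two points by repeatedly subdividing the full coordinate space into
// quadrants (the "zooming in" from the talk) until the points fall into
// different quadrants; the quadrant number along the Z then decides the order.
// Returns -1, 0 or 1. With this comparison, entries coming from two
// independently sorted files become globally comparable and can be merged.
int z_compare(Point a, Point b) {
  for (int bit = 31; bit >= 0; --bit) {
    uint32_t quadrant_a = (((a.y >> bit) & 1u) << 1) | ((a.x >> bit) & 1u);
    uint32_t quadrant_b = (((b.y >> bit) & 1u) << 1) | ((b.x >> bit) & 1u);
    if (quadrant_a != quadrant_b) return quadrant_a < quadrant_b ? -1 : 1;
  }
  return 0;
}
```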
So, this was a lot about data structures and all that stuff. One reason is that I thought I would code this within a week, and if a developer says it takes a week, it will probably take a month. I've spent two months. It kind of works, but not really well. So, the current state is: I think I now have a good understanding of RocksDB, and it turns out the internals are way more complex than I thought.
And it's a free-time project, so I can't spend that much time on it. There's some code already uploaded, so it should compile, but you can't really query it yet, I think. But I really want to keep it going. Obviously, a question you could ask is: is this whole thing a good idea at all?
Why doesn't Facebook already do it? The good thing is, I was at the RocksDB meetup last year in the US, and I asked a few core developers, and one of them said that he thinks it is a good idea and I should try it, and yeah, we'll see how it turns out.
So, for the future, the best outcome obviously would be if my changes were merged upstream into RocksDB and it would then be supported by them, and you could store spatial data in there. Still good would be if I can convince the RocksDB core team to at least change some APIs,
so I can just make an add-on to it and don't need to change RocksDB itself. Still OK would be an easy-to-maintain fork: let's say I need some internal functions that they don't want to expose publicly, I can just call them and have a simple fork. And of course, the worst case is
that they don't care about it at all and I need to stay on a full fork, but well, that's OK as well; then that's how it is. All right, so that's all I have for now. You can check out the code, or if you have any ideas, or better ideas on how I should solve the problem, or want to help, let me know.
Thanks for your attention. So, thank you very much for this interesting talk. I know we have time for a few questions, actually we've got some minutes. OK, first question.
Yeah, OK, so I'll repeat the question so everyone in the stream can also hear it: the question was whether I also looked into the Hilbert curve, and if I did, whether there were performance issues. I did, and the reason why I looked into the Z-curve only is that there's a paper about
query-adaptive, sorry it's called query-adaptive sorted bulk loading or something, and they basically did the work and they compared a Z-curve with a Hilbert curve, and they said that the Hilbert curve creates a better data structure,
kind of, but it's not worth the performance overhead, because the Hilbert curve is way more complex to compute, and the other reason is that if you go into higher dimensions, the Z-order curve is still very simple to compute, and the Hilbert curve gets really complex once you get into more than four dimensions.
That's the reason for the Z-order curve. OK, and any more questions? I just like to ask, do you have some benchmarks of this algorithm? Yeah, so I haven't done any benchmarking, because it's hardly working yet.
Of course that will come, but from what I think, the LSM approach with the static parts is so well suited for R-trees and multi-dimensional data that I would expect it is probably the fastest thing you can get.
It might not be super fast, but I don't think you can get much faster, if it's properly implemented. We'll see how it works out, because RocksDB, I think, does a really good job on the performance side, and the benefit, which is also why I chose RocksDB, is, as I said, that it has such a big community, so you can expect future performance improvements,
and basically get them for free if they improve something. And also, RocksDB really is run at Facebook in production, so it's not like some toy project from someone; they really run parts of their infrastructure on RocksDB, so it's a really well-working database.
Okay, so if there are no more questions, let's thank the speaker once more.