LSM-Trees and B-Trees: The Best of Both Worlds


Formal Metadata

Title: LSM-Trees and B-Trees: The Best of Both Worlds
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
LSM-Trees and B-Trees are the two primary data structures used as storage engines in modern key-value (KV) stores. These two structures are optimal for different workloads; LSM-Trees perform better on update queries, whereas B-Trees are preferable for short range lookups. KV stores today use one or the other. However, for modern applications with increasingly diverse workloads, limiting KV stores to utilize only one of the two designs leads to a significant loss in performance. We propose a novel method of online transitioning a KV store from an LSM-Tree to a B-Tree and vice versa. This allows KV stores to smoothly adapt to changing workloads and use the optimal data structure as the workload changes.
In our work we examine transitions between LSM-trees and B-trees. In doing so, we use cost models to decide which sort of transition to execute, and we also propose gradual transitions so that we can keep our system online while the transition is happening and avoid incurring too much latency on queries that arrive during it. So first, let's take a look at
transitioning from an LSM-tree to a B-tree. How do we actually move the data from one structure to another? We propose two methods of doing this.

The first we call the top-heavy method. In this method we simply take all the entries in our LSM-tree, merge and sort them into a compact array on a persistent storage device, and use that to build the B-tree: a relatively simple way of getting all the data into a format that is compatible with the tree.

Another option, as you can imagine, is to take the lowest level of our LSM-tree and, without actually moving that data, construct the B-tree in memory on top of this lowest level. We then take all the entries from the upper levels of the LSM-tree and insert them in a batched fashion into the B-tree we just constructed. We call this the bottom-heavy approach, and it is more efficient when the data in the LSM-tree is structured in a bottom-heavy way, because most of the data is already in the lowest level and we do not have to move it to conduct the transition. Let's take a closer look at
our top-heavy transition. At each step of this transition, we take the entries with the lowest keys from our LSM-tree, lift them into memory, condense and sort them, and then write the resulting page (or set of pages) to disk as part of the B-tree. The amount of data we move at each step is a configurable parameter, and we will talk more about that later. We continue this process, loading the next set of entries with the lowest keys into memory and appending them to the end of the leaf level of our B-tree, until all the data in our LSM-tree is consumed and we have a complete B-tree.

The bottom-heavy approach starts with an LSM-tree and, as mentioned earlier, turns the lowest level into the B-tree without actually moving the data on disk. We then look at each page in the leaf level of this B-tree, lift it into memory, and check whether the key ranges of the entries in the upper levels of our LSM-tree overlap at all with the data we just read from the B-tree. If they do, we merge them in a very efficient batched manner while the page is in memory. Once we have full, dense nodes constructed, we write them back to disk. We repeat this until we have transferred all the data from the upper levels of the LSM-tree into the lowest level of the B-tree. This is really nice because we can avoid rewriting a lot of the data in the lowest level, as long as we don't have too many entries in the upper levels. So how do
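The top-heavy step loop described here can be sketched as follows. This is a minimal illustration, not the talk's implementation: `top_heavy_step`, the entry layout as key-value tuples, and the flat list of leaf pages are all simplifying assumptions.

```python
def top_heavy_step(lsm_entries, btree_leaves, step_size):
    """One step of the (hypothetical) top-heavy transition: pull the
    `step_size` smallest keys out of the LSM-tree, sort them in memory,
    and append them as a packed page to the B-tree leaf level."""
    # Find the globally smallest entries across all LSM levels.
    lsm_entries = sorted(lsm_entries, key=lambda kv: kv[0])
    batch, rest = lsm_entries[:step_size], lsm_entries[step_size:]
    # Appending in key order keeps the leaf level dense and sorted,
    # so interior B-tree nodes can be built on top of it afterwards.
    btree_leaves.append(batch)
    return rest

# Repeat until the LSM-tree side is fully consumed.
entries = [(5, "e"), (1, "a"), (3, "c"), (2, "b"), (4, "d")]
leaves = []
while entries:
    entries = top_heavy_step(entries, leaves, step_size=2)
```

The `step_size` parameter plays the role of the configurable per-step data amount mentioned above: larger steps finish the transition sooner but hold more data in memory at once.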
we actually choose which one of these transitions to execute? We developed cost models for these transitions by looking at the amount of disk I/O necessary to translate the data from one structure to the other, and we found a fairly elegant condition: if the ratio of data in the upper levels of the LSM-tree to the amount in the lowest level is less than some threshold, then the bottom-heavy approach is better. This matches the intuition that if the amount of data in the lowest level is relatively large, we should do the bottom-heavy approach. In this expression, E is the size of each entry in our key-value store, φ is the write cost, and P is the page size of our persistent storage device.

In order to do this transition gradually, we start with an LSM-tree, and while the transition is underway we maintain an intermediate structure: a partial LSM-tree, a partial B-tree, and a pivot key. As queries arrive during the transition, we compare the key against the pivot to direct the query to the appropriate intermediate store, and we are done once all of the LSM-tree is consumed.

This gives us an interesting trade-off between the number of steps and the cost of each step. Using our cost model we derived a curve, and the nice thing it shows is that we can amortize the cost over an arbitrary number of steps. So we can say, for example, that we want the transition to take an hour or a day, or that the extra latency of our key-value store should not exceed 1 millisecond, and from that compute the proper number of steps so that the transition does not add too much overhead.

OK, now let's look at the transition from a B-tree to an LSM-tree. Here we have a
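The two decisions here, which transition to run and where to route a query mid-transition, can be sketched like this. The closed-form threshold in E, φ, and P is not reproduced from the talk, so it is left as a parameter, and all names are illustrative.

```python
def choose_transition(upper_bytes, lowest_bytes, threshold):
    """Cost-model rule: if the ratio of data in the upper LSM-tree
    levels to data in the lowest level is below a threshold, the
    bottom-heavy transition (which leaves the lowest level in place)
    moves less data and wins. `threshold` stands in for the talk's
    expression in entry size E, write cost phi, and page size P."""
    if upper_bytes / lowest_bytes < threshold:
        return "bottom-heavy"
    return "top-heavy"

def route_query(key, pivot, partial_btree, partial_lsm):
    """During a gradual transition, keys below the pivot have already
    been migrated to the partial B-tree; keys at or above it are still
    in the shrinking partial LSM-tree."""
    store = partial_btree if key < pivot else partial_lsm
    return store.get(key)
```

Here plain dicts stand in for the two partial stores; the point is only the pivot comparison that keeps both structures queryable while the transition is underway.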
B-tree with its data stored in a persistent layer, in fragmented B-tree leaf nodes. A costly approach would be to read all of this data from the leaf nodes, write it out into a contiguous region in our persistent storage, and then use this contiguous region as the bottom run of an LSM-tree. We would construct a Bloom filter, the indexing structure for an LSM-tree, over this merged data, and any subsequent inserts would simply be stored on top of it. This requires that we read and rewrite all the data on disk, so it is pretty costly: on the order of 2N, where N is the size of the data.

Another approach is to take our B-tree and, instead of using the traditional B-tree storage mechanism, additionally maintain the indexing structure we use for LSM-trees, a Bloom filter. Instead of reading and rewriting all the data from the leaf nodes, we create an in-memory mapping from logical page IDs to the B-tree's physical pages. These are pointers that map to the fragmented pages, an in-memory indirection that tricks the LSM-tree algorithm. We build the LSM-tree on top of this, with the Bloom filter mentioned earlier over it, and we essentially have an LSM-tree with an interoperable data format that we can use until the first large merge. This reduces our disk I/O cost from 2N to 0; all we incur is a minimal downside of a memory footprint and slightly slower long-range searches and first merges, by just a small factor. We think this is a reasonable trade-off because we anticipate that not many range queries should occur in this case, since we are transitioning to an LSM-tree precisely when we have an update-heavy workload. Once the first merge has happened, we switch to a regular LSM-tree and no longer incur this extra cost on long-range searches. This is our
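The in-memory indirection can be sketched as a simple logical-to-physical page map. The class and its layout are hypothetical, assumed here only to show how scattered B-tree leaves can be presented to the LSM-tree algorithm as one sorted run without rewriting any data.

```python
class PageIndirection:
    """Sketch of the zero-copy B-tree-to-LSM-tree transition: keep an
    in-memory map from the logical page IDs the LSM-tree expects
    (0, 1, 2, ... as one contiguous sorted run) to the physical
    locations of the existing, fragmented B-tree leaf pages."""

    def __init__(self, leaf_locations):
        # leaf_locations: physical offsets of the B-tree leaf pages,
        # listed in key order.
        self.logical_to_physical = dict(enumerate(leaf_locations))

    def read_page(self, logical_id, disk):
        # The LSM-tree reads a logical page ID; we redirect the read
        # to the scattered physical page, so no data is rewritten
        # (disk cost 0 instead of 2N). The indirection is dropped
        # once the first full merge produces a truly contiguous run.
        return disk[self.logical_to_physical[logical_id]]

# Toy "disk": physical offset -> page contents. The leaves sit at
# fragmented offsets but are listed to the mapping in key order.
disk = {4096: "keys 0..99", 1024: "keys 100..199"}
indirection = PageIndirection([4096, 1024])
first_page = indirection.read_page(0, disk)
```

The memory footprint of the map, one entry per leaf page, is the "minimal downside" mentioned above, traded for avoiding the 2N rewrite.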
implementation of our transitioning data structure. Here you can see that as we move from a write-heavy to a read-heavy workload, we transition from an LSM-tree to a B-tree, and this transition is done gradually, so the latency evens out. The transition from the B-tree back to the LSM-tree behaves as described, and its overhead is minimal. We have achieved our best-of-both-worlds algorithm. Thank you so much for listening.

