LSM-Trees and B-Trees: The Best of Both Worlds
Formal Metadata
Title 
LSM-Trees and B-Trees: The Best of Both Worlds

Title of Series  
Author 

License 
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2019

Language 
English

Content Metadata
Subject Area  
Abstract 
LSM-trees and B-trees are the two primary data structures used as storage engines in modern key-value (KV) stores. These two structures are optimal for different workloads: LSM-trees perform better on update queries, whereas B-trees are preferable for short range lookups. KV stores today use one or the other. However, for modern applications with increasingly diverse workloads, limiting KV stores to only one of the two designs leads to a significant loss in performance. We propose a novel method for transitioning a KV store online from an LSM-tree to a B-tree and vice versa. This allows KV stores to smoothly adapt to changing workloads and use the optimal data structure as the workload changes.

00:04
In our work, we examine transitions
00:07
between LSM-trees and B-trees. In doing so, we use cost models to decide which sort of transition to use, and we also propose gradual transitions so that we can keep our system online while the transition is happening and not incur too much latency on queries that occur during it. So first, let's take a look at
00:29
transitioning from an LSM-tree to a B-tree. How do we actually move the data from one structure to another? We propose two
00:37
methods of doing this, the first of which we call the top-heavy method. In this method, we simply
00:42
take all the entries in our LSM-tree, merge them together, and sort them into a compact array on a persistent storage device, then use that to build the B-tree: a relatively simple way of getting all the data together into a format that is compatible with the B-tree. Another option, as you can imagine, is that we take the lowest level of our LSM-tree and, without actually moving that data, construct the B-tree in memory from this lowest level. Then we take all the entries from the upper levels of the LSM-tree and insert them in a batch fashion into the B-tree we just constructed. We call this the bottom-heavy approach, and the reason is that it is more efficient when the data in the LSM-tree is structured in a bottom-heavy way: because most of the data is in the lowest level, we don't have to move all the data to conduct the transition.
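The top-heavy idea sketched above — merge every sorted level into one compact sorted run, then bulk-load B-tree leaves from it — could look roughly like this. This is a hypothetical illustration, not the authors' implementation; `lsm_levels` and `page_size` are assumed names.

```python
import heapq

def top_heavy_build(lsm_levels, page_size):
    """Merge all sorted LSM levels into one sorted run, then chunk it
    into B-tree leaf pages. `lsm_levels` is a list of sorted
    (key, value) lists, newest level first; newer levels shadow older
    ones for duplicate keys. Hypothetical sketch, not the paper's code."""
    # Tag each entry with its level index so key ties resolve to the
    # newest (lowest-numbered) level during the k-way merge.
    tagged = [[(k, lvl, v) for k, v in level]
              for lvl, level in enumerate(lsm_levels)]
    merged, last_key = [], object()
    for k, lvl, v in heapq.merge(*tagged):
        if k != last_key:              # keep only the newest version of a key
            merged.append((k, v))
            last_key = k
    # Chunk the compact sorted array into leaf pages for bulk-loading.
    return [merged[i:i + page_size]
            for i in range(0, len(merged), page_size)]
```

Bulk-loading from a fully sorted run is what makes the top-heavy variant simple: every leaf page is written exactly once, in key order.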
01:37
Let's take a closer look at our top-heavy transition. At each step of a sort-merge, we take the entries with the lowest keys from all levels of our LSM-tree, lift them into memory, condense and sort them, and then write this single page, or set of pages, to disk as part of a B-tree. The amount of data we actually move at each step of the transition is a configurable parameter, and we'll talk more about that later. We continue this process: we take the next set of entries with the lowest keys into memory and append them to the end of the leaf level of our B-tree, until all the data in our LSM-tree is consumed and we have a complete B-tree.

The bottom-heavy approach is where we start with an LSM-tree and, as mentioned earlier, take the lowest level and turn it into a B-tree without actually having to move the data on disk. Then we look at each page in the leaf level of this B-tree, lift it into memory, take the entries in the upper levels of our LSM-tree, and check whether the ranges of their keys intersect at all with the data we just read from the B-tree. When they do, we merge them in a very efficient batch manner, and once we have those full, dense nodes constructed we write them back to disk. We repeat this until we have transferred all the data from the upper levels of the LSM-tree into the lowest level of the B-tree. This is really nice because we can avoid rewriting a lot of the data in the lowest level, as long as we don't have too many keys in the upper levels.
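The bottom-heavy batch merge might be sketched as follows: only leaf pages whose key range actually intersects some upper-level entries get rewritten, which is the source of its savings. All names (`btree_leaves`, `upper_entries`) are assumptions for illustration, not the paper's code.

```python
def bottom_heavy_merge(btree_leaves, upper_entries):
    """Batch-merge upper-level LSM entries into a B-tree built in place
    over the lowest level. `btree_leaves` is a list of sorted
    (key, value) lists; `upper_entries` is a sorted list of newer
    (key, value) pairs that shadow older ones on key conflicts.
    Returns the merged leaves and how many pages had to be rewritten."""
    out, rewritten, i = [], 0, 0
    for n, leaf in enumerate(btree_leaves):
        # This leaf owns every key below the next leaf's first key.
        hi = btree_leaves[n + 1][0][0] if n + 1 < len(btree_leaves) else None
        batch = []
        while i < len(upper_entries) and (hi is None or upper_entries[i][0] < hi):
            batch.append(upper_entries[i])
            i += 1
        if not batch:                 # key ranges don't intersect: no rewrite
            out.append(leaf)
            continue
        merged = dict(leaf)
        merged.update(batch)          # newer upper-level entries win
        out.append(sorted(merged.items()))
        rewritten += 1
    return out, rewritten
```

When the upper levels are small relative to the lowest level, most pages fall through the "no rewrite" branch, which is exactly why the bottom-heavy variant wins on bottom-heavy data.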
03:22
we actually choose which one of these transitions to execute? We developed cost models for these transitions by looking at the amount of disk I/O necessary to move the data from one structure to the other, and we found a fairly elegant condition: if the ratio of data in the upper levels of the LSM-tree to the data in the lowest level is less than some threshold, then the bottom-heavy approach is better. This matches the intuition that if the amount of data in the lowest level is relatively large, we should do the bottom-heavy approach. In this expression, E is the size of each entry in our key-value store, φ is the write cost, and P is the page size of our persistent storage device.

In order to do this transition gradually, we start with an LSM-tree, and while the transition is underway we maintain an intermediate structure: a partial LSM-tree and a partial B-tree, along with a pivot key. As queries arrive during the transition, we look on either side of the pivot to direct each query to the appropriate intermediate store, and we are done when all of the LSM-tree has been consumed into the B-tree. There is an interesting trade-off between the number of steps and the cost of each step; from our cost model we obtained a curve showing that we can amortize the cost over an arbitrary number of steps. So we can say, for example, that we want the transition to take an hour, or a day, or that the extra latency of the key-value store should not exceed 1 millisecond, and from that we can compute the proper number of steps and pace the transition so that it does not add too much overhead. OK, now let's look at the transition from a B-tree to an LSM-tree.
05:17
Here we have a B-tree with its data stored in a persistent layer, in fragmented B-tree leaf nodes. A costly approach would be to read all this data from the leaf nodes and write it out into a
05:27
contiguous region in our persistent storage, and then use this contiguous region as the bottom run of an LSM-tree. We would construct a Bloom filter, the indexing structure for an LSM-tree, over this merged data, and any subsequent inserts are simply stored on top of it. This requires that we read and rewrite all the data on disk, so it's pretty costly: on the order of 2N, where N is the size of the data.

Another approach is to take our B-tree and, instead of using the traditional B-tree storage mechanism, additionally maintain the indexing structure we use for LSM-trees: a Bloom filter. Instead of reading and rewriting all the data from the leaf nodes, we create an in-memory mapping from LSM-tree page IDs to the B-tree's physical pages. These pointers map the fragmented pages, and this in-memory indirection tricks the LSM-tree algorithm: we build the LSM-tree on top of it, the Bloom filter mentioned earlier points over it, and we essentially have an LSM-tree with an interoperable data format that we can use until the first large merge. This reduces our disk cost from 2N to zero; all we pay is a modest memory footprint, with long-range searches and the first merge slowed by only a small factor. We think this is a reasonable trade-off because we anticipate that not many range queries will occur in this case, since we transition to an LSM-tree when we have an update-heavy workload. Once the first merge happens, we switch to a regular LSM-tree and no longer incur this extra cost on long-range searches.
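The in-memory page-ID indirection might be sketched like this; a plain set stands in for a real Bloom filter, and all names are hypothetical illustrations rather than the authors' code.

```python
import bisect

class IndirectLSMRun:
    """Zero-copy B-tree-to-LSM transition: instead of rewriting the
    B-tree's fragmented leaf pages into one contiguous run, keep an
    in-memory map from logical LSM page IDs to the B-tree's physical
    page locations, plus a key filter (a plain set stands in for a
    real Bloom filter). Hypothetical sketch."""

    def __init__(self, btree_leaf_pages):
        # btree_leaf_pages: [(physical_location, [(key, value), ...]), ...]
        # in key order. Logical page i -> physical location of leaf i.
        self.page_map = {i: loc for i, (loc, _) in enumerate(btree_leaf_pages)}
        self.pages = dict(btree_leaf_pages)
        self.fence_keys = [page[0][0] for _, page in btree_leaf_pages]
        self.filter = {k for _, page in btree_leaf_pages for k, _ in page}

    def get(self, key):
        if key not in self.filter:        # "definitely absent" fast path
            return None
        # Find the logical page whose fence key covers this key, then
        # follow the in-memory indirection -- no data was rewritten on disk.
        i = bisect.bisect_right(self.fence_keys, key) - 1
        return dict(self.pages[self.page_map[i]]).get(key)
```

Lookups behave as if the run were contiguous, so the rest of the LSM-tree machinery can treat it as an ordinary bottom run until the first big merge rewrites it for real.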
07:12
This is our implementation of the transitioning data structure. Here you can see that as we move from a write-heavy to a read-heavy workload, we transition from an LSM-tree to a B-tree, and this transition is done gradually, so the latency evens out over its course. The transition from the B-tree back to the LSM-tree behaves as described, and its added time is minimal. We've achieved our best-of-both-worlds algorithm. Thank you so much for
07:39
listening.