FITing-Tree: A Data-aware Index Structure
Formal Metadata
Title | FITing-Tree: A Data-aware Index Structure
Series Title | SIGMOD 2019
Number of Parts | 155
License | CC Attribution 3.0 Germany: You may use, modify, and copy, distribute, and make the work or its content publicly accessible in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers | 10.5446/42914 (DOI)
SIGMOD 2019, 41 / 155
Transcript: English (automatically generated)
00:04
Indexing has always been a core part of database research; it's been discussed at these conferences since the beginning of time, since databases started becoming mainstream, and that's kind of evidenced by the fact that we have a whole session here devoted to indexing at SIGMOD and similarly
00:23
at other international database conferences. So it shouldn't really be surprising that there's been a lot of work around indexing both in academia and in industry and this has kind of gone along the lines of developing index structures for specific types of data like time series data or geospatial data
00:41
or trying to improve the performance, kind of the low level performance of index structures. But if we really think about the existing work around indexing, what we found is that there's one key piece that's missing if you look at all of the existing work. And that key piece is that the existing index structures
01:03
really aren't taking advantage of the data that they're operating on and in particular about the data distribution and the patterns that exist in the underlying data. So these existing index structures really aren't taking advantage of these patterns in order to be more performant and that can either be in terms of latency
01:20
or in terms of size. So as an example, my colleagues and I installed about 100 IoT sensors in our building at Brown and these monitor things like every single time a door is open or closed or there's motion detected in a classroom or the power draw of devices like projectors and printers and things like that.
01:43
So if you start looking at this data and what it actually looks like, we looked at existing index structures, how they would index this data and if you start looking at the patterns, you see some interesting patterns that come up. So along the x-axis here in this graph, we have the timestamp of an IoT event. So this is the timestamp basically of an event
02:02
like a door being opened or motion detected in a classroom. Along the y-axis, we have the position of that element in a sorted array. So I'm gonna be talking about primary key indexing for the first little bit but I'll talk about secondary indexes a little bit later as well. So what we see here from this visualization
02:21
is that we have these trends that exist in the data. We have more activity, for example, during the daytime than at nighttime. We see other patterns like activity levels change based on the day of the week when classes are in session. We see things like summer and winter break patterns and during finals period, we see some pretty strange patterns
02:41
with activity all over the place. So recently, there's been a lot of work around learned indexes, this idea of learned indexes that was presented last SIGMOD and the main idea here is to basically start to use machine learning models to replace index structures and in particular, this work looked at using
03:01
deep learning models to replace indexes. So if we think about what this model actually is, it's actually what I showed you previously on the last slide which is the key distribution here. So again, along the x-axis, you have all of the keys in your data set and then along the y-axis, you have the position of that element in a sorted array
03:23
and again, I'm gonna be talking about a clustered or a primary key index using that as an example here but our techniques I'll show can generalize to secondary indexes as well. So we can think of this distribution as the cumulative distribution function or CDF
03:41
of the observed key distribution and in this case, we can use simple probability to determine the approximate position of any element in this data set. So to do this, we can simply ask what's the probability that a random variable x is less than or equal to the key that we're looking for.
04:01
We can then multiply that probability by the number of keys in our data set and again, we get the approximate position of any element in this sorted array. So this is kind of the main idea behind FITing-Tree, which is a new index structure that we've been developing. The idea here is we wanna take advantage
04:20
of those patterns and that we're gonna use this sort of model to take advantage of these patterns. So as an outline for this talk, I'm gonna talk first about how we approximate that function or that key distribution that I was just talking about and that's our segmentation algorithm. I'm then gonna talk about how we organize the segments that we generate from that segmentation algorithm.
04:40
I'm then gonna talk about the core index operations that every index needs to support and that's looking up data elements and inserting new data elements and then finally, I'm gonna talk about evaluation and some related work. So first, we're gonna talk about how we approximate that distribution and our algorithm for doing so.
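As a toy illustration (not from the talk) of the position estimate just described, here is a minimal Python sketch; the function name and the crude uniform CDF model are assumptions made for illustration, whereas FITing-Tree approximates the CDF piecewise, as described next.

def approx_position(sorted_keys, key):
    # Estimate a key's position as P(X <= key) * N, using a crude linear
    # (uniform) CDF model over the observed key range. FITing-Tree instead
    # approximates the CDF with piecewise linear segments.
    lo, hi, n = sorted_keys[0], sorted_keys[-1], len(sorted_keys)
    cdf = (key - lo) / (hi - lo)                 # P(X <= key) under the model
    return max(0, min(n - 1, round(cdf * n)))    # clamp to valid positions

# e.g. approx_position([2, 4, 7, 9, 15], 9) gives 3, which is the key's actual index here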
05:01
So given the formulation that we've been looking at here, which is this key distribution with the timestamps along the x-axis and the position along the y-axis, FITing-Tree uses a series of piecewise linear functions to partition this key space (the timestamps in this example) into a series of disjoint linear segments.
05:23
So the reason for doing this is that it makes it really easy to model the underlying data distribution to capture those trends and importantly, we show that we can offer really good performance in terms of both lookups and inserts. So that's kind of the rationale for these linear functions.
05:42
So during the segmentation process and creating these linear functions, there's some error associated with our approximation. So more formally, we define the error as the maximum distance between the predicted and the actual location of any key in our entire dataset. So this error is a parameter that the user has full control over.
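In symbols (just restating the definition above), writing predicted(k) for the interpolated position and actual(k) for the true position of key k in the sorted array: error = max over all indexed keys k of | predicted(k) - actual(k) |, and this maximum is what the user-chosen error parameter bounds.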
06:02
They can specify the error at construction time. This error bound is always maintained throughout the entire lifetime of the index, and it's this tuning knob, under the user's control, that allows them to balance the space consumption of the index against the performance of the index.
06:20
So why do we choose to go with linear models and piecewise linear approximation? Well, there's several different reasons. The first of which is we want it to be very efficient. We wanted it to be easy to build, easy to update and easy to maintain. We wanted to support insert operations which existing work hasn't really considered. And lastly, we wanted to be able to provide
06:41
that tuning knob where users can control the worst case performance. They can actually control the performance of the index and in the paper we show how we can actually model the latency and size characteristics of an index so that they have full control over it and kind of adapt the index to the underlying data and the workload that's being run.
07:06
So in the paper we describe our segmentation algorithm which we call shrinking cone which again produces those variable size segments and I'm just gonna quickly walk you through it here. So again, the input to this algorithm is the data set
07:22
and the output is a series of these variable size segments. So here's our visualization that we've been working with so far. So the first point in our data set starts the beginning of a new segment and this cone that you see visualized here which is just a right angle for the first point
07:40
represents the space in which new data elements can be included in this segment without violating that error constraint that I talked about before. So as we add new data points to this segment, this cone gets smaller and smaller, and once a point falls outside of that cone, the error constraint would be violated
08:03
and we have to start a new segment. So in this case we start adding a second point and that's within this cone that we drew before so it can be included in the segment. We then keep adding points and again this cone shrinks and as long as this point is within this shrinking cone it's a valid segment and satisfies that error condition.
08:24
So similarly, we still have points inside this segment so we can still keep going, but then once we reach a point that lies outside of that cone, the error constraint would be violated and so we have to start a new segment. So in this case we close that segment
08:42
and that point that fell outside starts the beginning of a new segment. So this is not an optimal algorithm, in the sense that it doesn't always produce the fewest number of segments for a given data set, but what we show in the paper, through some proofs and experimental evaluation,
09:00
is that in practice it's actually really good, and the main benefit here is that it's really fast: it's a linear-time algorithm that's very simple to run, so this makes it easy to build the index in the first place and then also to add new elements, which I'm gonna talk about later. A minimal sketch of this segmentation loop (not from the talk itself) is shown below.
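Here is a greedy, one-pass Python sketch in the spirit of the shrinking cone; it is not the paper's exact pseudocode, and representing a segment as a (start key, start position, slope) triple is a simplification assumed here for illustration.

def shrinking_cone_segments(keys, error):
    # Greedy one-pass segmentation in the spirit of the shrinking cone.
    # Assumes keys are sorted and distinct; a key's position is its index.
    # Returns (start_key, start_pos, slope) triples (an illustrative sketch,
    # not the paper's exact algorithm).
    segments = []
    origin_key, origin_pos = keys[0], 0
    slope_low, slope_high = 0.0, float("inf")    # the cone starts fully open

    def close_segment():
        slope = 1.0 if slope_high == float("inf") else (slope_low + slope_high) / 2
        segments.append((origin_key, origin_pos, slope))

    for pos in range(1, len(keys)):
        dx = keys[pos] - origin_key
        slope = (pos - origin_pos) / dx
        if slope_low <= slope <= slope_high:
            # Point lies inside the cone: keep it, then shrink the cone so the
            # segment still predicts this point to within +/- error positions.
            slope_high = min(slope_high, (pos + error - origin_pos) / dx)
            slope_low = max(slope_low, (pos - error - origin_pos) / dx)
        else:
            close_segment()                       # cone violated: finish the segment
            origin_key, origin_pos = keys[pos], pos
            slope_low, slope_high = 0.0, float("inf")

    close_segment()
    return segments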
09:23
So now I'm gonna move on and talk about how we organize these variable-size segments internally. So after the segmentation process and that algorithm I just talked about, we're left with this series of variable-size segments, and what we use in the paper and describe in our basic prototype
09:40
is we just use a simple B+ tree to index these segments, and this is just to be able to find the corresponding segment for a single data point. So this is again just to organize these segments: to be able to look up data elements, to be able to insert new elements, and to find the corresponding segment for a key. So we used the STX tree
10:01
which is a basic, open-source, widely used B+ tree implementation, because it offers pretty reasonable performance for lookups and inserts. But in the paper and in our experiments we actually show that we can use any other index structure or data structure internally to organize these segments, and so depending on the workload
10:20
for example if it's read only or if it's write heavy we can change this index structure and see different results. So now I'm gonna move on and talk a little bit about the core index operations that we wanted to be able to support which are looking up a data element and importantly inserting a new data element.
10:42
So let's say, for example, you wanted to look up the position of an element, in this case key 12 in our data set. So the first step is to ask that internal data structure, the B+ tree in this case, for the corresponding segment where that data item belongs.
11:03
So then since as I talked about before each segment is linear and is defined by that slope and we know a corresponding error we can interpolate within that segment to the approximate position of where that data element should be. So in this case for the data item 12
11:21
it should be somewhere around the middle of this segment, so we get the approximate position of that element, and then, since the error is predefined and is a constant, we have a bounded region that we need to search in order to find that element if it exists in the data. To do this we usually use binary search, but actually you can use any
11:43
search algorithm: linear search if the error is small, or exponential search; really, you can swap in any search algorithm there. A minimal sketch of this lookup path (not from the talk itself) is shown below.
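This sketch continues the segment representation from the earlier one; a sorted list of segment start keys plus bisect stands in for the internal B+ tree, and the function name is an assumption made for illustration.

import bisect

def lookup(keys, segments, error, key):
    # Locate the segment for `key`, interpolate to a predicted position, then
    # search only the error-bounded window around it. Assumes key >= keys[0]
    # and that `segments` comes from shrinking_cone_segments above.
    starts = [s[0] for s in segments]
    start_key, start_pos, slope = segments[bisect.bisect_right(starts, key) - 1]

    pred = round(start_pos + slope * (key - start_key))   # approximate position

    lo = max(0, pred - error)                  # the bounded search region:
    hi = min(len(keys), pred + error + 1)      # at most 2 * error + 1 slots
    i = bisect.bisect_left(keys, key, lo, hi)  # binary search inside the window
    return i if i < hi and keys[i] == key else None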
12:02
So in addition to lookups, inserts are really a crucial piece of any index structure, and the other work in the past really hasn't focused too much on insert performance for these index structures that model the data distribution. So if we think about the way B+ trees do it, each page basically has some space reserved in the page to handle newly inserted data items
12:22
but if we think about the way FITing-Tree works, we need to do things a little bit differently, and there are a few reasons why. So first of all, it's very possible that these linear segments get to be quite large, so doing in-place updates, where you need to move data or copy data to free space, can get very expensive
12:43
if these segments are large and have a lot of elements and that can happen because the data is roughly linear or it could also happen because the error is really large so in essence the in place updates can be pretty expensive. And lastly and most importantly when inserting a new data item
13:00
we always have to make sure that we satisfy that error condition that the user specified when they created the index in the first place. So in the paper we talk about two different insert strategies, the first of which is in-place inserts, which I just talked about briefly and which is really impractical in many situations, but for small data sets
13:22
we show the trade-off between this and our other strategy. So the basic idea here is, similar to a B+ tree, we're gonna have free space at the beginning and at the end of each page, and to insert a new data element we find where that new element should go in this data structure, here in this segment,
13:42
and we simply shift the elements either to the left or to the right, depending on which side is closer and would result in copying the fewest elements. I talked before about why this is potentially bad: segments with a large error, or where the data is roughly linear, can get really big, and then moving all of these elements
14:01
can be very expensive. So we insert the data item and copy everything over. Once we have filled up all of the free space within a page, what we do is basically resegment: run the shrinking cone algorithm I talked about before to split the segment into a series
14:22
of either one, two or maybe even several new segments each of which has free space at the beginning and at the end of the page that can handle updates. So we also presented a delta insert strategy that tries to combat that problem that I talked about before in that each segment here has a buffer
14:43
that is kept sorted and that we insert the new data items into. So there's no copying that has to go on here; what we do is simply insert data items into this buffer, and once that buffer gets filled up we resegment the segment again, just like before, and create a series
15:02
of new segments, each of which has space to accommodate new data items. A rough sketch of this delta-insert idea (not from the talk itself) is shown below.
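In this rough Python sketch of the delta-insert idea (again not from the paper: the Segment class, its fields, the buffer size, and the resegment callback are all assumptions made for illustration), each segment keeps a small sorted buffer, and when the buffer fills up the segment's keys and buffered keys are merged and re-segmented. A real lookup would also have to check the buffer, which is omitted here.

import bisect

class Segment:
    # One linear segment plus a small sorted delta buffer for newly inserted keys.
    def __init__(self, keys, slope, buffer_size=16):
        self.keys = keys                  # the segment's sorted keys
        self.slope = slope
        self.buffer = []                  # kept sorted
        self.buffer_size = buffer_size

    def insert(self, key, resegment):
        # Delta insert: place the key in the sorted buffer (no shifting of the
        # segment's data); once the buffer is full, merge everything and re-run
        # the segmentation algorithm via the `resegment` callback.
        bisect.insort(self.buffer, key)
        if len(self.buffer) >= self.buffer_size:
            merged = sorted(self.keys + self.buffer)
            return resegment(merged)      # returns one or more new segments
        return [self]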
15:21
So now I'm gonna move on and talk a bit about our evaluation and our related work. It shouldn't be surprising that, since FITing-Tree is pretty data-dependent, the data that we actually use to evaluate our approach is really important, and for that we use three different data sets. The first is the IoT data set that I talked about and have been using throughout the talk here. The second is a web logs data set that contains entries, or hits, to our department website for the past 14 years, so you see some interesting patterns there
15:42
that you don't see in the IoT data set. The last data set is OpenStreetMap, which basically contains the location of various features around the world, things like coffee shops or museums, and you also see some different trends in that data set. So a few quick results here,
16:01
and these are from the paper. What we have here is we compared FITing-Tree to an index that uses fixed-size paging; as I mentioned before, that's where you basically take every n data elements, create a page, and index that page, and for that we use the same internal organization structure that we use in our tree,
16:22
so in that case the STX tree. So: FITing-Tree compared to an index that uses fixed-size paging, as well as what I'll call a full index or a dense index, where every single key is present in the index, and then we have binary search as a baseline here.
16:40
So in all of these graphs, what you see on the x-axis is the index size, and that's achieved by tuning either the error for our index or the page size for a B+ tree that uses fixed-size paging, and then you see the lookup performance along the y-axis here. So the main takeaways from our evaluation
17:01
(and there's a lot more in the paper) are that we can effectively trade off the performance and the size of our index structure to really adapt to a given workload. We show that we're able to achieve order-of-magnitude space savings compared to both a full index and an index that uses fixed-size paging
17:20
using these real world data sets and that's by again taking advantage of these trends that exist in the data and the performance is actually surprisingly good. We're able to achieve performance that's pretty comparable to that of a full index which can be thought of as the best case. So there's a lot more in the paper that I wasn't able to talk about in this talk
17:40
mostly around secondary indexes. We have techniques for that that I wasn't able to talk about. We have proofs in our paper that show how far from the optimal solution we are in our segmentation algorithm. We present in the paper a cost model that allows the user to specify either a latency requirement or a size constraint
18:01
and then this cost model will choose a corresponding error that they can use for the index that will satisfy those conditions and then there's a lot more evaluation on our insert performance and other metrics there. So with that I'm happy to take any questions.