Random Sampling and Size Estimation Over Cyclic Joins
Formal Metadata

Title | Random Sampling and Size Estimation Over Cyclic Joins
Number of parts | 25
License | CC Attribution 3.0 Germany: You may use, change and reproduce, distribute and make the work or its content publicly available in unchanged or changed form for any legal purpose, provided the author/rights holder is credited in the manner specified by them.
Identifiers | 10.5446/46833 (DOI)
Transcript: English (automatically generated)
00:00
Okay, the title of our paper is Random Sampling over Cyclic Joins. I assume everyone knows what a join query is. What I'm showing here is what is called a two-triangle join, which consists of five attributes, A, B, C, D, E, and six relations. In this example, every relation has two attributes, so this is also called an
00:23
arity-2 join. Here I'm showing an instance of the query, where every edge represents a tuple and every vertex represents a value in the domain of the corresponding attribute. I write IN for the total input size and OUT for the total output size, that is, the join
00:44
size. The problem we want to study in this paper is the following: we want to preprocess the database instance, ideally in linear time, so that upon request, a tuple chosen uniformly and independently from the join results can be returned efficiently.
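For concreteness, one consistent way to write such a two-triangle join is the following (the attribute assignment is an assumption; the talk fixes only the five attributes and six binary relations):

```latex
Q \;=\; R_1(A,B) \bowtie R_2(B,C) \bowtie R_3(A,C) \bowtie R_4(C,D) \bowtie R_5(D,E) \bowtie R_6(C,E)
```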
01:01
For example, if I ask a query, this join result, shown in blue, may be returned. If I ask multiple times, multiple samples are returned independently. So why are we interested in this problem? First of all, it is a theoretically clean problem; it was in fact first proposed in 1999.
01:23
Second, computing the full join can be rather expensive, right? If the join is acyclic, it takes IN + OUT time. When the query is cyclic, there are two approaches: one is based on the AGM bound, IN^rho,
01:41
where rho is the fractional edge cover number; I will come back to this later in the talk. The other is the width-dependent bound, IN^w + OUT, where w essentially measures how cyclic the join is. For example, for the two-triangle join, rho is 2.5 and w is 1.5.
02:01
However, in the worst case, the output size can be as large as the AGM bound, IN to the power rho, so computing the full join can be very expensive, at least in the worst case.
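In symbols, the three running times just mentioned, with the two-triangle values from the talk:

```latex
\underbrace{O(\mathrm{IN} + \mathrm{OUT})}_{\text{acyclic}}, \qquad
\underbrace{O(\mathrm{IN}^{\rho})}_{\text{AGM bound}}, \qquad
\underbrace{O(\mathrm{IN}^{w} + \mathrm{OUT})}_{\text{width-based}};
\qquad \rho = 2.5,\; w = 1.5 \text{ for the two-triangle join.}
```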
02:24
However, in many applications, we don't really need the full join results. There are many such applications. For example, in this paper we also show how to compute an estimate of the join size using only a constant number of samples. Another application is so-called online aggregation, where we want to answer
02:44
an aggregate query, but exact answers are not required. This framework shows that if you can do sampling, you can return more and more accurate answers as more samples are drawn.
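A minimal sketch of this refinement loop (my own illustration, assuming a `sample()` function that returns one uniform join result):

```python
def online_avg(sample, f, num_samples=1000):
    """Online aggregation sketch: estimate AVG(f) over the join
    results using only uniform samples, refining as we go.

    `sample` is assumed to draw one uniform join result; `f` maps a
    join result to a number. Yields the running estimate, so a user
    can stop as soon as it looks accurate enough.
    """
    total = 0.0
    for i in range(1, num_samples + 1):
        total += f(sample())
        yield total / i  # current estimate of the average
```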
03:01
Finally, there is a new research direction recently called in-database learning. In this setting, we assume that the training data is represented as the join results of several relations. A dominant training algorithm in machine learning is stochastic gradient descent,
03:25
which essentially relies only on samples. So if you can sample from the join, you can also run stochastic gradient descent over training data that is defined as a join of several relations.
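A minimal sketch of this idea (again my illustration; `sample`, `grad`, and the model representation are placeholders, not anything from the paper):

```python
def sgd_over_join(sample, grad, theta, steps=10_000, lr=0.01):
    """Stochastic gradient descent where each training example is a
    uniform sample from the join, so the full join results never
    have to be materialized. `sample()` draws one uniform join
    result; `grad(theta, x)` returns the loss gradient on example x.
    """
    for _ in range(steps):
        x = sample()
        theta = [t - lr * g for t, g in zip(theta, grad(theta, x))]
    return theta
```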
03:41
The 1999 paper indeed already proposed an algorithm for sampling from a two-relation join, say R1(A,B) join R2(B,C). Their observation is the following: if you want to sample uniformly from all the join results, you cannot just sample uniformly from the two relations separately, because the
04:05
importance of the tuples is not the same. So they proposed an idea called weighted sampling, which is actually very natural. For example, in this instance, one tuple in R1, the first one, joins with n tuples
04:25
on the other side. This tuple, (a0, b0), is simply more important than the rest. Therefore, the idea is to give the tuples weights, where the weight of a tuple is exactly its so-called degree, namely, how many tuples it can join with
04:43
on the other side. Concretely, we start by sampling a tuple in R1, with the sampling probabilities set as follows: the probability of the first tuple is one half, and the probability of each of the other n tuples is one over
05:04
2n. So the weight given to the first tuple is n times that of the rest. Suppose the first tuple is indeed sampled; after that, we want to sample one of its neighbors.
05:20
The neighbors in this case all have the same weight, so each is sampled with probability 1/n after the first tuple in R1 has been sampled. If you work out the math, you can convince yourself that with these weights, each join result is sampled with probability 1/(2n), which is the same for all.
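Here is a minimal sketch of this degree-weighted scheme (an illustration with an assumed pair-list representation, not the paper's code):

```python
import random
from collections import defaultdict

def make_join_sampler(r1, r2):
    """Uniform sampling from r1(A,B) JOIN r2(B,C).

    r1 and r2 are lists of (a, b) and (b, c) pairs. Pre-processing
    (linear time): index r2 by the join attribute B and weight each
    r1-tuple by its degree, i.e. how many r2-tuples it joins with.
    Sampling: weighted choice from r1, then a uniform neighbor, so
    every join result comes out with the same probability 1/OUT.
    Assumes the join is nonempty.
    """
    by_b = defaultdict(list)
    for b, c in r2:
        by_b[b].append(c)
    weights = [len(by_b[b]) for _, b in r1]  # degree of each r1-tuple

    def sample():
        # Step 1: degree-weighted sample from r1. Note that
        # random.choices re-scans the weights on every call; the
        # alias method shown below makes this a true O(1) step.
        a, b = random.choices(r1, weights=weights, k=1)[0]
        # Step 2: uniform sample among its join partners in r2.
        return (a, b, random.choice(by_b[b]))

    return sample
```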
05:47
Okay, how do we implement this algorithm? There is a standard weighted sampling algorithm: after linear-time pre-processing, a weighted sample can be drawn in constant time. So the final result is that
06:04
after linear pre-processing time, you can draw uniform samples in constant time.
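The talk doesn't name the routine, but Vose's alias method is one standard weighted-sampling algorithm with exactly this guarantee; a minimal sketch:

```python
import random

class AliasSampler:
    """Vose's alias method: O(n) pre-processing, O(1) per weighted
    sample. One standard way to realize the 'linear pre-processing,
    constant time per sample' routine mentioned in the talk.
    """
    def __init__(self, weights):
        n = len(weights)
        total = sum(weights)
        scaled = [w * n / total for w in weights]
        self.prob = [0.0] * n
        self.alias = [0] * n
        small = [i for i, s in enumerate(scaled) if s < 1.0]
        large = [i for i, s in enumerate(scaled) if s >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            # Index s keeps probability scaled[s]; the leftover
            # mass of its slot is redirected to alias l.
            self.prob[s] = scaled[s]
            self.alias[s] = l
            scaled[l] -= 1.0 - scaled[s]
            (small if scaled[l] < 1.0 else large).append(l)
        for i in small + large:
            self.prob[i] = 1.0

    def sample(self):
        # Pick a slot uniformly, then keep it or take its alias.
        i = random.randrange(len(self.prob))
        return i if random.random() < self.prob[i] else self.alias[i]
```

Plugging `AliasSampler` into the join sampler above replaces the O(n) `random.choices` call with an O(1) draw.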
06:21
The same idea can be extended to a multi-way join, say a line join, in this case R1, R2, and R3. The idea is to use the residual join sizes as the weights, i.e., the number of join results a tuple can lead to. For example, every tuple in the last relation leads to only one join result, because there is nothing further on the right side. For a tuple
06:47
in R2, say this one, it can join with four tuples in R3, so it leads to four join results; this one leads to three, and this one leads to two,
07:00
while this tuple in R1 leads to nine join results, because it can join with 4 + 3 + 2 results, and similarly for this one. So the idea is that we can start from R1 and do a weighted random walk: we sample each of these two tuples with probability 9/18, and
07:21
from there, we pick one of these three tuples with probability proportional to 4, 3, and 2, et cetera. This can be further extended to any acyclic join; the observation is that all the residual join sizes can be computed by dynamic programming, actually a fairly standard technique.
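A sketch of this dynamic program for the line-join case (same assumed pair-list representation as above):

```python
from collections import defaultdict

def residual_sizes_line_join(relations):
    """Residual join sizes for a line join R1(A1,A2) ⋈ R2(A2,A3) ⋈ ...

    Dynamic programming from the last relation backwards: a tuple's
    weight is the total weight of the tuples it joins with on the
    right. A sketch for the line-join case only; `relations` is a
    list of lists of (left_value, right_value) pairs.
    """
    k = len(relations)
    weights = [None] * k
    # Tuples in the last relation each lead to exactly one join result.
    weights[k - 1] = {t: 1 for t in relations[k - 1]}
    for i in range(k - 2, -1, -1):
        # Total weight of R_{i+1}-tuples, grouped by their left value.
        succ = defaultdict(int)
        for t, w in weights[i + 1].items():
            succ[t[0]] += w
        # A tuple (x, y) in R_i leads to succ[y] join results.
        weights[i] = {t: succ[t[1]] for t in relations[i]}
    return weights
```

With these weights, the weighted random walk from R1 returns every join result with equal probability.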
07:47
Now we come to the main problem we are trying to solve in this paper, namely: what if the join is cyclic? The difficulty in extending the previous idea to cyclic joins
08:00
is that we don't know how to compute all the residual join sizes in linear time; we can only do that when the join is acyclic. The idea for getting around this difficulty is to use upper bounds on the residual join sizes to guide the sampling
08:21
process. But because we use an overestimate, we may not always succeed. More precisely, the success probability of reaching a true join result in the random walk process is the actual join size OUT
08:46
over our overestimate, the AGM bound of the query. Therefore, the expected number of trials until we succeed is the inverse of the success probability, so the expected sampling time will be the AGM bound over OUT.
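To make the rejection mechanism concrete, here is a simplified sketch for a triangle join. It guides the walk with a cruder overestimate (the degree in R2) instead of the residual AGM bounds the paper uses, but the accept/reject logic and the uniformity argument are the same:

```python
import random
from collections import defaultdict

def make_triangle_sampler(r1, r2, r3):
    """Rejection sampling for R1(A,B) JOIN R2(B,C) JOIN R3(C,A).

    The residual count of an R1-tuple (a, b) is bounded above by
    deg_R2(b). Each candidate walk (a, b, c) is reached with the
    same probability 1/U, where U is the sum of all bounds, so the
    accepted samples are uniform over the true triangles, and a
    trial succeeds with probability OUT/U. Assumes the join is
    nonempty (otherwise this loops forever).
    """
    r2_by_b = defaultdict(list)
    for b, c in r2:
        r2_by_b[b].append(c)
    r3_set = set(r3)  # fast membership test for (c, a)
    # Upper bound on the residual join size of each R1-tuple.
    ubounds = [len(r2_by_b[b]) for _, b in r1]

    def sample():
        while True:  # retry until a true join result is reached
            a, b = random.choices(r1, weights=ubounds, k=1)[0]
            c = random.choice(r2_by_b[b])  # uniform among candidates
            if (c, a) in r3_set:           # accept only true results
                return (a, b, c)

    return sample
```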
09:08
Then the observation is that we can compute each residual AGM bound in constant time per residual join, and there are only a linear number of residual joins, so in the pre-processing we can compute all the residual
09:20
AGM bounds in linear time. After that, our expected sampling time will be as stated, the AGM bound over OUT. This result holds for all arity-2 joins, namely when the query is a graph. It also holds for certain higher-arity joins, but because of certain technical difficulties, we cannot extend it to all higher-arity joins.
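In symbols, the guarantee just described (as stated in the talk; constant factors and the empty-join corner case are glossed over here):

```latex
\text{pre-processing: } O(\mathrm{IN}), \qquad
\mathbb{E}[\text{time per uniform sample}] \;=\; O\!\left(\frac{\mathrm{AGM}(Q)}{\mathrm{OUT}}\right)
```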
09:47
Finally, I'd like to conclude with two open problems. The first one, of course, is to extend this result to all higher-arity joins. A good starting point would be the LW (Loomis-Whitney) join, for which the fractional edge cover is 4/3;
10:05
however, we cannot achieve this ideal bound yet, as our bound is a factor of n larger, so it would be nice to get rid of this gap. The second, and I think more interesting, direction is to study width-dependent bounds. Because if you look at
10:24
full join computation, then for acyclic and cyclic queries we have essentially two extremes; however, by using the GHD, the generalized hypertree decomposition framework, we can get a running time that depends on the cyclicity
10:42
of the query: w is actually 1 when the query is acyclic and grows as the query becomes more cyclic. For the join sampling problem, in the acyclic case we can achieve very nice results: linear-
11:04
time pre-processing and constant sampling time. For cyclic ones, at one extreme we can compute the full join, which takes IN^rho time, and then do sampling very efficiently; what we have shown in this paper is that we don't have to compute the full join: with only linear pre-processing, we can still achieve a
11:24
non-trivial sampling time. However, for queries that are not highly cyclic, but not acyclic either, we currently only have a naive solution, which is to materialize every node in the join tree,
11:43
requiring IN^w pre-processing time and space; from there, we can just apply the acyclic result and get constant sampling time. It would be very nice if we could achieve a width-dependent bound on the sampling time given only linear-time pre-processing.
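To recap the landscape just described in one place:

Query class | Pre-processing | Time per sample
Acyclic | linear | constant
Cyclic, arity 2 (this paper) | linear | AGM/OUT expected
In between, via GHD (naive) | IN^w time and space | constant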
12:03
Okay, that's all for the talk. Thank you.