
Random Sampling and Size Estimation Over Cyclic Joins


Formal Metadata

Title
Random Sampling and Size Estimation Over Cyclic Joins
Series Title
Number of Parts
25
Author
License
CC Attribution 3.0 Germany:
You may use, modify, and reproduce, distribute, and make the work or its content publicly available in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
Computing joins is expensive, and often unnecessary when the output size is large. In 1999, Chaudhuri et al. [Surajit Chaudhuri et al., 1999] posed the problem of random sampling over joins as a potentially effective approach to avoiding computing the join in full, while obtaining important statistical information about the join results. Unfortunately, no significant progress has been made in the last 20 years, except for the case of acyclic joins. In this paper, we present the first non-trivial result on sampling over cyclic joins. We show that after a linear-time preprocessing step, a join result can be drawn uniformly at random in expected time O(IN^ρ/OUT), where IN^ρ is known as the AGM bound of the join and OUT is its output size. This result holds for all joins on binary relations, as well as certain joins on relations of higher arity. We further show how this algorithm immediately leads to a join size estimation algorithm with the same running time.
Transcript: English (automatically generated)
Okay, the title of our paper is Random Sampling over Cyclic Joins. I assume everyone knows what a join query is. What I'm showing here is what is called a two-triangle join, which consists of five attributes, a, b, c, d, e, and six relations. In this example every relation has two attributes, so this is also called an arity-two join. Here I'm showing an instance of the query, where every edge represents a tuple and every vertex represents a value in the domain of the corresponding attribute. I write IN for the total input size and OUT for the total output size, i.e., the join size. The problem we want to study in this paper is the following: we want to preprocess the database instance, ideally in linear time, so that upon request, a tuple chosen uniformly and independently from the join results can be returned efficiently.
So for example, if I ask a query, this join result, shown in blue, may be returned. If I ask multiple times, then multiple samples are returned independently. So why are we interested in this problem? First of all, this is a theoretically clean problem; it was indeed first posed in 1999. Second, computing the full join can be rather expensive. If the join is acyclic, it takes O(IN + OUT) time. When the query is cyclic, there are two approaches: one is based on the AGM bound IN^rho, where rho is the fractional edge cover number of the query, and I will come back to this later in the talk. The other is a width-dependent bound IN^w, where w essentially measures how cyclic the join is. For example, for the two-triangle join, rho is 2.5 and w is 1.5.
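As a quick sanity check on that value of rho (a worked example of my own, not from the talk), here is the fractional edge cover LP for the two-triangle join, taking the triangles to be {a, b, c} and {c, d, e} sharing the vertex c, with the six relations as edges ab, bc, ca, cd, de, ec:

```latex
% Fractional edge cover LP: minimize total edge weight so that every
% attribute (vertex) is covered by total incident weight at least 1.
\[
\rho \;=\; \min \sum_{e \in E} w_e
\quad \text{s.t.} \quad
\sum_{e \ni v} w_e \ge 1 \;\; \forall v, \qquad w_e \ge 0 .
\]
% Feasible cover: w_{ab} = w_{bc} = w_{ca} = 1/2 (covers a, b, c)
% and w_{de} = 1 (covers d, e), total 3/2 + 1 = 5/2.  The fractional
% vertex packing x_a = \dots = x_e = 1/2 (total 5/2, feasible since
% every edge gets x_u + x_v = 1) certifies optimality by LP duality.
\[
w_{ab} = w_{bc} = w_{ca} = \tfrac12,\; w_{de} = 1
\;\Rightarrow\; \rho = \tfrac52,
\qquad \text{so } \mathrm{OUT} \le \mathrm{IN}^{2.5}.
\]
```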
But in the worst case, the output size can be as large as the AGM bound IN^rho itself, so computing the full join can be very expensive. However, in many applications, we don't really need the full join results.
There are many such applications. For example, in this paper we also show how to compute an estimate of the join size using only a constant number of samples. Another application is so-called online aggregation, where we want to answer an aggregate query but an exact answer is not required; this framework shows that if you can do sampling, you can return more and more accurate answers as more samples are drawn. Finally, there is a recent direction called in-database learning. In this setting, we assume that the training data is represented as the join result of several relations. A dominant training algorithm, stochastic gradient descent, essentially relies only on samples, so if you can sample from the join, you can also run stochastic gradient descent over training data defined as a join of several relations.
The 1999 paper indeed already proposed an algorithm for sampling from a two-relation join, say R1(A, B) ⋈ R2(B, C). Their observation is the following: if you want to sample uniformly from the join results, you cannot simply sample uniformly from the two relations, because the tuples are not all equally important. So they proposed an idea called weighted sampling, which is actually very natural. For example, in this instance, one tuple in R1, the first one (a0, b0), joins with n tuples on the other side, so it is just more important than the rest. The idea is therefore to give the tuples weights, where the weight of a tuple is exactly its degree, i.e., how many tuples it can join with on the other side. We start by sampling a tuple in R1 with probability proportional to its weight: the first tuple is sampled with probability one half, and each of the other n tuples with probability 1/(2n), so the weight given to the first tuple is n times that of the rest. Suppose the first tuple is indeed sampled; after that, we want to sample one of its neighbors. The neighbors in this case all have the same weight, so each is sampled with probability 1/n. If you work out the math, you can convince yourself that with these weights, each join result is sampled with probability 1/(2n), all the same.
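Here is a minimal Python sketch of this two-relation weighted sampler. The list-of-pairs encoding, the function names, and the toy instance are my own assumptions, not from the paper; random.choices here takes linear time per draw, which the constant-time technique discussed next avoids.

```python
import random
from collections import defaultdict

def preprocess(r1, r2):
    """Index R2 by the join attribute B and weight each R1 tuple by its degree."""
    r2_by_b = defaultdict(list)
    for b, c in r2:
        r2_by_b[b].append(c)
    # weight of (a, b) in R1 = number of R2 tuples it joins with
    weighted_r1 = [((a, b), len(r2_by_b[b])) for a, b in r1 if r2_by_b[b]]
    return r2_by_b, weighted_r1

def sample_one(r2_by_b, weighted_r1):
    """Draw one tuple uniformly at random from R1(A,B) ⋈ R2(B,C)."""
    tuples, weights = zip(*weighted_r1)
    a, b = random.choices(tuples, weights=weights, k=1)[0]  # degree-weighted step
    c = random.choice(r2_by_b[b])  # uniform among the degree-many partners in R2
    return (a, b, c)

# Toy instance mirroring the talk: (a0, b0) has degree n, the other n tuples degree 1.
n = 4
r1 = [("a0", "b0")] + [(f"a{i}", f"b{i}") for i in range(1, n + 1)]
r2 = [("b0", f"c{j}") for j in range(n)] + [(f"b{i}", "c0") for i in range(1, n + 1)]
print(sample_one(*preprocess(r1, r2)))
```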
Okay, how do we implement this algorithm? There is a standard weighted sampling technique which guarantees that, after linear-time preprocessing, a weighted sample can be drawn in constant time. So the final result is that after linear preprocessing time, you can draw uniform samples from this join in constant time.
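The talk doesn't name the technique; one standard choice achieving exactly this guarantee is Walker's alias method, sketched below under that assumption.

```python
import random

class AliasSampler:
    """Walker's alias method: linear-time build, constant-time weighted draws."""
    def __init__(self, weights):
        n = len(weights)
        total = float(sum(weights))
        scaled = [w * n / total for w in weights]
        self.prob = [1.0] * n          # acceptance probability per column
        self.alias = list(range(n))    # fallback index per column
        small = [i for i, s in enumerate(scaled) if s < 1.0]
        large = [i for i, s in enumerate(scaled) if s >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.prob[s] = scaled[s]      # keep s with probability scaled[s] ...
            self.alias[s] = l             # ... otherwise redirect to l
            scaled[l] -= 1.0 - scaled[s]  # l donates the leftover mass
            (small if scaled[l] < 1.0 else large).append(l)

    def draw(self):
        i = random.randrange(len(self.prob))  # uniform column, then biased coin
        return i if random.random() < self.prob[i] else self.alias[i]
```

In the two-relation sketch above, building an AliasSampler over weighted_r1 and replacing random.choices with its draw method makes each sample constant time after the linear-time build.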
The same idea can be extended to a multi-way join, say a line join, in this case R1 ⋈ R2 ⋈ R3. The idea is to use the residual join sizes as the weights, i.e., the number of join results a tuple can lead to. For example, every tuple in the last relation leads to exactly one join result, because there is nothing further on the right side. In R2, this tuple can join with four tuples in R3, so it leads to four join results; this one leads to three, and this one to two. And this tuple in R1 leads to nine, because it can join with four plus three plus two results, and similarly for this one. So the idea is to start from R1 and do a weighted random walk: we sample each of these two tuples with probability 9/18, then from there we pick one of these three tuples with probability proportional to four, three, and two, and so on. This can be further extended to any acyclic join. The observation is that all the residual join sizes can be computed by dynamic programming, which is actually a fairly standard technique.
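A minimal sketch of that dynamic program and the weighted walk for a line join of binary relations follows; the encoding and function names are my own assumptions. Each walk step below costs time proportional to the degree; composing it with a constant-time weighted sampler such as the alias structure above gives the constant-time-per-step guarantee.

```python
import random
from collections import defaultdict

def preprocess_line_join(relations):
    """Right-to-left dynamic program over R1(A0,A1) ⋈ R2(A1,A2) ⋈ ...:
    weight[i][t] = residual join size of tuple t (number of results it leads to);
    index[i] maps a join value to the tuples of relations[i] starting with it."""
    k = len(relations)
    weight = [dict() for _ in range(k)]
    index = [defaultdict(list) for _ in range(k)]
    for i, rel in enumerate(relations):
        for t in rel:
            index[i][t[0]].append(t)
    for t in relations[-1]:
        weight[-1][t] = 1              # nothing further to the right
    for i in range(k - 2, -1, -1):
        succ = defaultdict(int)        # total residual weight in R_{i+1}, per join value
        for t in relations[i + 1]:
            succ[t[0]] += weight[i + 1][t]
        for t in relations[i]:
            weight[i][t] = succ[t[1]]
    return weight, index

def sample_line_join(relations, weight, index):
    """Weighted random walk from R1 rightwards: each step picks a joining tuple
    with probability proportional to its residual join size."""
    first = [t for t in relations[0] if weight[0][t] > 0]
    t = random.choices(first, weights=[weight[0][u] for u in first], k=1)[0]
    out = [t]
    for i in range(1, len(relations)):
        cands = index[i][t[1]]
        t = random.choices(cands, weights=[weight[i][u] for u in cands], k=1)[0]
        out.append(t)
    return out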
Now we come to the main problem we are trying to solve in this paper: what if the join is cyclic? The difficulty in extending the previous idea to cyclic joins is that we don't know how to compute all the residual join sizes in linear time; we can only do that when the join is acyclic. The idea for getting around this difficulty is to use upper bounds on the residual join sizes to guide the sampling process. But because we use overestimates, we may not always succeed. More precisely, the success probability of reaching a true join result in the random walk process is the actual join size OUT over our overestimate, the AGM bound of the query. Therefore the expected number of trials until we succeed is the inverse of the success probability, so the expected sampling time will be the AGM bound over OUT, i.e., O(IN^rho / OUT). The key observation is that the residual AGM bounds can be computed in constant time per residual join, and there are only a linear number of residual joins, so we can compute all the residual AGM bounds in linear preprocessing time. After that, our expected sampling time is the AGM bound over OUT. This result holds for all arity-two joins, namely when the query is a graph; it also holds for certain higher-arity joins, but because of certain technical difficulties, we cannot extend it to all higher-arity joins.
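To make the success-probability argument concrete, here is a generic sketch of such a bound-guided walk. This is entirely my own rendering, not the paper's pseudocode: candidates, bound, and is_result are assumed helpers, and the paper instantiates bound with the residual AGM bounds. The assumed invariants are that bound(partial) is at least the sum of bound(partial extended by v) over all candidate values v, and that bound of a complete join result is 1; under these, each attempt returns every join result with probability exactly 1/bound({}), so the expected number of attempts is bound({})/OUT, the AGM-over-OUT time from the talk.

```python
import random

def attempt(attrs, candidates, bound, is_result):
    """One guided random walk; returns a uniform join result or None (rejection)."""
    partial = {}
    u = bound(partial)  # for the empty assignment: the AGM bound of the whole query
    for attr in attrs:
        exts = {v: bound({**partial, attr: v}) for v in candidates(partial, attr)}
        r = random.uniform(0.0, u)  # the slack u - sum(exts) is the rejection mass
        for v, b in exts.items():
            if r < b:               # pick v with probability b / u
                partial[attr] = v
                u = b
                break
            r -= b
        else:
            return None             # fell into the slack of the overestimate: reject
    return partial if is_result(partial) else None

def sample_cyclic_join(attrs, candidates, bound, is_result):
    """Retry until an attempt succeeds; expected attempts = bound({}) / OUT."""
    while True:
        res = attempt(attrs, candidates, bound, is_result)
        if res is not None:
            return res
```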
Finally, I'd like to conclude with two open problems. The first one, of course, is to extend this result to all higher-arity joins. A good starting point would be the LW (Loomis-Whitney) join: for the LW join, the fractional edge cover number is 4/3, but we cannot achieve this ideal bound yet; our bound is a factor of IN larger, so it would be nice to get rid of that factor. The second, and I think more interesting, direction is to study width-dependent bounds. If you look at full join computation, the acyclic and cyclic cases are essentially the two extremes; however, by using the GHD (generalized hypertree decomposition) framework, we can get a running time that depends on the cyclicity of the query: w is 1 when the query is acyclic and grows for highly cyclic queries. For the join sampling problem, in the acyclic case we can achieve very nice results, namely linear-time preprocessing and constant sampling time. For cyclic ones, at one extreme we can compute the full join, which takes IN^rho time, and then do sampling very efficiently; what we have shown in this paper is that we don't have to compute the full join, and with only linear preprocessing we can still achieve non-trivial sampling time. However, for queries that are not highly cyclic but not acyclic either, we currently only have the naive solution of materializing every node in the join tree, which needs IN^w preprocessing time and space, and then applying the acyclic result to sample in constant time. It would be very nice to achieve a width-dependent bound on the sampling time given only linear-time preprocessing. That's all for the talk, thank you.