Pessimistic Cardinality Estimation
Formal Metadata
Title: Pessimistic Cardinality Estimation
Series: SIGMOD 2019 (Part 76 of 155)
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/42957
Transcript: English (auto-generated)
00:02
Today we'll be talking about pessimistic query optimization, our project on how to generate tighter upper bounds on intermediate join cardinalities. The problem that we're focusing on today is the classic problem of cardinality estimation in multi-join queries, in particular when some of the relations in a multi-join query are connected by foreign-key/foreign-key joins.
00:23
Because in that case, some of the intermediate products might actually be larger than some of the underlying base tables themselves, and this blow-up is what really threatens long runtimes. So, in particular, the mistakes that optimizers still make are that they assume, for instance,
00:41
uniformity across attribute value distributions in a table. Real-world data shows us that in fact there is skew. They also assume that there is independence between the distributions of joining columns. Again, real-world data shows that in fact there is correlation. These strong assumptions about the underlying data lead to underestimation.
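To see why these assumptions cause underestimation, consider the textbook equijoin estimate used under uniformity and independence (a standard formula, not quoted from the talk; $V(R,A)$ denotes the number of distinct values of attribute $A$ in $R$):

$$|R \bowtie_{A} S| \;\approx\; \frac{|R| \cdot |S|}{\max\big(V(R,A),\, V(S,A)\big)}.$$

With skew (a few heavy join values) or correlation between join columns, the true cardinality can exceed this estimate by orders of magnitude, and multiplying several such estimated selectivities across a multi-join query compounds the error.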
01:01
And, as you add more and more relations, this underestimation gets more and more severe, which means that we end up with very aggressive query plans that turn out to be much slower when you look over the entire workload. So, in this project we asked the question: why not use bounds?
01:22
Now, using bounds is not a new idea. Our main contribution is how to tighten these bounds, or, more specifically, how to tighten the bounding formulas. However, the first thing we need to do is to understand how to generate these bounding formulas in the first place, and that's where we're going to begin. So, here's an example query; it's going to be the running example for this presentation.
01:42
Essentially, we're finding combinations of pseudonyms, or different misspellings, of an individual working in the movie industry, and companies with which that individual worked on a certain film. So, specifically, we have pseudonym, which relates those individuals to those misspellings, or pseudonyms.
02:02
Cast, which relates individuals to movies that they worked on, could be writers, could be actors, et cetera. Movie companies is similar. It relates those movies to the companies that were involved with that movie. And, finally, company name has the rest of the information about that company, because a company can work on several different movies.
02:21
We can view this query as a chain join over four relations. Two of the three joins are foreign-key/foreign-key, and one is foreign-key/key, as marked by the colored edges. For the remainder of the presentation, we're going to refer to this query using the Datalog notation that we see in the upper right.
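The slide with the Datalog notation is not reproduced in the transcript; a plausible reconstruction, using the relation names from the talk and the variables x, y, z, w that appear later (x the pseudonym, y the person, z the movie, w the company), is:

$$Q(x, y, z, w) \;\text{:-}\; \mathit{Pseudonym}(x, y),\ \mathit{Cast}(y, z),\ \mathit{MovieCompanies}(z, w),\ \mathit{CompanyName}(w, \ldots).$$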
02:44
Now, the first step to actually generating join-cardinality upper-bounding formulas is to review entropy. As a review, entropy is a mathematical quantity that describes the amount of randomness in a random variable. The first formula here is the standard formal definition of entropy.
03:06
You can make a similar definition for the joint distribution of multiple random variables. You can also define the entropy of one random variable conditioned on another. What this means is essentially we have the entropy of X given Y. What we're saying
03:20
is: what is the amount of randomness that X exhibits if you hold Y constant? If X and Y are correlated, then the entropy of X given Y should actually be less than the entropy of X on its own. Now, if we assume that X is a discrete random variable over some finite space, there are some very convenient facts we're going to exploit later.
03:43
The first of these is that the entropy of X is always less than or equal to the log of the size of the domain, log n, and this inequality is an equality if and only if the distribution is uniform. In some sense, the amount of randomness is maximized when we're presented with a uniform distribution.
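Written out, the entropy facts referenced here are (reconstructed, since the slide formulas are not in the transcript; base-2 logarithms are assumed so that exponentiating an entropy later means raising 2 to it):

$$H(X) = -\sum_{x} p(x)\,\log_2 p(x), \qquad H(X,Y) = -\sum_{x,y} p(x,y)\,\log_2 p(x,y), \qquad H(X \mid Y) = H(X,Y) - H(Y),$$

$$H(X) \;\le\; \log_2 n \quad \text{for a discrete } X \text{ over a domain of size } n,\ \text{with equality if and only if } X \text{ is uniform.}$$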
04:03
So, going back to our query: for every single attribute that we see in the query, we introduce a random variable. We're never actually going to materialize these variables or touch them; instead, they're just vehicles that are going to deliver us to the bounding formulas
04:22
that we desire. So, in that vein, let's let the joint distribution of capital X, Y, Z, and W be uniformly distributed over the true output of our query. Again, this seems backwards, but it allows us to exploit this very convenient entropic characteristic.
04:42
So, specifically, we have that the joint entropy is equal to the log of the size of the query output, or equivalently, the exponential of the joint entropy is equal to the size of the query output. Looking at this equation, it's clear that if you want to bound the size of the query, it suffices to bound the joint entropy. How do we do that? Well, we can divide the joint entropy up
05:04
into smaller pieces. Here's one example: the joint entropy is less than or equal to the entropy of X given Y, plus the joint entropy of Y and Z, plus the entropy of W given Z. Now, these little terms are very special, because each of them relates to a statistic that we can gather
05:23
from a base relation. So, for instance, look at that middle entropic term: the joint entropy of Y and Z is actually less than or equal to the log of the count of cast. Essentially, it's less than or equal to the log of the number of rows that we see in the cast relation, because the Y and Z
05:42
random variables correspond to the people and movie attributes, and those are exactly the attributes that appear in the cast relation. The conditional entropic terms, in this case the entropy of X given Y and the entropy of W given Z, are a little more complex. They are less than
06:03
or equal to the log of the max degree. So, for instance, the entropy of X given Y relates back to the pseudonym relation. Essentially, what we're saying is that this entropic term is less than or equal to the log of the number of pseudonyms or misspellings of the individual with the most pseudonyms. So, for instance, Snoop Dogg could have different pseudonyms such as Snoop,
06:25
Snoop Doggy Dogg, Snoop, or previously Snoop Lion, and those would all be valid. So, we know that's at least four. Similarly, for entropy of W given Z. And we can also refer to them using this more compact notation.
06:41
So, here's the equation that we have so far. The size of the query output is equal to the exponential of the joint entropy, which is less than or equal to the exponential of the sum of smaller entropic terms, each of which is individually less than or equal to a statistic that we can gather from a base relation. Great. What's more, we just chose this one particular, highlighted-in-red
07:04
sum of entropic terms. We could have chosen any of the other valid decompositions, each of which leads to a valid cardinality bound. So, in fact, our equation looks like this: the size of the query output is less than or equal to the minimum, the best performing, of all these different products.
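Putting the pieces together for the running example, the chain of inequalities described above is (a reconstruction from the spoken description, so the slide's exact notation may differ; here maxdeg with a subscript denotes the maximum number of rows of that relation sharing one value of the subscripted attribute):

$$|Q| \;=\; 2^{H(X,Y,Z,W)} \;\le\; 2^{\,H(X \mid Y)\, +\, H(Y,Z)\, +\, H(W \mid Z)} \;\le\; \mathrm{maxdeg}_{Y}(\mathit{Pseudonym}) \cdot |\mathit{Cast}| \cdot \mathrm{maxdeg}_{Z}(\mathit{MovieCompanies}),$$

and since every valid decomposition $d$ of the joint entropy yields such a product, the bound actually used is the minimum over all of them:

$$|Q| \;\le\; \min_{d} \prod_{\text{term} \in d} \big(\text{the count or max-degree statistic bounding that term}\big).$$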
07:20
This is known information; it's prior work. The question is: is it useful? The problem is, not yet. The cardinality bounds that we're generating this way are still too loose. Our main contribution is how to tighten them. So, let's get to that. Again, let's go back to
07:41
our example query, and let's divide the relations up into different logical partitions in the same way a hash join would: we're subdividing based on the hash value of the join attributes. Now, you can connect up the corresponding pieces from the different
08:02
relations, and you can execute the query on this, and it would generate a subset of the full output, just like a hash join. Now, what this also means is that you can take the disjoint union over all these smaller pieces, and you will, in fact, produce the full output.
08:20
This, therefore, means that if we generate a bound on each of these smaller pieces and sum up those smaller bounds, the value that we end up with will also be a bound on the full output, and this is the main insight. Because we're using more fine-grained information about the underlying data, we can generate a tighter bound
08:43
by using more and more logical partitioning. So, actually, our equation looks like this. The size of the query is less than or equal to the sum over smaller pieces of the minimum of all of these different bounding formulas with respect to those smaller pieces.
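In symbols, the partitioned bound just described is (again a reconstruction; $h$ ranges over the hash partitions and $d$ over the candidate bounding formulas):

$$|Q| \;\le\; \sum_{h} \, \min_{d} \, \mathrm{bound}_d(h),$$

where $\mathrm{bound}_d(h)$ is the bounding formula $d$ evaluated using only the count and max-degree statistics of partition $h$ of each relation.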
09:03
Some characteristics of this data model: we're going to have one bound sketch per table, and each sketch will encapsulate the statistics that we need, specifically the count and degree statistics. However, because this is similar to a hash join, and just like a hash join, it is only applicable to equijoins; we can't handle
09:22
non-equijoins. So, there are some optimizations we had to do in order to push this to practicality. We're not going to go over all of them now; you can review them in the rest of the paper, but I'll touch on them briefly. The bound formulas are each derived from a specific entropic formula.
09:44
As you add more and more relations, the number of entropic bounding formulas can grow massively. We therefore present an algorithm to actually prune this space to only focus on a practical subset of those bounding formulas. In a similar vein, as you increase the hash size, the actual
10:04
runtime of generating the bounds increases exponentially with that hash size. Moreover, there's also a very strange non-monotonic behavior, where, as you increase the hash size, we actually saw at some points that the bound would start becoming looser again. This is an unexpected byproduct of hash collisions, and therefore
10:26
we introduced a budgeting strategy that would combat both of these simultaneously. Finally, if we actually want to handle complex predicates, we have to introduce a kind of filter propagation method, similar to sideways information passing.
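To make the data model concrete, here is a minimal sketch in Python of what a per-table bound sketch and the partitioned bound computation could look like. The names (BoundSketch, partitioned_bound) and the restriction to a single binary foreign-key/foreign-key join are illustrative assumptions for brevity, not the paper's actual implementation.

```python
from collections import defaultdict

class BoundSketch:
    """Per-table sketch: count and max-degree statistics per hash partition.

    `buckets` is the number of logical partitions (the "hash size");
    `join_attr` is the position of the join attribute in each tuple.
    """
    def __init__(self, tuples, join_attr, buckets):
        self.buckets = buckets
        self.count = defaultdict(int)          # partition -> number of rows
        degree = defaultdict(int)              # (partition, join value) -> rows with that value
        for t in tuples:
            h = hash(t[join_attr]) % buckets
            self.count[h] += 1
            degree[(h, t[join_attr])] += 1
        self.max_degree = defaultdict(int)     # partition -> max rows sharing one join value
        for (h, _), d in degree.items():
            self.max_degree[h] = max(self.max_degree[h], d)

def partitioned_bound(sketch_r, sketch_s):
    """Upper bound on |R join S| over a single shared join attribute.

    For each partition we take the tighter of the two symmetric
    count * max-degree products, then sum the per-partition bounds.
    """
    assert sketch_r.buckets == sketch_s.buckets
    total = 0
    for h in range(sketch_r.buckets):
        bound_1 = sketch_r.count[h] * sketch_s.max_degree[h]   # expand each R row by S's max degree
        bound_2 = sketch_s.count[h] * sketch_r.max_degree[h]   # or vice versa
        total += min(bound_1, bound_2)
    return total

# Tiny usage example with made-up data: R(a, b) joins S(b, c) on b.
r = [(1, 'x'), (2, 'x'), (3, 'y')]
s = [('x', 10), ('x', 11), ('y', 12), ('y', 13)]
sk_r = BoundSketch(r, join_attr=1, buckets=4)
sk_s = BoundSketch(s, join_attr=0, buckets=4)
print(partitioned_bound(sk_r, sk_s))   # prints an upper bound on the true join size (which is 6)
```

The per-partition min of the two symmetric count-times-max-degree products mirrors taking the best bounding formula for each partition, and summing over partitions mirrors the disjoint-union argument above.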
10:44
Finally, we can focus on the evaluation. In order to evaluate, we needed a workload that was complex and had real-world data, which means skew, correlation, and complex filter predicates. We therefore focus on the Join Order Benchmark,
11:01
as it's fashionable now. Here we see a histogram of the relative error for our bounds and also for the estimates produced by a default Postgres instance. Notice that the relative error, on the x-axis, is plotted on a logarithmic scale.
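For reading the histogram, the usual convention (the transcript does not spell out the definition, so this is an assumption) is that the relative error of an estimate $\hat{c}$ for true cardinality $c$ is $\hat{c}/c$: underestimates fall below 1, overestimates above 1, and reflecting across 1 amounts to comparing the symmetric q-error:

$$\text{relative error} = \frac{\hat{c}}{c}, \qquad \text{q-error} = \max\!\left(\frac{\hat{c}}{c},\ \frac{c}{\hat{c}}\right).$$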
11:24
So again, we see the classic underestimation that a lot of optimizers show. If we reflect across the relative-error-equals-one axis, though, and just compare based on how far off from the true value we are, we see that the bounds we are producing
11:42
are in fact closer to the true values than the default Postgres estimates. It therefore leads us to believe that we could in fact just take those bounds and inject them directly back into the query optimizer and let the optimizer reason on those and see how those plans do.
12:00
And that's exactly what we do. This next graph shows the actual runtimes of individual Join Order Benchmark queries. This isn't every single one of the benchmark queries; we had to remove some just for visualization purposes. The gray bars each represent the average runtime for a Join Order Benchmark query, and we are sorting on the default Postgres runtime.
12:28
The blue bars represent the runtimes for those plans using bounds. Now you'll see on the left side of the graph that using bounds sometimes leads to slower queries. And this is to be
12:42
expected, because using bounds pushes the optimizer to make more conservative, safer plans, which will at times lead to longer runtimes. However, if you look at the entire workload as a whole, you'll find that using the default Postgres estimates, not the more
13:00
conservative bounds, leads to a runtime of around 3,200 seconds, whereas using bounds comes in at less than 2,000 seconds. So overall we have a gain. Now, that was only the setting where foreign-key indexes were present. Let's assume that the user was not so wise and did not
13:20
populate them ahead of time. In this case, we actually find that Postgres's plans time out very badly. We only show three bars here because, again, we're not showing every query, but five of those queries actually never finished; we had to cut them off after an hour. Even with this cutoff, we find that Postgres's aggregate runtime is over 22,000 seconds,
13:41
whereas using our bounds, those plans take just over 2,000 seconds, an order of magnitude difference. Another important thing to note is that removing the foreign-key indexes only increased the runtime of our plans by about 20 percent. This leads us to suggest that using bounds in fact produces more robust plans, and that's the main takeaway. We have gains on those
14:04
queries that are very, very slow, even disastrous, when using naive estimates, and for those queries that are already relatively fast, we are about on par. Thank you to everyone at UW for the practice talks.