
Pessimistic Cardinality Estimation


Formal Metadata

Title
Pessimistic Cardinality Estimation
Number of Parts
155
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
In this work we introduce a novel approach to the problem of cardinality estimation over multi-join queries. Our approach leverages randomized hashing and data sketching to tighten these bounds beyond the current state of the art. We demonstrate that the bounds can be injected directly into the cost-based query optimizer framework, enabling it to avoid expensive physical join plans. We outline our base data structures and methodology, and how these bounds may be introduced into the optimizer's parameterized cost function as a new statistic for physical join plan selection. We demonstrate a complex tradeoff space between the tightness of our bounds and the size and complexity of our data structures. Contrary to what one might expect, this space is not always monotonic. In order to combat this non-monotonicity, we introduce a partition budgeting scheme that guarantees monotonic behavior. We evaluate our methods on Google+ community graphs [googleplus] and the Join Order Benchmark (JOB) [Leis et al., 2015]. In the presence of foreign key indexes, we demonstrate a 1.7x improvement in aggregate physical query plan runtime (time summed over all queries in the benchmark) compared to plans chosen by Postgres using its default cardinality estimation methods. When foreign key indexes are absent, this advantage improves to over 10x.
Transcript: English (auto-generated)
Today we'll be talking about pessimistic query optimization, our project on how to generate tighter upper bounds on intermediate join cardinalities. The problem we're focusing on today is the classic problem of cardinality estimation in multi-join queries, in particular when some of the relations in a multi-join query are connected by foreign-key/foreign-key joins.
Because in that case, some of the intermediate products might actually be larger than some of the underlying base tables themselves, and this blow-up is what really threatens long runtimes. In particular, the mistakes that optimizers still make are that they assume, for instance, uniformity across attribute value distributions within a table. Real-world data shows us that in fact there is skew. They also assume independence between the distributions of joining columns. Again, real-world data shows that in fact there is correlation. These strong assumptions about the underlying data lead to underestimation. And, as you add more and more relations, this underestimation gets more and more severe, which means that we end up with very aggressive query plans that end up being much slower when you look over the entire workload; a small worked example of this compounding is sketched below. So, in this project we asked the question: why not use bounds?
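To see how these assumptions bite, here is a hedged, textbook-style illustration (the exact estimator Postgres uses differs in its details, and the numbers below are invented purely for illustration). The classical equijoin estimate divides by the larger distinct count, which is only accurate under uniformity and independence:

\[
|R \bowtie_A S| \;\approx\; \frac{|R|\cdot|S|}{\max\big(V(R,A),\,V(S,A)\big)}
\]

If \(|R| = |S| = 10{,}000\) and each side has \(1{,}000\) distinct values of \(A\), the estimate is \(10^8 / 10^3 = 10^5\) rows. But if the data is skewed so that one popular value accounts for \(1{,}000\) rows on each side, that single value alone already contributes \(10^6\) joined rows, a 10x underestimate from just one join; chaining several such joins multiplies the error.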
Now, using bounds is not a new idea. Our main contribution is how to tighten these bounds, or, more specifically, how to tighten these bounding formulas. However, the first thing we need to do is to understand how to generate these bounding formulas in the first place, and that's where we're going to begin. So, here's an example query. It's going to be the running example for this presentation.
Essentially, what we're doing is just we're finding combinations of pseudonyms, or different misspellings of an individual working in the movie industry, and companies with which that individual worked on a certain film. So, specifically, we have pseudonym, which relates those individuals to those misspellings, or pseudonyms.
Cast, which relates individuals to movies that they worked on, could be writers, could be actors, et cetera. Movie companies is similar. It relates those movies to the companies that were involved with that movie. And, finally, company name has the rest of the information about that company, because a company can work on several different movies.
We can view this query as a chain join with four relations. Two of the three joins are foreign-key/foreign-key, and one is foreign-key/key, as marked by the colored edges. For the remainder of the presentation, we're going to refer to this query using the Datalog notation that we see in the upper right.
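For readers who cannot see the slide, the rule would read roughly as follows (the relation abbreviations and variable letters are our reconstruction from the talk, not necessarily the exact notation on the slide): Q(x, y, z, w) :- P(x, y), C(y, z), MC(z, w), CN(w), where P is pseudonym, C is cast, MC is movie companies, CN is company name, and x, y, z, w range over pseudonyms, people, movies, and companies, respectively.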
Now, the first step to actually generating join cardinality upper-bounding formulas is to review entropy. As a review, entropy is a mathematical quantity that describes the amount of randomness in a random variable; the first formula on the slide is the standard, exact definition of entropy. You can make a similar definition for the joint distribution of multiple random variables, and you can also make a definition for conditional random variables. What conditional entropy means, essentially, when we write the entropy of X given Y, is: what is the amount of randomness that X displays if you hold Y constant? If X and Y are correlated, then the entropy of X given Y will be less than the entropy of X on its own. Now, if we assume that X is a discrete random variable over some finite space, there are some very convenient facts we're going to exploit later.
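For reference, since the slide formulas are not reproduced in this transcript, the standard definitions being referred to are:

\[
H(X) = -\sum_{x} p(x)\log_2 p(x), \qquad
H(X,Y) = -\sum_{x,y} p(x,y)\log_2 p(x,y), \qquad
H(X \mid Y) = H(X,Y) - H(Y),
\]

and conditioning never increases entropy, so \(H(X \mid Y) \le H(X)\), with the gap being larger the more strongly \(X\) and \(Y\) are correlated.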
The first of those convenient facts is that the entropy of X is always less than or equal to the log of the size of the domain, log n, and this inequality is an equality if and only if the distribution is uniform. In some sense, the amount of randomness is maximized when we're presented with a uniform distribution. So, going back to our query: for every single attribute that we see in the query, we introduce a random variable. We're never actually going to be using these variables directly, never going to be touching them; instead, they're just vehicles that are going to deliver us to the bounding formulas that we desire. So, in that vein, let's let the joint distribution of capital X, Y, Z, and W be uniformly distributed over the true output of our query. Again, this seems backwards, but it allows us to exploit this very convenient entropic characteristic.
So, specifically, we have that the joint entropy is equal to the log of the size of the query output, or equivalently, the exponential of the joint entropy is equal to the size of the query output. Looking at this equation, it's clear that if you want to bound the size of the query, it suffices to bound the joint entropy. How do we do that? Well, we can divide it up into smaller pieces. Here's one example: the joint entropy is less than or equal to the entropy of X given Y, plus the joint entropy of Y and Z, plus the entropy of W given Z. Now, these little terms are very special, because each of them relates to a statistic that we can gather from a base relation. For instance, the middle entropic term, the joint entropy of Y and Z, is less than or equal to the log of the count of cast, essentially the log of the number of rows in the cast relation, because Y and Z correspond to the people and movie attributes, which are exactly the attributes that appear in cast. The conditional entropic terms, in this case the entropy of X given Y and the entropy of W given Z, are a little more complex: each is less than or equal to the log of a maximum degree. For instance, the entropy of X given Y relates back to the pseudonym relation, and what we're saying is that this entropic term is less than or equal to the log of the number of pseudonyms, or misspellings, of the individual with the most of them. For instance, Snoop Dogg could have different pseudonyms such as Snoop, Snoop Doggy Dogg, or previously Snoop Lion, among others, and those would all be valid, so we know that degree is at least four. The entropy of W given Z is bounded similarly, by the maximum degree in movie companies. We can also refer to these statistics using a more compact notation.
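Chaining these facts together for the running example gives the bound the speaker is describing (the shorthand here, writing |C| for the row count of cast and deg for the maximum degrees, is ours rather than the slide's compact notation):

\[
|Q| \;=\; 2^{\,H(X,Y,Z,W)} \;\le\; 2^{\,H(X\mid Y) + H(Y,Z) + H(W\mid Z)} \;\le\; \deg_{P}(X\mid Y)\cdot|C|\cdot\deg_{MC}(W\mid Z),
\]

where \(\deg_{P}(X\mid Y)\) is the maximum number of pseudonyms of any one person and \(\deg_{MC}(W\mid Z)\) is the maximum number of companies attached to any one movie.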
So, here's the equation that we have so far: the size of the query is equal to the exponential of the joint entropy, which is less than or equal to the exponential of a sum of small entropic terms, each of which individually is less than or equal to a statistic that we can gather from a relation. What's more, we just chose the one sum of entropic terms highlighted in red on the slide; we could have chosen any of several valid decompositions, each of which leads to a valid cardinality bound. So, in fact, our equation looks like this: the size of the query is less than or equal to the minimum, the best performing, of all of these different products.
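In the same informal notation, letting \(\mathcal{D}\) range over the valid entropic decompositions, the bound so far is:

\[
|Q| \;\le\; \min_{\mathcal{D}} \prod_{t \in \mathcal{D}} \mathrm{stat}(t),
\]

where each \(\mathrm{stat}(t)\) is a row count or a maximum degree gathered from a single base relation.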
This is known information; it's prior work. The question is: is it useful? The problem is, not yet. These cardinality bounds are still too loose, and our main contribution is how to tighten them. So, let's get that going. Again, let's go back to our example query, and let's divide each relation up into logical partitions in the same way a hash join does: we subdivide based on the hash value of the join attributes. Now, you can connect up the corresponding pieces from the different relations and execute the query on them, and that would generate a subset of the full output, just like a hash join. What this also means is that you can take the disjoint union over all of these smaller pieces, and you will, in fact, produce the full output.
This therefore means that if we generate a bound on each of these smaller pieces and sum up those smaller bounds, the value we end up with will also be a bound on the full output, and this is the main insight: because we're using more and more fine-grained information about the underlying data, we can generate a tighter bound by using more and more logical partitioning. So our equation actually looks like this: the size of the query is less than or equal to the sum, over the smaller pieces, of the minimum of all of these different bounding formulas applied to those smaller pieces.
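Again in the same informal notation, with \(h\) ranging over the hash partitions, the partitioned bound is:

\[
|Q| \;\le\; \sum_{h} \;\min_{\mathcal{D}} \prod_{t \in \mathcal{D}} \mathrm{stat}_h(t),
\]

where \(\mathrm{stat}_h(t)\) is the same count or maximum-degree statistic restricted to the tuples in partition \(h\).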
Some characteristics of this data model: we're going to have one bound sketch per table, and each sketch will encapsulate the statistics that we need, specifically the count and degree statistics. However, because this works like a hash join, it is, just like a hash join, only applicable to equijoins; we can't handle non-equi joins.
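To make these data structures concrete, here is a minimal, hypothetical Python sketch for a single two-table equijoin rather than the full four-relation chain; the kind of statistics matches the talk (per-partition counts and maximum degrees), but all names, the partition count, and the toy data are our own illustration, not the authors' implementation.

```python
from collections import defaultdict

NUM_PARTITIONS = 8  # hash "size": more partitions means finer-grained statistics


def build_sketch(rows, join_col):
    """Build a toy bound sketch for one table.

    rows: list of tuples; join_col: index of the join attribute.
    Returns {partition: {"count": n, "max_degree": d}}, where max_degree is
    the largest number of rows sharing a single join-attribute value.
    """
    per_value = defaultdict(int)
    for row in rows:
        per_value[row[join_col]] += 1

    sketch = defaultdict(lambda: {"count": 0, "max_degree": 0})
    for value, n in per_value.items():
        part = hash(value) % NUM_PARTITIONS
        sketch[part]["count"] += n
        sketch[part]["max_degree"] = max(sketch[part]["max_degree"], n)
    return dict(sketch)


def partitioned_join_bound(sketch_r, sketch_s):
    """Upper bound on |R JOIN S| (equijoin on the sketched attribute).

    Within each partition, both count(R) * max_degree(S) and
    max_degree(R) * count(S) are valid upper bounds, so take their minimum
    and sum the per-partition minima.
    """
    empty = {"count": 0, "max_degree": 0}
    total = 0
    for part in set(sketch_r) | set(sketch_s):
        r = sketch_r.get(part, empty)
        s = sketch_s.get(part, empty)
        total += min(r["count"] * s["max_degree"],
                     r["max_degree"] * s["count"])
    return total


if __name__ == "__main__":
    # Skewed toy data: person 1 has five pseudonyms and four cast entries.
    pseudonym = [(f"alias{i}", 1) for i in range(5)] + [("alias_x", 2)]
    cast = [(1, f"movie{i}") for i in range(4)] + [(2, "movie0")]
    bound = partitioned_join_bound(build_sketch(pseudonym, join_col=1),
                                   build_sketch(cast, join_col=0))
    true_size = sum(1 for _, p in pseudonym for q, _ in cast if p == q)
    print(f"bound = {bound}, true join size = {true_size}")
```

On this skewed toy data the partitioned bound comes out tighter than the corresponding unpartitioned min(count x max-degree) formula, which is exactly the effect the logical partitioning is after.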
So, there are some optimizations we had to do in order to push this to practicality. We're not going to go over all of them now; you can review them in the rest of the paper. I'll just touch on them briefly. The bound formulas are each derived from a specific entropic formula.
As you add more and more relations, the number of entropic bounding formulas can grow massively. We therefore present an algorithm to prune this space and focus only on a practical subset of those bounding formulas. In a similar vein, as you increase the hash size, the runtime of generating the bounds increases exponentially with that hash size. Moreover, there is also a very strange non-monotonic behavior: as you increase the hash size, we saw at some points that the bound would actually start becoming looser again. This is an unexpected byproduct of hash collisions, and we therefore introduced a budgeting strategy that combats both of these issues simultaneously. Finally, if we want to handle complex predicates, we have to introduce a kind of filter propagation method, similar to sideways information passing; a hedged sketch of that general idea follows.
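The paper's actual filter-propagation machinery is not spelled out in this talk; purely to illustrate the general flavour of sideways information passing over these sketches, one could imagine helpers like the following, reusing build_sketch and partitioned_join_bound from the sketch above (the function names and logic here are our own invention, not the authors' method).

```python
def filtered_sketch(rows, join_col, predicate):
    """Build a sketch over only the rows satisfying a filter predicate, so the
    per-partition counts and degrees already reflect the filter."""
    return build_sketch([row for row in rows if predicate(row)], join_col)


def propagate_partitions(sketch, surviving_partitions):
    """Sideways-information-passing flavour: drop partitions that cannot match
    any surviving partition on the other side of the join, which tightens every
    bounding formula that later uses this sketch."""
    return {p: stats for p, stats in sketch.items() if p in surviving_partitions}


# Example: filter cast to a subset of movies, then prune the pseudonym sketch
# to the partitions that can still participate in the join.
pseudonym = [(f"alias{i}", 1) for i in range(5)] + [("alias_x", 2)]
cast = [(1, f"movie{i}") for i in range(4)] + [(2, "movie0")]

filtered_cast = filtered_sketch(cast, join_col=0,
                                predicate=lambda row: row[1] != "movie0")
pruned_pseudonym = propagate_partitions(build_sketch(pseudonym, join_col=1),
                                        set(filtered_cast))
print(partitioned_join_bound(pruned_pseudonym, filtered_cast))
```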
Finally, we can turn to the evaluation. In order to evaluate, we needed to find a workload that was complex and had real-world data, which means we would have skew, correlation, and complex filter predicates. We therefore also focus on the Join Order Benchmark, as is fashionable now. Here we see a histogram of the relative error for our bounds and also for the estimates produced by a default Postgres instance. Notice that the relative error, on the x-axis, is on a logarithmic scale.
So again, we see this classic underestimation that a lot of optimizers show. If we reflect across the relative error equals one axis, though, and really just want to only compare based on how far off of the true value we are, we see that the bounds we are producing
are in fact closer to the true values than the default Postgres estimates. It therefore leads us to believe that we could in fact just take those bounds and inject them directly back into the query optimizer and let the optimizer reason on those and see how those plans do.
And that's exactly what we do. This next graph shows the actual runtimes of individual Join Order Benchmark queries. This isn't every single one of the benchmark queries; we had to remove some purely for visualization purposes. The gray bars each represent the runtime, or I should say the average runtime, of a Join Order Benchmark query, and we are sorting on the default Postgres runtime.
The blue bars represent the runtimes for those plans using bounds. Now you'll see on the left side of the graph that using bounds sometimes leads to slower queries. And this is to be
expected, because using bounds pushes the optimizer to make more conservative, safer plans, which will at times lead to longer runtimes. However, if you look at the entire workload as a whole, you'll find that using the default Postgres estimates, rather than the more conservative bounds, leads to a total runtime of around 3,200 seconds, whereas the plans chosen using bounds take less than 2,000 seconds. So overall we have a gain. Now, that was only the setting where foreign-key indexes were present. Let's assume that the user was not wise and did not populate them ahead of time. In this case, we find that Postgres's plans time out very badly. In fact, five of those queries never finish (we only show three bars here because, again, we're not showing every query); we had to cut them off after an hour. Even with this cutoff, we find that Postgres's aggregate runtime is over 22,000 seconds, whereas using our bounds, those plans take just over 2,000, an order of magnitude difference. Another important thing to note is that removing the foreign-key indexes only increased our aggregate runtime by about 20 percent. This leads us to suggest that using bounds in fact produces more robust plans, and that's the main takeaway: we have gains on those queries that are very slow, even disastrous, when using naive estimates, and those queries that are already relatively fast stay about on par. Thank you to everyone at UW for the practice talks.