On an Information Theoretic Approach to Cardinality Estimation
Formal Metadata

Title: On an Information Theoretic Approach to Cardinality Estimation
Number of Parts: 13
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/58130 (DOI)
Language: English
Transcript: English (auto-generated)
00:03
Hello. Thanks for your interest in this talk. First of all, I'd like to thank Dan Olteanu and the ICDT 2022 Program Committee for this honor. I am delighted to have the opportunity to share with you a research direction that I have been deeply involved in,
00:23
in the past decade or so, both from the theoretical side and from the implementation side. I lead the design and implementation of the query optimizer at Relational AI. And thus, naturally, I had to study all major problems in the query optimization pipeline.
00:43
And cardinality estimation is one such problem. Along with colleagues at Relational AI and research collaborators, we studied this problem in depth. And in this talk, I'd like to share with you an approach that we have come up with.
01:01
A variant of these ideas is implemented and working in production at Relational AI. Studying cardinality estimation has been a very rewarding journey. And I hope you will find some of these problems worthwhile to pursue on your own. The content of this talk is based on joint work with many collaborators.
01:27
Mahmoud Abo Khamis at Relational AI. Sungjin Im at UC Merced. Hossein Keshavarz at Relational AI. Phokion Kolaitis, who needs no introduction in this community.
01:42
Ben Moseley at CMU. Long Nguyen at the University of Michigan. Kirk Pruhs at the University of Pittsburgh. And Dan Suciu, who also needs no introduction. And Alireza Samadian Zakaria, who is now at Google.
02:03
The talk is going to be divided into four sections. And I will start by briefly motivating the cardinality estimation problem. Then I will explain a simple yet powerful notion of degree constraints.
02:21
Leading to entropy and polymatroids. And the third section covers the so-called frequency moment constraints and histograms. And how we do cardinality estimation with them. Finally, I will conclude with a brief mention of research questions.
02:40
So let's start with the cardinality estimation problem and its challenges. Roughly speaking, cardinality estimation is the following problem. We have a query, a database, which we have gathered some kind of statistical profile of.
03:02
Either online or offline, which is a lot smaller than the database. And before answering the query, we would like to output an estimate of the number of tuples in the output without actually computing the query.
03:21
And computing this estimate, Q-hat, should be really fast and accurate using a small storage. And I'm going to use the notation S of D to denote the statistical profile of the database D. In particular, computing Q-hat should be much faster than computing the query itself.
03:47
At a very high level, there are two types of estimators. The worst case estimator attempts to estimate the worst case output size over all databases D prime.
04:01
That satisfies exactly the same constraints or the same statistical profile as that of D. And in the average case, or in the model-based type of estimators, we would like to assume that the database, the unknown database, is drawn based on some distribution D.
04:24
And we want to estimate the expected number of output tuples over all such databases D prime, drawn from D, subject to or conditioned on exactly the same constraints, the same statistical profile.
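In symbols, one way to write these two types of estimators (this is only my transcription of the definitions just described, with S(D) the stored profile and script D the assumed distribution):

```latex
\hat{Q}_{\mathrm{worst}}\bigl(S(D)\bigr) \;=\; \sup_{D'\,:\,S(D')=S(D)} \bigl|Q(D')\bigr|,
\qquad
\hat{Q}_{\mathrm{avg}}\bigl(S(D)\bigr) \;=\; \mathbb{E}_{D'\sim\mathcal{D}}\Bigl[\;\bigl|Q(D')\bigr| \;\Big|\; S(D')=S(D)\Bigr].
```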
04:44
Now, cardinality estimation is a very important problem. It is a key component of the cost estimator for query optimizer or for the domain decomposer in parallel query processing. It can also be used to compute a budget to decide whether or not in-memory data processing is feasible.
05:06
Because if we can guarantee that all of the intermediate results are upper bounded by the memory size, then in-memory data processing is feasible. And from this point of view, the worst case estimator is especially important.
05:25
Now, you don't have to take my word for it. You can see that cardinality estimation is important just by reading the documentation of major relational DBMSes out there, such as Microsoft SQL Server.
05:43
You will see that the cardinality estimate is one of the major cost components in a commercial RDBMS. At the same time, this is a very, very challenging problem. After 60 years of building relational database management systems, we still run into many issues with the cardinality estimates.
06:08
And in the words of Guy Lohman, who was one of the main architects of the DB2 query optimizer, you can see that the estimator can easily introduce errors of many orders of magnitude, and those tend to lead to really bad query plans.
06:31
Another source for estimating how bad the cardinality estimators can be is this famous paper from Viktor Leis and others,
06:45
where it was shown that many database systems have estimators which mis-estimate cardinalities by factors of thousands or more. And typically, they are underestimations.
07:02
And in fact, sometimes by even writing the same query, but in a different order, we would end up with estimates that are wildly different from one another. So why is it that this is such a difficult problem?
07:20
Practically, again, by reading technical documentation from commercial database systems, you can find out some of the practical reasons for that. Comparison operators, the fact that there are no statistics, some assumptions being wrong, built-in functions being hard to deal with, and so on.
07:44
But there are some well-known even theoretical reasons that cardinality estimation is a really, really hard problem. So fundamentally, cardinality estimation is a lossy compression problem. So given that the profile is a lot smaller than the database itself, with similar profiles, there could be very different output cardinalities
08:09
when the underlying data are different. Now sometimes, even the mathematically wrong estimator can actually work pretty well on some benchmark.
08:21
And the reason for that is because we typically use cardinality estimation as a proxy for runtime estimation. And it may be that the wrong cardinality estimate actually estimated the runtime correctly on some benchmark. It is also well known that estimation errors tend to accumulate exponentially with the query size.
08:46
So this can lead to really bad query plans when the queries are large. And in production systems, we very often see queries with tens, if not close to 100, of predicates that we have to join together.
09:04
Another theoretical problem is cross-join correlation. This is really difficult to capture with statistical profiles, which only capture statistics from a single relation at a time.
09:21
So it cannot possibly know, in this query for instance, whether the values A = 1 and B = 10 are strongly correlated or not, because the profiles are stored only per input relation, S and T. Some sampling might be able to overcome this problem, but sampling is expensive on its own.
09:48
And then last but not least, user defined functions are typically very challenging to deal with because they tend to be arbitrary. And so the statistical characteristics of these functions are not known in advance.
10:06
Now, there are also some less well known reasons why cardinality estimation is difficult. In particular, if you haven't implemented a cardinality estimator before or an optimizer before,
10:21
you probably would not realize some of these difficulties. So, for example, there are many choices for a statistical profile. You can think about histograms and some product networks or numeric histograms or tries for strings,
10:43
bigrams, trigrams and so on. We can also do sampling with different sampling strategies. We can have statistical models and machine learning models and so on. However, out of these many, many choices, we have to be careful about which choice we implement and incorporate into an optimizer.
11:07
Because often there's a need to be able to infer the same type of profile through different queries. So, for example, you might have a super query which has many subqueries.
11:20
And we would like to infer the statistical profile of the super query from its sub queries. And this propagates through the entire pipeline. This need is typically not mentioned in papers that people write on cardinality estimation.
11:42
Another, probably a little less well known, reason for estimation to be difficult is that we really would like cardinality estimation to be order invariant. So, if we were to estimate the output size of Q2 of D and then estimate the output size of Q1 on top of Q2 of D,
12:06
we would like for that estimate to be the same as in the case when we apply Q1 first and then Q2. If these two queries, or these two operators, happen to commute.
12:23
So, for example, if both are CQs, or conjunctive queries, they tend to commute. And now, if the estimator is not designed properly, then we may end up with different answers. And this means that using the cardinality estimator for costing query plans is going to be misleading due to this order dependency.
12:54
Another difficulty is that estimator ideally should incorporate all available information.
13:01
So, in fact, as far as I know, most of the traditional estimators don't make use of functional dependencies at all. Which seems to be a big miss, because functional dependencies are one of the major pieces of information that are available to us in the data. There should not be any reason we don't make use of them.
13:27
So, these problems are well known, so I only briefly went through them. And there are some notable references, and you can read these papers along with the references that they mentioned to see the landscape of this problem.
13:46
Now, in terms of existing approaches to cardinality estimation, there are three main approaches. So, one is based on building an offline synopsis like a histogram. The second one is based on sampling.
14:02
So, here we are allowed to peek into the data after we see the query, with different sampling strategies. And more recently, machine learning methods started to show up in cardinality estimation.
14:22
Now, in this talk, I'm going to focus only on the first approach where we would like to come up with a cardinality estimate based on offline information such as histogram. So, traditionally, if you open any textbook on cardinality estimation, it is typically based on estimating the selectivity.
14:49
And you will see formulas of this type that involve a couple of pretty strong assumptions, independence and uniformity, along with some default values when the statistics are not available.
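To make the flavor of these textbook formulas concrete, here is a minimal sketch in Python; the table size, the distinct counts, and the 0.1 default are made-up illustrations, not values from the talk:

```python
# A minimal sketch of a textbook selectivity-based estimate (illustrative values only).

def eq_selectivity(distinct_count):
    """Uniformity assumption: P(col = c) = 1 / (number of distinct values of col)."""
    return 1.0 / distinct_count if distinct_count else 0.1  # 0.1 plays the role of a "magic constant"

def conjunction_selectivity(selectivities):
    """Independence assumption: selectivities of individual predicates simply multiply."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result

# Example: |R| = 10_000, predicate R.a = 5 with 100 distinct a-values, and R.b = 'x' with no stats on b.
estimate = 10_000 * conjunction_selectivity([eq_selectivity(100), eq_selectivity(None)])
print(estimate)  # 10_000 * (1/100) * 0.1 = 10.0
```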
15:04
So, these are fairly well known techniques, I'm not going to go through them in any detail. Now, the key advantages of the existing approaches are that, well, first of all, they work well enough for many workloads.
15:21
After all, the database industry is about $60 billion a year industry, last I checked in 2020. But at the same time, there are many known shortcomings of existing approaches that are well documented in the papers that I cited just now.
15:43
So, here are some of the major ones. First of all, they are not naturally multi-way join estimators. They are based on this selectivity estimation, which estimates one join predicate at a time.
16:00
Some of the assumptions may be too strong, and this leads to the exponential decay or exponential increase in the error. There is typically no statistical guarantee, nothing that we can prove about the quality of the estimate.
16:20
There are magic constants, as you can see, and this means there's a certain lack of robustness because we cannot guarantee anything due to the underestimation and the error accumulation. So, it is difficult to incorporate more knowledge and constraints into the query estimation.
16:48
For example, I mentioned the functional dependency, which is not used in most of the more commonly used cardinality estimator. But most important of all, as a theorist, what bothers me the most is probably the lack of a mathematical framework for cardinality estimation.
17:11
Now, this is not to say that there has been no theoretical research on cardinality estimation. But in many cases, it was meant to address a particular subproblem within the framework, as shown here.
17:30
But what I'm referring to is the framework itself: there's no formal way to prove any kind of guarantee for the selectivity estimation framework.
17:49
So naturally, as a database theorist, the question we should ask ourselves is, how can database theory be more useful, be more helpful in addressing these challenges?
18:02
There are extremely few cardinality estimation papers at PODS and ICDT. Some of the papers that I found were from the 90s, and it would be great if the community takes a deeper look into this really interesting and important problem.
18:23
So, roughly speaking, our contribution in this talk is a new information theoretic approach, which is a mathematical framework for a synopsis-based cardinality estimator. It is a multi-way join estimator.
18:42
It is going to be, for example, order invariant. It solves one of the problems that I mentioned just now. It is model-free, and it's a worst-case estimator. For this reason, it tends to be more robust and less prone to bad query plans when the data change.
19:04
And it exploits in a very principled way all of the commonly available information, including histograms or functional dependencies. And it has interesting connections to information theory, optimization, statistics, and database theory.
19:28
In the second part of the talk, let us dive into a particular type of statistical profile called degree constraints, and how we are able to use entropies and polymatroids to come up with worst-case cardinality estimators for those constraints.
19:50
So, our estimator is going to be a worst-case or model-free estimator. What that means is that it tries to answer the question, what is the maximum output
20:04
size over all databases having the same summary statistics S(D) as the ones we stored. Mathematically, this is the supremum of the size of Q of D prime over all D prime satisfying the same statistical profile as that of D.
20:28
And the first question one has to answer is, what exactly is in the statistical profile S(D) to begin with? And in this part of the talk, I'm going to describe a particular type of profile called degree constraints,
20:47
which are generalizations of functional dependency constraints and cardinality constraints. And in the next section, I'm going to discuss more general types of constraints and how we deal with them.
21:02
So, this notion of degree constraints was introduced in a couple of papers with Mahmoud Abo Khamis and Dan Suciu. Imagine we have a relation over three columns, actor, movie, and role.
21:21
And let's say we have this hypothetical frequency vector, d_actor, which could in theory involve billions of entries. We are not going to compute this vector; it only has a mathematical definition.
21:41
Now in this vector, we are going to count, for every actor, the number of rows in this table that the actor occurs in. So for example, the entry for Alice is one, the entry for Carol is two, because Carol occurs twice in the table, and the entry for Bob is four.
22:06
And the entry for v is zero for every v that's not Alice, Bob, or Carol. So suppose that we have this frequency vector; then the degree constraints are meant to capture the infinity norm of such a vector.
22:27
So for example, in the above example, the infinity norm of d_actor is four. And that is called the degree constraint. Now we don't have to fix the actor column. We can fix, for example, actor and movie together and construct the d_{actor,movie} frequency vector.
22:53
Now in that case, the infinity norm is going to count for every fixed pair actor and movie, the number of rows that the actor plays in the movie.
23:07
Now suppose we know that that number is one; then this essentially says that there's a functional dependency from actor and movie to role. We can also condition on nothing. So instead of fixing one column like actor, or two columns like actor and movie, we can fix nothing.
23:31
In that case the degree vector consists of a single entry, the count of everything, which means that the infinity norm is the cardinality of the table itself.
23:48
So as you can see from this example alone, degree constraints are a very general notion of constraint. They capture cardinality constraints and functional dependencies.
24:01
Now, in general, a degree constraint is a triple (X, Y, n), where X is a subset of attributes, Y is a subset of attributes, and n is an integer. And in a relation R, what this constraint says is that if we were to fix X to be a particular tuple x, then the number of different Y-tuples is at most n.
24:29
So when we project the selection down to Y, the count is at most n for every x.
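To make this concrete, here is a small sketch on my own toy version of the actor/movie/role table; the `degree` helper is hypothetical, and it simply computes the infinity norm of the frequency vector d_{Y|X}:

```python
# Checking degree constraints (X, Y, n): for every X-tuple x, the number of distinct
# Y-tuples that co-occur with x in R is at most n.  Toy data mirroring the example above.
from collections import defaultdict

R = [  # (actor, movie, role)
    ("Alice", "M1", "hero"),
    ("Carol", "M1", "villain"), ("Carol", "M2", "villain"),
    ("Bob", "M1", "extra"), ("Bob", "M2", "extra"), ("Bob", "M3", "hero"), ("Bob", "M4", "extra"),
]
COLS = {"actor": 0, "movie": 1, "role": 2}

def degree(rel, X, Y):
    """Infinity norm of d_{Y|X}: max over x of the number of distinct Y-tuples paired with x."""
    groups = defaultdict(set)
    for t in rel:
        x = tuple(t[COLS[c]] for c in X)
        groups[x].add(tuple(t[COLS[c]] for c in Y))
    return max(len(ys) for ys in groups.values())

print(degree(R, [], ["actor", "movie", "role"]))  # 7 -> cardinality constraint: |R| <= 7
print(degree(R, ["actor"], ["movie", "role"]))    # 4 -> every actor occurs in at most 4 rows
print(degree(R, ["actor", "movie"], ["role"]))    # 1 -> functional dependency actor, movie -> role
```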
24:42
Now, given such degree constraints, then how do we come up with a worst case cardinality estimator? And the idea is that we are going to estimate this quantity directly, not via estimating selectivity.
25:03
So we're going to solve this optimization problem. You can immediately see the difficulties because there are infinitely many databases which can satisfy these constraints. How do we even go about solving such a problem? So the approach is based on something called the entropy argument, and it goes roughly along this line.
25:30
And this was based on very early on work by Fan Chung and Ron Graham and Shearer in the 80s. The argument goes like this.
25:43
So let's say we fix an arbitrary database D' which is unknown. And we select a uniformly random tuple from the output of the query evaluated on this unknown database D'. So here the output is a full table over all the query variables.
26:02
And because of uniformity, we know that log of the output size is equal to the entropy over all variables of that distribution. Where h is the entropy function of the joint distribution, which is just the uniform distribution over this hypothetical output.
26:25
Now, so the idea is that instead of trying to maximize Q of D', we are going to try to maximize h of V, the joint entropy of all the variables, over new constraints.
26:42
So before we were trying to maximize log of the size of Q on D', we are now going to optimize this over a new set of constraints. That we somehow magically can infer that h has to satisfy given that h was constructed as above.
27:05
And the essence of the idea was that we were able to convert an optimization problem over all databases into an optimization problem over entropic functions satisfying certain constraints.
27:23
Now note that we do not assume uniformity. This is just a counting argument to prove this equality. This is a model-free estimator. We do not assume the data to be uniformly distributed.
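My transcription of that counting argument, where V denotes the set of all query variables:

```latex
\log\bigl|Q(D')\bigr| \;=\; H(V)
\quad\text{for } H \text{ the entropy of the uniform distribution over } Q(D'),
\qquad\text{hence}\qquad
\sup_{D'} \log\bigl|Q(D')\bigr|
\;\le\;
\max\bigl\{\, h(V) \;:\; h \text{ entropic and } h \text{ satisfies the inferred constraints} \,\bigr\}.
```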
27:41
Now, so that's at a very high level, but in order to come up with the estimator itself, we have to be able to compute or solve this optimization problem. Now, unfortunately, the entropic functions are a notoriously difficult class of functions to characterize.
28:04
So the next step is to relax these functions. These are set functions, because for every subset of variables, there is a non-negative real number, which is the entropy of the marginal on those variables.
28:22
So we are going to relax the requirement that h has to be entropic to the more relaxed conditions that h has to be something called a polymatroid. A polymatroid is a set function, so it's coming from the same space of functions, but it satisfies explicit constraints.
28:49
These are called the Shannon axioms. These are very well-known axioms in information theory that the entropy of the empty set is zero, entropy is a monotonic function, and it satisfies submodularity.
29:06
And it is known that the set of polymatroid functions over a subset of variables is a superset of the set of entropy functions. And what this means is that when we optimize over the entropic function, if we were to
29:26
replace this constraint by the polymatroid functions, then we will get an upper bound on this bound. But on the plus side, the polymatroid constraints are linear constraints, so we can solve the
29:45
optimization problem using linear programming, if the other constraints are also linear, as you will see. So we are going to use gamma n to denote the set of polymatroid functions on n variables.
30:05
So here's one example of all of the Shannon axioms written down in the case of three variables. Let's say our query has three variables, a, b, and c. If we were to write them all down, we would have so many constraints.
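For reference, the polymatroid (Shannon) conditions for the three-variable case can be written compactly as follows; spelling out all of the individual instances gives the long list on the slide:

```latex
h(\emptyset) = 0, \qquad
h(X) \le h(Y) \;\;\text{for all } X \subseteq Y \subseteq \{A,B,C\} \;\;\text{(monotonicity)},
\\
h(X) + h(Y) \ge h(X\cup Y) + h(X\cap Y) \;\;\text{for all } X, Y \subseteq \{A,B,C\} \;\;\text{(submodularity)}.
```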
30:25
These are the polymatroid constraints or the Shannon axiom constraints. So in the paper with Mahmoud and Dan Suciu from PODS 2017, we show the following: given that the statistical profile contains only the degree constraints,
30:50
then we can write down the constraints for the entropic function H as follows.
31:03
These are a set of linear constraints, and they are of the form: the conditional entropy of Y given X is less than or equal to log n, for every degree constraint (X, Y, n).
31:20
And the conditional entropy is defined to be conditional entropy of A given B is the entropy of A union B minus entropy of B. That's just a shorthand notation for that formula. And so now you can see that these constraints are all linear constraints.
31:43
And so along with gamma n, which is polyhedral, we can now solve this linear program to come up with our cardinality estimator. So next, we are going to consider a few examples. First example is the triangle query. And let's say that we have in the statistical profiles, the sizes of R, S, and T.
32:11
These, as we mentioned earlier, the cardinality constraints are a special case of degree constraints. Now, in this case, suppose we write down all of the constraints from the previous theorem, then our estimator is going to be
32:30
the maximum of H of ABC, subject to H satisfying all of the Shannon axioms and the cardinality constraints that look like this.
32:43
And if we solve this problem, we will end up with the estimator that is n to the 1.5. This is also the AGM bound, or the fractional edge cover bound.
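Here is a small sketch of that linear program using scipy; the relation size n = 1000 and all function and variable names are my own choices, meant only to mirror the construction described in the talk:

```python
# Maximize h(ABC) over polymatroids h subject to h(AB) <= log|R|, h(BC) <= log|S|, h(AC) <= log|T|.
from itertools import combinations
from math import log2
from scipy.optimize import linprog

VARS = "ABC"
SUBSETS = [frozenset(c) for k in range(len(VARS) + 1) for c in combinations(VARS, k)]
IDX = {s: i for i, s in enumerate(SUBSETS)}  # h is a vector with one coordinate per subset of variables

def row(terms):
    """One inequality row sum(coef * h(subset)) <= bound, in linprog's A_ub format."""
    r = [0.0] * len(SUBSETS)
    for coef, s in terms:
        r[IDX[frozenset(s)]] += coef
    return r

A_ub, b_ub = [], []
for S in SUBSETS:                 # monotonicity: h(S) - h(T) <= 0 whenever S is a subset of T
    for T in SUBSETS:
        if S < T:
            A_ub.append(row([(1, S), (-1, T)])); b_ub.append(0.0)
for X in SUBSETS:                 # submodularity: h(X u Y) + h(X n Y) - h(X) - h(Y) <= 0
    for Y in SUBSETS:
        A_ub.append(row([(1, X | Y), (1, X & Y), (-1, X), (-1, Y)])); b_ub.append(0.0)

n = 1000                          # |R| = |S| = |T| = n, a made-up size for illustration
for rel in ("AB", "BC", "AC"):    # cardinality constraints are degree constraints with X = emptyset
    A_ub.append(row([(1, rel)])); b_ub.append(log2(n))

c = [0.0] * len(SUBSETS)
c[IDX[frozenset("ABC")]] = -1.0   # maximize h(ABC)  <=>  minimize -h(ABC)
A_eq, b_eq = [row([(1, "")])], [0.0]   # h(emptyset) = 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * len(SUBSETS))
print(2 ** (-res.fun))            # about 31622.8 = 1000**1.5: the n^1.5 bound mentioned above
```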
33:01
Now, suppose that we have the same query, but we know of one more thing, which is that there's a functional dependency from B to C. So the query looks identical as before, but from the schema information from the database, we know this. Then here in the linear programming problem, we have an extra constraint that says that the conditional entropy of C given B is zero.
33:32
And solving this, we will end up with an estimator that is a lot smaller than the previous estimator. Now it's going to be something along the line of linear in the input size.
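Continuing the sketch above (it reuses `row`, `A_ub`, `b_ub`, `c`, `A_eq`, `b_eq`, and `SUBSETS` from that block), the functional dependency B -> C becomes a single extra linear constraint:

```python
# Functional dependency B -> C:  h(C | B) = h(BC) - h(B) = 0, encoded as h(BC) - h(B) <= 0
# (the reverse direction already follows from monotonicity).
A_ub.append(row([(1, "BC"), (-1, "B")])); b_ub.append(0.0)
res_fd = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * len(SUBSETS))
print(2 ** (-res_fd.fun))  # about 1000 = n: the bound drops to roughly linear in the input size
```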
33:46
And this is the first example where you can see that functional dependencies come into the picture in a very natural way in our mathematical framework. We can use the same ideas to bound the output size of queries such as, let's say, the join of R(A) and S(B) with the condition A + B = 5.
34:09
It is easy to see as a human that the output size can be bounded by the minimum of R and S. Because if you know A, then you know B, since the function is invertible.
34:24
If you know B, then you know A, because A is equal to five minus B. And it turns out that our estimator following this framework is going to give exactly the same answer. It's going to estimate the output size to be the minimum of R and S.
34:43
The worst case output size, I meant. There are many examples where the bound that we come up with is pretty non-trivial. And here is one of them. I'm going to skip this example, but I hope you can
35:01
pause the video or look at the slides afterwards to analyze this example in your own time. So one can build a pretty decent query optimizer based on the above constraints and estimator already.
35:20
Believe me, I've done so. However, there is more information in the database that we want to exploit. And we would like the mathematical framework to capture that. However, the major issue we have to worry about is that if we want to keep turning the constraints on D
35:42
prime into constraints on the entropic function, then we have to be really careful in working out how that might be done. And that is going to be the subject of the next part of the presentation. In the next part of the talk, I'm going to describe a new notion of constraints
36:11
which generalize the degree constraints and how we can incorporate those constraints into the histogram framework.
36:23
So again, let's go back to this picture and imagine having this frequency vector on the column actor as before. Now, it seems pretty natural to record, in addition to the infinity norm, other norms of this
36:42
frequency vector, because after all, the frequency vector is the empirical marginal distribution on the actor column. And we would like to be able to record other statistical properties of this marginal.
37:06
And it's very natural to put in here, in addition to the infinity norm, the zero norm, the one norm, the two norms, and so on. And these are called the frequency moment constraints.
37:24
So, for example, the zero norm is just a distinct count. And what it says is that on the column actor, there are three actors in it. The one norm is just the sum of the frequencies, which is exactly the number of rows in a table, and so on.
37:44
More generally, given a relation R and two subsets of columns Y and X, we can define the conditional frequency vector d of Y given X by this quantity. This is a vector indexed by tuples x, and it counts the number of Y-tuples given that X is fixed to the tuple x in R.
38:13
For short, we sometimes write d of X for the case when Y is not given; the entry is then the size of the selection.
38:26
And the frequency moments are summary quantities that are computed from these conditional frequency vectors. The moments are parameterized by an extra parameter P, which is from zero to infinity.
38:45
And it is basically defined to be the P norm of this vector when P is zero, one or infinity. Otherwise, it is the P norm to the power P.
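A small sketch of these moments on my own toy data (the helper names are hypothetical); the vector computed below is the per-actor frequency vector 1, 2, 4 from the earlier example:

```python
# Frequency moments |d|_p of a conditional frequency vector d_{Y|X}, following the convention
# in the talk: the 0, 1, and infinity norms are taken as-is, otherwise the p-norm raised to p.

def frequency_vector(rel, x_cols, y_cols):
    """d_{Y|X}: for each X-tuple, the number of distinct Y-tuples co-occurring with it."""
    ys_per_x = {}
    for t in rel:
        x = tuple(t[c] for c in x_cols)
        ys_per_x.setdefault(x, set()).add(tuple(t[c] for c in y_cols))
    return [len(ys) for ys in ys_per_x.values()]

def moment(d, p):
    if p == 0:
        return sum(1 for v in d if v > 0)   # distinct count
    if p == float("inf"):
        return max(d)                       # maximum degree (the degree constraint)
    if p == 1:
        return sum(d)                       # total count, i.e. the number of rows here
    return sum(v ** p for v in d)           # p-th frequency moment

R = [("Alice", "M1"), ("Carol", "M1"), ("Carol", "M2"),
     ("Bob", "M1"), ("Bob", "M2"), ("Bob", "M3"), ("Bob", "M4")]
d = frequency_vector(R, x_cols=[0], y_cols=[1])  # condition on the actor column
print(moment(d, 0), moment(d, 1), moment(d, 2), moment(d, float("inf")))  # 3 7 21 4
```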
39:01
The frequency moment constraints capture very common statistics that we typically store in a database. Functional dependencies, cardinality constraints, and degree constraints are all special cases of frequency moment constraints.
39:25
The distinct count is a frequency moment constraint with the zero norm. Sometimes we save the heavy hitters. In that case, this infinity norm counts the maximum number of occurrences of a given X in the table, and so on.
39:44
These are very commonly used statistics in the databases. Now, how do we incorporate the frequency moment constraints into histograms?
40:00
Consider the case when we have a table over X, Y, and Z. The way the histograms are constructed typically goes as follows. We fix some column of this table, let's say X. We partition a domain of X into intervals, B1, B2, up to Bk.
40:23
And K in practice is roughly about 200 buckets or so. This is done in MSSQL server. The heavy hitters are singled out to be in their own singleton buckets.
40:40
For the rest of the buckets, we compute them using an equi-depth histogram computation. So now the bucketization gives rise to per-bucket frequency vectors. These are just the conditional frequency vectors conditioned on the fact that the X values are in a particular bucket.
41:10
Which means that for this vector, anything outside of the bucket is counted as zero. And given the per-bucket frequency vectors, we can now compute the summary statistics of them.
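Here is a rough, simplified sketch of that bucketization; the bucket count, the heavy-hitter threshold, and the toy data are all my own, and real systems are considerably more careful about how they do this:

```python
# Heavy hitters get singleton buckets; the remaining values are split into roughly
# equal-weight (equi-depth) buckets; each bucket keeps a few per-bucket frequency moments.
from collections import Counter

def build_histogram(values, num_buckets=8, heavy_fraction=0.25):
    freq = Counter(values)
    heavy = {v for v, c in freq.items() if c >= heavy_fraction * len(values)}
    buckets = [[v] for v in sorted(heavy)]                    # singleton buckets for heavy hitters
    rest = sorted(v for v in freq if v not in heavy)
    target = max(1, sum(freq[v] for v in rest) // max(1, num_buckets - len(buckets)))
    cur, weight = [], 0
    for v in rest:                                            # close a bucket once it holds
        cur.append(v); weight += freq[v]                      # roughly `target` rows
        if weight >= target:
            buckets.append(cur); cur, weight = [], 0
    if cur:
        buckets.append(cur)
    # Per-bucket summary: (values, L0 = #distinct, L1 = #rows, Linf = max frequency).
    return [(b, len(b), sum(freq[v] for v in b), max(freq[v] for v in b)) for b in buckets]

for bucket, l0, l1, linf in build_histogram([1, 1, 1, 1, 2, 3, 3, 4, 5, 6, 7, 7], num_buckets=4):
    print(bucket, l0, l1, linf)   # e.g. [1] 1 4 4, then [2, 3] 2 3 2, [4, 5] 2 2 1, [6, 7] 2 3 2
```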
41:29
These are going to be called histogram frequency moment constraints. And they are indexed by the bucketization B, the sets of attributes X and Y.
41:44
And then the frequency moment of each per-bucket vector is going to be bounded by c sub B for every bucket B. So basically every bucket has its own upper bound.
42:04
Now, as you can see, the information that we have available is now a lot more intricate than before. So the question is, how do we even turn a statistical profile like this into constraints on the polymatroid h?
42:23
And it turns out that in joint work with Keshavarz and Long Nguyen, we were able to prove the analog of the previous result on the degree constraints. But this time we are able to deal with the histogram frequency moment constraints.
42:47
The constraints are quite a bit more complicated. But the optimization problem is still over the set of entropic functions, over a different space. So it is a little bit larger than the space over the original number of variables in the query.
43:06
But because this is still over the space of entropic functions, we can still relax the estimator to be over the polymatroids, which means we can compute this quantity by solving an explicit optimization problem.
43:26
So this result is a little difficult to state, especially in the amount of time we have left. I'm going to give you an example, a very, very simple example, which hopefully already conveys the main
43:40
intuition behind how we are able to turn the frequency moment constraints into constraints on entropy functions. So here is an example. Let's consider the very, very simple query, R(X, Y) joined with S(Y, Z).
44:03
And let's just say that we have a bucketization on the variable Y on the R side, and we also have the same bucketization on the S side. This is an extremely strong assumption. In practice, because of equi-depth partitioning, we
44:23
will never have two tables with exactly the same bucketization on the same attribute. However, for simplicity, let's assume that for now. And let's say we have the frequency vectors on the r side, denoted by f sub y, and on the s side denoted by g sub y.
44:48
And let's say that we have the following statistics in the profile. The first one is the infinity norm per bucket on the R side, bounded by r_B for every bucket B.
45:06
And then we also have the zero norm, which counts the number of distinct values, distinct y values in the bucket B. And then we have the number of different z's per y in the same bucket B to be bounded by sB.
45:29
So for every bucket, we have these three quantities now: r_B, d_B, and s_B. And the question is, how do we use this information to bound the output size of R joined with S?
45:47
So the reasoning goes something along this line. So consider the uniform distribution on x, y, z chosen from the output of this join.
46:01
But now we will introduce a new random variable, which is the bucket that this random output tuple belongs to. So let J be the categorical random variable, where J is set to be the bucket B if Y is in the bucket B.
46:24
And let's say p sub B is the probability that J is equal to B. So then we know the following constraint. We know that for this distribution, which is the uniform distribution chosen on the output of the query,
46:43
if we know Y, then we know J. Because if we know Y, then we know which bucket it belongs to, which means we know the value of this categorical variable J. And that means that the conditional entropy of J given Y is equal to zero.
47:04
Now, secondly, conditioned on the chosen bucket being B, we know that the entropy of Y is going to be bounded by the log of the number of distinct values in B,
47:23
which is just the zero norm that was provided. Similarly, conditioned on J being equal to B, the entropy of X is going to be bounded by the log of the infinity norm of the frequency vector on the R side.
47:44
And finally, conditioned on B, the entropy of Z is bounded by log of s sub B. We also know by definition that the entropy of J is the sum over B of p_B log of one over p_B.
48:06
This is just the definition of entropy. And of course, because p forms the probability mass function of a distribution, its one norm has to be one and it has to be non-negative. So these are all of the constraints that we know that h has to satisfy in addition to the fact that it has to be an entropy function.
48:32
In fact, we can simplify the constraints a little further by instead of writing constraints condition on particular buckets like this,
48:44
we can write them in aggregate by writing down the conditional entropy of Y given J. This is just a weighted average of the conditional entropies of Y given J equal to B. And from the previous slide, we know that this is basically the dot product of log d and p.
49:08
And when we write all of these down, we now have a set of linear constraints on H. But we also have some nonlinearity in p, which is a new set of variables that we have to deal with in this optimization problem.
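Collected in one place, my transcription of the constraint set reads as follows; I have made the conditioning on Y explicit in the constraints involving r_B and s_B, since those bounds were defined per y value, and the objective is to maximize h(XYZ):

```latex
h(J \mid Y) = 0, \qquad
h(Y \mid J) \;\le\; \sum_{B} p_B \log d_B, \qquad
h(X \mid Y, J) \;\le\; \sum_{B} p_B \log r_B, \qquad
h(Z \mid Y, J) \;\le\; \sum_{B} p_B \log s_B,
\\
h(J) \;=\; \sum_{B} p_B \log \tfrac{1}{p_B}, \qquad
\sum_{B} p_B = 1, \qquad p_B \ge 0 \;\text{ for all } B.
```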
49:31
So the problem turns out to be a concave optimization problem, which has the following form. And there are several ways one can solve this problem. For example, we can solve this with alternating maximization:
49:48
fix p, then we solve a linear program in h; and then fix h, and we can solve the concave optimization problem in p, and so on. But we are going to leave the question of how to compute this cardinality estimate efficiently for a different talk.
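A generic skeleton of that alternating scheme might look as follows; the two callables are placeholders for the actual subproblems, not the authors' implementation:

```python
# Alternating maximization for the concave program over (h, p): fix p and solve the linear
# program in the entropy variables h; then fix h and maximize over the bucket probabilities p.
def alternating_maximization(solve_lp_in_h, solve_concave_in_p, p0, iters=50, tol=1e-6):
    p, prev = p0, float("-inf")
    for _ in range(iters):
        h, value = solve_lp_in_h(p)        # e.g. scipy.optimize.linprog with p plugged into the bounds
        p = solve_concave_in_p(h)          # e.g. maximize the sum_B p_B log(1/p_B) term over the simplex
        if value - prev < tol:             # stop once the objective no longer improves
            break
        prev = value
    return h, p, value
```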
50:09
So that gives you a sense of how we were able to derive the cardinality estimation in the case of frequency moment constraints.
50:25
And in order to remove the simplifying assumption, we can do the following. So remember that the simplifying assumption was that the bucketization on the R side and on the S side are exactly the same.
50:40
But if they were different, then we would take the finer partitioning, which is the common refinement of the buckets from both sides, and we get a new bucketization. Of course, under this new bucketization, we actually don't know the frequency moment constraints,
51:01
because these buckets form a finer partitioning of the domain. But what we can do is introduce new variables representing the frequency moments, and then we just write down the constraints that these variables have to satisfy
51:21
from the input frequency moment constraints. And that's how we were able to come up with the new optimization problem. So that's it. That was the end of how we were able to very formally define a new class of worst case, model-free cardinality estimators
51:44
based on inputs, which are very general: they are frequency moment constraints. Now from here, there are many interesting research problems, and I'm not going to have time to go through them.
52:00
If you're interested, you can read the companion article that was submitted along with this talk. But roughly speaking, there are three classes of open research questions. One is on the computability of the entropy bound and its relationship to information inequalities. Secondly, the question of how to compute the polymatroid bound efficiently
52:30
is a very interesting open question. And thirdly, we can venture into probabilistic guarantees of these estimates,
52:42
where we can take into account more interesting frequency moment norms for different norm values. So in the interest of time, I'm going to skip these few slides. They are described again
53:01
in my companion paper, and I can take any question that you may have. Thanks for listening.