Probabilistic Databases with an Infinite Open-World Assumption
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 155 | |
Author | ||
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/42867 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
SIGMOD 201923 / 155
34
37
44
116
120
122
144
148
155
00:00
Customer relationship managementMagneto-optical driveGrohe, MartinDatabaseInfinityOpen setDatabaseInfinityOpen setTable (information)Lecture/Conference
00:21
DatabaseDomain nameType theoryMotion captureData modelModel theorySubsetFinite setInstance (computer science)Universe (mathematics)SpacetimeEvent horizonTupleSingle-precision floating-point formatIndependence (probability theory)Set (mathematics)Closed setOpen setSet (mathematics)Endliche ModelltheorieContext awarenessDatabaseAttribute grammarDivisorString (computer science)Marginal distributionSound effectStatement (computer science)Multiplication signSingle-precision floating-point formatProbability spaceTheorySubsetEvent horizonWell-formed formulaElectronic mailing listPay televisionGenderUniverse (mathematics)Probability distributionInformationSemantics (computer science)Scripting languageCartesian coordinate systemExtension (kinesiology)SpacetimeType theorySigma-algebraObject (grammar)Closed setIndependence (probability theory)Sampling (statistics)Mögliche-Welten-SemantikNumbering schemeWahrscheinlichkeitsmaßTable (information)Row (database)Blackboard systemQuery languageProduct (business)Domain nameRepresentation (politics)CASE <Informatik>Instance (computer science)MeasurementInfinityNatural numberReal numberAlgebraMoment (mathematics)Correspondence (mathematics)Physical systemMessage passingRevision controlOpen setNumberAuthorizationTupleFinite setConstructor (object-oriented programming)Computer animation
08:21
Data modelComplete metric spaceInformationMechatronicsIndependence (probability theory)InfinityExistenceFinite setExtension (kinesiology)TheoremQuery languageApproximationPerformance appraisalSemantics (computer science)Compact spaceP (complexity)Hidden surface determinationComputer fileQuery languageSinc functionTheoremDatabasePoint (geometry)Block (periodic table)Endliche ModelltheorieDirection (geometry)Set (mathematics)Different (Kate Ryan album)Axiom of choiceIndependence (probability theory)Complete metric spaceExtension (kinesiology)Finite setPerformance appraisalInformationProbability spaceApproximationCASE <Informatik>MappingSigma-algebraBoolean algebraDistribution (mathematics)Context awarenessOrder (biology)Sampling (statistics)Condition numberArithmetic progressionSemantics (computer science)Category of beingSeries (mathematics)Product (business)Form (programming)Computer configurationExistenceInfinite setMarginal distributionWahrscheinlichkeitsmaßP (complexity)Programming languageInfinityUniverse (mathematics)Mögliche-Welten-SemantikClosed setSoftware frameworkInstance (computer science)SpacetimeOpen setResultantRight angleArithmetic meanCartesian coordinate systemSound effectPrime idealMeasurementMultiplicationNeuroinformatikLimit of a functionRouting1 (number)Computer animation
16:22
Lecture/Conference
Transcript: English(auto-generated)
00:05
I will talk about probabilistic databases with an infinite open world assumption. And this is joint work with Martin Grohe at the Avitae Ha'ahun University. So kind of the running example for this talk will be this table where we can think of it
00:24
as a list of persons that we have in our database somehow with their name and their gender and their height. And well, in the application I am talking about, we are confronted with various types of uncertainty
00:41
and of course various types of uncertain domains. And as you might see and maybe know, many of these domains that naturally occur are typically infinite. For example, we have three types of domains here. We have the domain of strings
01:00
for the name of the persons, we have some categorical data like the gender and we have the height which might be a positive integer or perhaps more fitting, a real number. So, and there's also another kind of uncertainty here which is there might be tuples existing in the table
01:22
that I have not listed here. So there might be additional information that we are missing. And what I claim is that the custom model of probabilistic databases that has been used in theory so far, the very general one, does not capture these issues very well. So what I will do is introduce a more broad model
01:44
and to do so let me first fix some database schema and some universe to work with and also let me tell you I'm considering set semantics here and a database in the end would just be a collection of facts.
02:00
And I will denote facts by such a blackboard F letter. So this is the set of facts over the schema and this universe and an example of such a fact is the statement that Peter is a male of height 185. And as I said, a database instance is now a finite subset of such sets. So for example, this collection of two facts
02:22
is a database instance. And when it comes to probabilistic databases, one does not consider a single database instance alone but a collection of database instances. So this is this blackboard D here. It's a set of database instances over the set of facts. And this is commonly referred to as a set of possible worlds.
02:42
And so in the general model, in the common model, the set of possible worlds is usually finite. If we move now to an infinite context where the attributes might not only be from over countable domains but also of uncountable domains,
03:01
we additionally need a suitable sigma algebra on this set of database instances. But I will not go into the details that emerge from the problems with measurability of events in sigma algebras here. And so for the moment, we can stick to thinking of possible worlds
03:22
and construct probabilistic databases as probability distributions over such sets of possible worlds. So a probabilistic database in this sense is simply a probability space whose sample space, the blackboard D, is a set of possible worlds
03:40
and we have a probability measure on that. Now, we will distinguish such probabilistic databases being finite or countable or uncountable if the corresponding sample space is. So in particular, when I say an infinite probabilistic database, I still mean that every database instance
04:00
within the set of possible worlds is a finite object but there are just infinitely many options. And for the following, there are a few events that are quite important. So the first thing would be for every individual fact F, the marginal event that the fact F is contained
04:21
in an instance that is drawn at random from the probabilistic database. And I will denote this by this calligraphic D with subscript F. And the straightforward generalization to sets, so script D with subscript capital F is the event that at least one fact of F is contained in a randomly drawn database instance.
04:45
Now, when it comes to finite probabilistic databases, a very common model of representing probabilistic databases uses independence assumptions. And in the case of finite probabilistic databases, the finite tuple independence assumptions
05:02
simply asserts that all occurrences of tuples in the probabilistic database are stochastically independent. So this would mean that the probability of any intersection of such marginal events can be calculated as a product of the individual marginal probabilities.
05:23
And for that reason, we can represent finite tuple independent probabilistic databases by such tables where we annotate each row of the table with a probability. And the probability of a single instance is then given as the probability over those tuples being contained in the database instance
05:41
times the counter-probabilities of those tuples that are not contained in the database instance. And well, what I want to do now is to broaden that notion to a more general context. So a probabilistic database, not necessarily a finite one, will be called tuple independent
06:02
if all the collections F of disjoint sets of facts are independent. So as a formula, you can see it here, but colloquially spoken, it means that whenever I take a collection of disjoint sub-databases of my database, then these will be independent.
06:22
And again, I can calculate these probabilities by multiplying probabilities of smaller events. To compare, this was the definition for finite tuple independence, which basically asserts that all rows of the table are independent, and the message is
06:40
that these two definitions actually coincide whenever we have countable probabilistic databases. So now let me tell you why I'm interested in these kind of probabilistic databases, and the reason for this is what is known as the closed-world assumption.
07:00
So we have again this tuple independent probabilistic database, the example from before, and the following has been noted three years in the past. So I quote, the closed-world assumption of probabilistic databases, that facts not in the database have probability zero, clearly conflicts with their everyday use.
07:22
So this means every information that is not present in this representation system is perceived as an impossible event in the probability space. So for example, we have here a probability of zero, that there are two females. We have a probability of zero, that there are more than four people in the database, and so on.
07:43
And of course, this conflicts with our intuition as soon as we add new facts to the database or want to obtain generally meaningful pre-re-answers. And the authors of the paper, I quote, actually introduced a model of open-world
08:01
probabilistic databases, which is called OpenPDBs, but this model is still restricted to a finite context. So in particular, a priori, they have an upper bound on the number of tuples in such a probabilistic database. And well, what we wanted to do
08:22
is to have a model at hand that would allow our probabilistic databases to grow arbitrarily large. And the basic idea behind what we are doing is still the same. So we have our original probabilistic database and want to obtain a completion.
08:40
And by completion, I mean we take the original probabilistic data and add a model for those facts that have not been specified before. And well, we don't want to do this in an arbitrary way. So of course, we want to preserve somehow the original information that was already there in the probabilistic data to begin with.
09:03
And as for the model for unspecified facts, we are talking about big probability spaces here, infinite ones, so we want to use independence assumptions to keep these probability spaces simple. A more formal definition of this concept would be that a completion of a probabilistic database
09:22
is another probabilistic database, delta prime, with the property that the new sample space we take is now the set of all instances over the set of facts. So every possible world that we can build from our schema and our universe will be contained in the new sample space.
09:40
And the new probability measure satisfies this preservation condition that whenever we condition the probability measure on the set of old instances, we actually have the old probability measure. So this is the idea behind completing probabilistic database
10:02
towards the open world assumption. And of course, also the closed world assumption somehow falls into this framework by just thinking of the completion where every new fact gets the probability zero. But of course, we have a lot of options there. Now, when it comes to these infinite
10:22
kind of independence assumptions and probabilistic databases, it is naturally to ask whether they even exist. And in particular, in the finite case, what is very nice is that any collection of facts and probabilities associated to those facts will span a tuple-independent probabilistic database.
10:43
And it is not clear at all whether the same holds for an infinite setting. So the first thing is what about the set of facts being countably infinite? And here, we obtain this characterization theorem. So whenever the set of facts is countable,
11:02
then the following two conditions are equivalent. Indeed, a tuple-independent probabilistic database is spanned from our fact probabilities if and only if these fact probabilities form a convergent series. So under the hood, the reason behind this is for our products to converge
11:21
and our definition to actually span a probability space. So everything I have told you to this point has also a natural extension to an infinite block-independent disjoint model where we would chop up the data into different blocks and have the independence assumption between the blocks
11:43
but excluding mutually exclusive choices within the blocks. So there's also a natural extension of this model. And even one direction of this theorem can be, it does hold in the other direction
12:00
for the uncountable setting. But of course, in the uncountable setting, we cannot span the database from starting with fact probabilities since in an uncountable probability space, every single marginal probability would be zero. So towards the application of this,
12:21
the previous theorem, this characterization theorem, actually tells us that we are able to specify a model of countable tuple-independent completions of probabilistic databases, which we hope to use to obtain open-world answers to queries.
12:41
And in particular, while we have that model in place, it is not clear whether this actually helps something. So is it at all possible to compute query answers in such a model? And indeed, we can, for every positive epsilon, compute an additive approximation to query answers. So I mean we can approximate the probability
13:03
of Boolean queries, but we can also approximate the marginal probability of answers in the query result. Now, and so far, this is only a very naive way in doing so by using a finite evaluation method,
13:22
and this can be, the precision can be achieved by cutting off the tail of the distribution at the appropriate place. Now, one might wonder why we opted for additive approximations here, and the simple reason is that multiplicative approximations
13:44
cannot be computed in this kind of context. So a countable tuple-independent model is too expressive for allowing multiplicative approximation. We can actually reduce it to the emptiness problem.
14:02
And well, so let me conclude. What I have presented to you is our first step towards a general notion of infinite probabilistic databases. In particular, we wanted to broaden the notions that there are for various independence assumptions, so tuple-independence and the block-independent
14:20
disjoint model, and what we want this to do is to serve as a general proposal for open-world query evaluation of probabilistic databases. So of course, there are a lot of things to be done. So the first thing that we wonder about
14:41
is a sensible semantics for the uncountable case. So in the uncountable setting, a lot of things get nasty very quickly. So for example, we have the sigma algebra that we need to worry about, and it is not even clear how to have a sensible notion of query semantics.
15:01
So what we would want to have is a query to map probabilistic databases to probabilistic databases. And for such mappings to have any meaning, these mappings have to be measurable with respect to the sigma algebras that we use for the instance base. Otherwise, we have no sensible notion of a probability being attached to sets of possible worlds.
15:26
And the other question is of course, how do we compactly represent such kind of probabilistic databases in order to be able to perform tractable query evaluation? And well, the approach we are thinking of here is to take a finite model of probabilistic databases
15:45
as for the original probabilistic database and to carry out the completion in some kind of a probabilistic programming language. And of course, one might wonder whether it is in this context helpful
16:01
to think of uncountable independence assumptions as well. So these points that I just mentioned are what we are working on right now, and we actually solved the problem of the uncountable query semantics, which is currently under review, and well, the other things are work in progress.
16:21
So thank you very much.