
Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers


Formal Metadata

Title
Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers
Series Title
Number of Parts
155
Author
License
CC Attribution 3.0 Germany:
You may use, modify, and reproduce, distribute, and make the work or its content publicly available in unchanged or modified form for any legal purpose, provided that you credit the author/rights holder in the manner they specify.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UA-DBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UA-DBs are based on incomplete K-relations, which we introduce to generalize the classical set-based notion of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.
Transcript: English (automatically generated)
My name is Su, from the Illinois Institute of Technology. Our title is Uncertainty Annotated Databases, or UA-DBs, and this is a collaboration with the University at Buffalo. Uncertainty is everywhere; for example, different sensors observing the same target may give you different readings.
Data cleaning may also introduce uncertainty, and information extraction tasks, like searching for a street name, may return multiple options. So, for example, if we are unsure about which address is correct, then we can record all possible options.
In this case, we do not know if State Street is in Illinois, Indiana, or California. By the way, this specific representation is called x-tuples. If we have a list of those options, then every row represents a tuple, and every row with multiple options represents a decision.
Each combination of decisions across the rows represents a possible instance, and we call it a possible world. For example, choosing the marked options results in the possible world on the right-hand side.
So in principle, we can enumerate all possible worlds, for example these three possible worlds and so on, and then we get a naive representation of the incomplete database model. We can also assign a probability to each possible world, in which case we call it a
probabilistic database. Of course, we want to run queries over it. Naively, we can do this by just running the query over each possible world to get all possible outcomes. For example, if we only have three possible worlds and we run a distinct projection on state,
then we get the corresponding result for each possible world. If probabilities are assigned, then we can add up probabilities when two or more worlds produce an identical result. However, this is completely impractical, since the number of possible worlds can be exponential in the number of
decisions we have to make about the data. So people have come up with several ways to deal with this more effectively. The first is working on the full probabilistic database, which is a very hard problem. There is already extensive work that does a good job of compactly representing probabilistic databases and querying over them
with relatively better performance, and most of it is either lineage based or sampling based. However, those systems still need to track all, or a large set of, possible answers, which is still a performance bottleneck.
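The naive semantics just described can be sketched in a few lines of Python; the table contents here are invented for illustration:

```python
from itertools import product

# Each x-tuple is a list of alternatives for one row (exactly one
# alternative is chosen); certain rows have a single alternative.
xtuples = [
    [("State St", "IL"), ("State St", "IN"), ("State St", "CA")],
    [("Main St", "NY")],
]

# A possible world fixes one alternative per x-tuple, so the number of
# worlds is exponential in the number of decisions.
worlds = [set(choice) for choice in product(*xtuples)]

# Naive query evaluation: run the query (here a distinct projection on
# the state attribute) separately over every possible world.
results = [{state for _street, state in world} for world in worlds]

print(len(worlds))  # 3
print(results)      # three result sets, one per world
```

Even this toy example needs one query run per world; with n independent decisions the loop would run over up to 2^n worlds, which is the blowup the talk refers to.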
Here is a performance comparison between conventional query processing and the two types of probabilistic databases we just introduced. We can see that in the best case, probabilistic query processing is three times slower than conventional query processing, and as uncertainty increases, it can be up to 300 times slower in our test case.
So, probabilistic databases are very expressive, since they track all the possible answers and their probabilities. But doing so is also a lot of work, which makes them very slow. So what if we do not keep probabilities and, since the set of possible answers can be very large, what if we only keep certain answers?
Certain answers are the answers common to all possible worlds. For example, if we consider ID as one of the attributes, then these two tuples exist in every possible world, and so they are the certain answers.
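As a sketch, with invented per-world query results, the certain answers are simply the intersection across all worlds:

```python
from functools import reduce

# Query results in each possible world (invented data; the first column
# plays the role of the row ID mentioned above).
world_results = [
    {("s1", "IL"), ("s2", "NY"), ("s3", "AZ")},
    {("s1", "IN"), ("s2", "NY"), ("s3", "AZ")},
    {("s1", "CA"), ("s2", "NY"), ("s3", "AZ")},
]

# Certain answers: the tuples present in every possible world.
certain = reduce(set.intersection, world_results)
print(certain)  # the two tuples shared by all three worlds
```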
However, exact certain answers are still expensive to compute, because you still have to know all possible outcomes. Thanks to previous work, we can compute approximated certain answers efficiently, and
the answer is guaranteed to be a lower bound, which means it is a subset of the actual certain answers. So we have approximated certain answers, which are efficient, but they do not track all possible answers or their probabilities.
Furthermore, since certain answers only give you a subset of the query results, we want to know how bad this is going to be in practice. To test that, recall that in reality there is always a ground truth that is fundamentally correct. Of course the ground truth is unknown,
which is why we model it as a probabilistic database. But let's assume we know it. Then the certain answers are guaranteed to be a subset of the ground truth, but they could be only a very conservative approximation of it. So if we know the ground truth,
we can measure the utility, in terms of the precision and recall of the certain answers against the ground truth, versus the amount of uncertainty, as shown on the graph. In one sentence: certain answers have high precision and low recall.
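Sketching this measurement with invented sets: because certain answers are a subset of the ground truth, precision is perfect, but recall suffers.

```python
# Invented ground truth and certain answers, for illustration only.
ground_truth = {("s1", "IL"), ("s2", "NY"), ("s3", "AZ"), ("s4", "CA")}
certain = {("s2", "NY"), ("s3", "AZ")}  # always a subset of the truth

true_positives = len(certain & ground_truth)
precision = true_positives / len(certain)    # 1.0: every answer is correct
recall = true_positives / len(ground_truth)  # 0.5: half the truth is dropped
print(precision, recall)
```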
So certain answers lack utility by dropping all merely possible tuples. These are the principled approaches so far, and they both give you answers that you can trust. Unfortunately, nobody uses them in practice.
In practice, especially in many data cleaning tasks, people just make heuristic choices and totally ignore uncertainty afterwards. This is equivalent to picking one possible world, which we call the best-guess world. The downside of this approach is that we lose all information about uncertainty,
so we have no confidence in any answer we see. Although you cannot trust any answer from the best-guess world, you get efficiency by working on only one world, and you get utility because heuristics bring you closer to the ground truth.
There is no way to get both efficiency and expressiveness at the same time, and since in practice people want efficiency more, probabilistic databases are out. As for the other two approaches, one gives up utility and one gives up trustworthiness.
With our approach, UA-DBs, we achieve efficiency, trustworthiness, and utility. For this, we have to give up expressiveness, which means we are not tracking all possible answers. The obvious solution is to combine approximated certain answers
with the best guess. We notice that the approximated certain answers are always a subset of the best-guess world. So we can use heuristics to find a best-guess world, and we mark each tuple of the best-guess world as certain or uncertain, where in our case blue means certain and red means uncertain.
Remember that the approximation of the certain answers is a subset of the actual certain answers, which means every tuple we mark as certain is guaranteed to be certain, but the tuples we mark as uncertain may or may not actually be uncertain.
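A minimal sketch of this labeling step, with invented relations: each tuple of the best-guess world is flagged certain exactly when it appears in the under-approximation.

```python
# Best-guess world chosen by heuristics, and an under-approximation of
# the certain answers (both invented for illustration).
best_guess = [("State St", "NY"), ("Oak St", "IL"), ("5th Ave", "AZ")]
under_approx = {("Oak St", "IL"), ("5th Ave", "AZ")}

# UA-DB: the best-guess world plus a certainty label per tuple.
# True ("blue") = guaranteed certain; False ("red") = possibly uncertain.
ua_db = [(t, t in under_approx) for t in best_guess]

for t, is_certain in ua_db:
    print(t, "certain" if is_certain else "uncertain")
```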
Either way, we are always bounding the actual certain answers between the under-approximation and the best guess. This works because the certain answers must be a subset of every possible world. Now, uncertainty comes from different sources, and a lot of existing work has developed ways of modeling uncertainty. We do not want to
reinvent the wheel, so we want to be able to translate existing models of uncertainty into UA-DBs. In our paper, we present translations from three common representations, namely tuple-independent databases, C-tables, and x-tuples. Adding new models is also easy.
Once we have a UA-DB instance, querying over it is lightweight and easy to implement. Our system is based on query rewriting, so the rewritten query can run on any conventional database system, like Postgres or Oracle.
To use a relational database, we need a way to encode a UA-DB as relations. We implement the color labeling as a boolean attribute at the end of the table, where false means uncertain and true means certain.
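One possible flavor of such a rewriting, sketched with Python's built-in SQLite (table and data invented, not the paper's actual rewrite rules): with the certainty flag stored as 0/1, a distinct projection can be rewritten as a GROUP BY that takes the MAX of the flags, so an output row is marked certain as soon as one contributing input row is certain.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE addr (street TEXT, state TEXT, certain INTEGER)")
con.executemany(
    "INSERT INTO addr VALUES (?, ?, ?)",
    [("State St", "NY", 0), ("Main St", "NY", 0),  # both uncertain
     ("Oak St", "IL", 1), ("Elm St", "IL", 0),     # one certain
     ("5th Ave", "AZ", 1)],                        # certain
)

# SELECT DISTINCT state, rewritten to propagate the certainty label:
rows = con.execute(
    "SELECT state, MAX(certain) FROM addr GROUP BY state ORDER BY state"
).fetchall()
print(rows)  # [('AZ', 1), ('IL', 1), ('NY', 0)]
```

The rewritten query is ordinary SQL, which is what lets the annotated query run unchanged on a conventional engine.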
Given this representation, how do we write queries that operate over it? Let's look at an example query, a projection on state. Here is the result on the best-guess world. The main question is: what is the correct annotation that preserves the bounding?
For the last tuple, Arizona, it is clear that there is exactly one, certain, Arizona tuple in the input, so the output must contain the Arizona tuple, and we label it as certain. For the second tuple, Illinois, there are two Illinois tuples in the input that correspond to it,
where one of them is certain and one of them is uncertain. Since one of the Illinois tuples in the input is certain, then no matter whether the other Illinois tuple exists or not, we will always have this tuple in the output, so we also label it as certain. For the first tuple, New York, both of the inputs
are uncertain, so the output New York tuple can be missing when both of its uncertain inputs are missing. In this case, we label it as uncertain. For another example, consider a join, showcasing only the join of these two specific tuples:
if we join an uncertain tuple with a certain tuple, the output tuple is uncertain, since as long as one of the joining tuples is missing, the output tuple will not exist. Now, many in the audience may find this example very familiar,
and that is because we are using K-relations, which means our approach works not only on sets, but also on bags, on provenance, and on many other semantics adaptable to K-relations. To summarize the technical contributions of our approach:
we generalize incomplete databases and certain answers to K-relations, where each possible world is a K-relation. We can derive UA-DBs from probabilistic and incomplete database models, and we prove that the derivation itself is also an under-approximation.
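The two propagation rules from the example above can be sketched with Boolean annotations in plain Python (relations invented): inputs merged by a projection combine their labels with OR, while join partners combine theirs with AND.

```python
# Tuples annotated with a certainty label: True = certain, False = uncertain.
addr = {("State St", "NY"): False, ("Main St", "NY"): False,
        ("Oak St", "IL"): True, ("Elm St", "IL"): False,
        ("5th Ave", "AZ"): True}

# Projection on state: the output is certain if ANY merged input is certain.
proj = {}
for (_street, state), certain in addr.items():
    proj[state] = proj.get(state, False) or certain

# Join on state with a second annotated relation (invented): the output
# is certain only if BOTH joined inputs are certain.
capitals = {("NY", "Albany"): True, ("IL", "Springfield"): True}
join = {(street, state, cap): c1 and c2
        for (street, state), c1 in addr.items()
        for (s2, cap), c2 in capitals.items() if s2 == state}

print(proj)  # NY stays uncertain; IL and AZ become certain
```

Replacing the Booleans with multiplicities (OR becomes +, AND becomes *) gives the bag semantics, which is the K-relation generalization the talk alludes to.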
We also prove that the bounds on certain answers are preserved by standard K-relation query semantics, and we have a query rewriting-based implementation for bags on a traditional relational database. For the experiments, we want to know: do UA-DBs scale well, do UA-DBs have good utility,
and how well does our approach approximate the certain answers? For the data, we use both synthetic and real-world data sets, where the synthetic data set is TPC-H with randomly generated errors.
We compare against three approaches from three categories: approximated certain answers, sampling-based probabilistic databases, and lineage-based probabilistic databases. We run our approach on the synthetic data set and measure the runtime, on a log scale, versus the percentage of uncertainty.
We can see that our approach has roughly 5% overhead compared with conventional query processing, and the low overhead is maintained as the amount of uncertainty increases.
The low overhead also scales well with the data size. In terms of utility, if we use a best guess given by a good machine learning model, for example Spark's machine learning, then we can get precision over 90% and recall over 80%.
If we do not have that information, or it is too expensive to get a best-guess world, here is the result of randomly picking one world from all the possibilities, and we still get decent performance. This is one out of the ten examples of results showing misclassifications of certain answers on real-world data.
It really depends on the data itself, but in this case we get a worst case of 0.8% misclassified tuples. So, by combining the best guess with an under-approximation of the certain answers,
we developed an uncertainty model that bounds the actual certain answers and is lightweight and easy to implement. Our future work includes extending the model to a larger class of queries beyond standard K-relation query semantics, and providing a more precise description of the uncertainty than just bounds.
Our model also plays a role in the Vizier system, so feel free to try the Vizier demo too. That's it. Thank you.