Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances
Formal Metadata
Title: Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances
Series: SIGMOD 2019 (part 52 of 155)
License: CC Attribution 3.0 Germany: You may use and modify the work or its content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or changed form, provided you credit the author/rights holder in the manner specified by them.
Identifier: 10.5446/42903 (DOI)
Transcript: English (automatically generated)
00:00
And so for those of you who are here expecting to hear provenance research, I'm sorry, but this first talk is about what provenance cannot do. So our title is Going Beyond Provenance: Explaining Query Answers with Pattern-Based Counterbalances. I'm Chi Tien, and this is a collaboration
00:22
between Illinois Tech and Duke University. A lot of data is being collected today. People query and analyze that data and try to understand it to guide their actions. But the problem is that understanding data is hard. When you expect to find one thing,
00:41
there are always going to be surprising results. So there is a need in databases to develop tools that help users understand those surprising query results. Preferably, this should be done in a non-technical way
01:00
so that a wide range of users can benefit from it. Explaining query results has traditionally been done based on provenance. There is a lot of work there: provenance semirings, which are a good generalization, and causality-based explanations. There are also provenance systems
01:20
that compute provenance by re-enacting queries. And some researchers explain these surprising results by intervention: if you remove a subset of the provenance
01:41
and your query result moves in the opposite direction, then that subset is a good explanation for the surprising result. So to summarize, all previous work is based on provenance, and this fact inspired our question:
02:00
do good explanations only come from provenance? Of course, our answer is no; non-provenance tuples can be useful. For those of you who saw the teaser, I'm sorry, but I'm going to tell the same joke again. On the left is my advisor, Boris. On the right is myself.
02:20
If Boris is monitoring my work and finds that I suddenly worked only two hours yesterday, he will ask me why. If I answer him only based on provenance, my answer can only go as far as: yeah, I worked from 9 to 11. He's obviously not going to be happy about that, and I can expect my stipend to be cut.
02:44
However, if my answer goes beyond provenance and I tell him that I traveled for eight hours to come to SIGMOD, this answer will be satisfactory to him, and my money will be safe. So let's look at a real data set example.
03:01
This is a simplified publication data set. The schema is author ID, year, and venue. A user may use the query shown here to compute each author's publication count per venue per year, and part of the result is shown here. Now the user may notice in the middle that
03:21
this author's KDD publication count in 2007 is just one, while in the adjacent years there are four. So they may ask why, and this is the kind of question we are dealing with: a why-high or why-low question over the result of an aggregate query. The provenance-based approach
03:41
to answering such questions, as I mentioned before, is intervention: if removing part of the provenance makes the result move in the opposite direction, which here means making this author's KDD 2007 count go up, then that part would be a good explanation. But in this case it is impossible,
04:01
because removing tuples can never make a count go up; count is monotone. Our approach is through counterbalance: just as my low working time can be explained by my high traveling time, an author's low publication count in one venue and year can be explained by a high publication count in another venue or another year.
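To make the setup concrete, here is a minimal sketch of that aggregate query and the why-low question over a toy instance of the simplified schema; the table name, the rows, and the counts are invented for illustration and are not the data from the talk:

```python
import sqlite3

# Toy instance of the simplified publication schema (author, year, venue);
# the concrete rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pubs (author TEXT, year INTEGER, venue TEXT)")
rows = ([("X", 2006, "KDD")] * 4 + [("X", 2007, "KDD")] * 1 +
        [("X", 2008, "KDD")] * 4 + [("X", 2007, "ICDE")] * 6)
conn.executemany("INSERT INTO pubs VALUES (?, ?, ?)", rows)

# The aggregate query: each author's publication count per venue per year.
result = conn.execute("""
    SELECT author, venue, year, COUNT(*) AS cnt
    FROM pubs
    GROUP BY author, venue, year
    ORDER BY author, venue, year
""").fetchall()
for author, venue, year, cnt in result:
    print(author, venue, year, cnt)

# The user question tuple: why is ('X', 'KDD', 2007) only 1?
# Removing provenance tuples can only lower a COUNT, so intervention
# alone cannot explain a why-low question over this aggregate.
```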
04:23
Let me explain in detail. When a user asks that question, why is this author's publication count low, they are making two assumptions. One is that there exists a pattern in the data that describes the data; we call it an aggregate regression pattern,
04:41
or ARP for short. The other assumption is that the question tuple is a low outlier. So to answer this question, we must first mine all the patterns in the database, and then look for high outliers in related patterns that counterbalance the low outlier.
05:02
Of course, there might be a lot of those counterbalances, and not all of them make sense, so we need to rank the results and present the top k to the user. The first part is done offline, so that explanation generation can be done online, interactively with the user question. This whole thing forms our framework CAPE,
05:21
which stands for Counterbalancing with Aggregate Patterns for Explanation. Now I'm going to explain those three parts one by one. Before we talk about mining, we must talk about patterns first. An example pattern looks like this: for each author, the total publication count is linear over the years.
05:42
So a pattern has a set of attributes we call partition attributes, a set of attributes we call predictor attributes, an aggregate function, and a regression model. A pattern indicates that if we fix the values of the partition attributes, then the predictor attributes can predict
06:00
the aggregate result through this model. A pattern can hold locally on a fixed value of the partition attributes, for example, in this case, on author X. Or it can hold globally if it holds for sufficiently many values of the partition attributes, in this case, for a good number of authors.
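As a rough sketch of how such a pattern could be represented and checked locally (my own simplification, not the CAPE implementation; the goodness-of-fit threshold and tolerances are assumptions):

```python
from dataclasses import dataclass
from statistics import linear_regression, mean

@dataclass
class ARP:
    partition: tuple      # e.g. ("author",)
    predictor: tuple      # e.g. ("year",)
    aggregate: str        # e.g. "count(*)"
    model: str            # e.g. "linear" or "constant"

def holds_locally(xs, ys, model, tol=0.75):
    """Check whether the aggregate values ys, indexed by predictor values xs,
    follow the given model for one fixed partition value.
    tol is an assumed goodness-of-fit threshold (R^2 for 'linear')."""
    if model == "constant":
        return max(ys) - min(ys) <= 1          # assumed tolerance
    slope, intercept = linear_regression(xs, ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean(ys)) ** 2 for y in ys) or 1.0
    return 1 - ss_res / ss_tot >= tol          # R^2 over this partition

# Example: "for each author, the total publication count is linear over the
# years", checked locally on author X with invented yearly totals.
pattern = ARP(partition=("author",), predictor=("year",),
              aggregate="count(*)", model="linear")
years, totals = [2005, 2006, 2007, 2008], [5, 6, 8, 9]
print(holds_locally(years, totals, pattern.model))
```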
06:20
After we know what a pattern is, we can start talking about mining. Even though we do this mining offline, brute force is still not acceptable, because a pattern basically divides the attributes into three parts: partition attributes, predictor attributes,
06:42
and attributes that are not in the pattern. Because of this three-way division, the number of candidates grows exponentially in the number of attributes, which explodes very fast, so we must apply some optimizations. The first one is restricting the size: we only allow a maximum of four attributes in a pattern.
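As a back-of-the-envelope illustration of that counting argument (assuming both the partition and predictor sets are non-empty and together contain at most four attributes; this is not the mining algorithm itself):

```python
from math import comb

attrs = ["author", "venue", "year", "month", "type", "affiliation"]
n = len(attrs)

# Without a size limit, every attribute independently takes one of three roles
# (partition, predictor, or unused), so candidates grow roughly like 3^n.
unrestricted = 3 ** n

# With at most four attributes per pattern, we choose a small partition set
# and a small predictor set, which is polynomial in n; the gap widens quickly
# as the number of attributes grows.
restricted = sum(
    comb(n, k) * sum(comb(n - k, p) for p in range(1, 5 - k))
    for k in range(1, 4)
)
print(unrestricted, restricted)
```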
07:02
This restriction alone reduces the number of candidate patterns to polynomial. The reason we do this is that patterns tend to appear when we generalize. When there are too many attributes, the pattern becomes too fine-grained, and fine-grained patterns tend not to be understandable
07:21
and are sometimes not even real patterns. You might wonder why, after fixing so many values, something can suddenly predict the aggregate; it is just not understandable. We can also reuse sort orders, because to mine patterns, for example patterns coming out of an aggregate query
07:40
with group-by on A, B, C, D, we need to sort on the partition attributes. And if we have group-by attributes A, B, C, D and the data is sorted on A, B, C, then it is also sorted on A, B and on A. So in this way, we can mine all the patterns listed there with one sort order.
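A small sketch of that reuse idea, using invented rows: one sort on (A, B, C) serves every partition prefix, so each candidate prefix can be grouped without re-sorting:

```python
from itertools import groupby
from operator import itemgetter

# Invented rows with schema (A, B, C, D, measure).
rows = [("a1", "b1", "c1", "d1", 3), ("a1", "b1", "c2", "d1", 5),
        ("a1", "b2", "c1", "d2", 2), ("a2", "b1", "c1", "d1", 7)]

# One sort on (A, B, C) serves the partition prefixes (A), (A, B), (A, B, C).
rows.sort(key=itemgetter(0, 1, 2))

for width in (1, 2, 3):
    key = itemgetter(*range(width))
    print(f"partitions on the first {width} attribute(s):")
    for part_value, group in groupby(rows, key=key):
        group = list(group)
        # Within each partition we would fit the regression model for the
        # candidate pattern; here we only show the partition sizes.
        print(" ", part_value, len(group))
```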
08:02
Furthermore, we can detect and apply functional dependencies. For example: if the pattern holds that for each A, C can predict this aggregate function over a linear model, and A implies B as a functional dependency,
08:21
then the following pattern, for each A and B this aggregate function is linear over C, holds trivially. Functional dependencies can either be given as input, or we can detect them as we mine.
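A hedged sketch of how such a functional-dependency check could be used to skip candidates; the attribute names and the dictionary-based FD representation are assumptions of this illustration:

```python
# Known or detected functional dependencies, e.g. A -> B from the slide.
fds = {("A",): {"B"}}

def implied_by_fd(partition_attrs, mined_partitions):
    """Skip a candidate whose partition set only extends an already-mined
    partition by attributes that are functionally determined by it."""
    for mined in mined_partitions:
        if set(mined) <= set(partition_attrs):
            extra = set(partition_attrs) - set(mined)
            determined = fds.get(tuple(sorted(mined)), set())
            if extra and extra <= determined:
                return True   # e.g. (A, B) holds trivially once (A) holds and A -> B
    return False

mined = [("A",)]                          # "for each A, agg is linear in C" already holds
print(implied_by_fd(("A", "B"), mined))   # True: no need to mine this candidate again
```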
08:43
Here are some examples from our experiments comparing the running time after we apply those optimizations. You can see that the naive approach explodes very fast. The blue line compares our pattern mining against a data cube approach, and the data cube works very well when there are not so many attributes.
09:01
But as the number of attributes grows, it starts to waste a lot of time on querying, because we do not need most of those fine-grained aggregate results. In the plot on the right-hand side, we compare mining with and without functional dependencies. We chose eight attributes from the Chicago crime data,
09:21
which contains a lot of functional dependencies because many attributes are geographical. We know there are at least three functional dependencies, and the results on the right show roughly a 20% speed improvement. And that is as much technical detail
09:40
as we are going to cover in this presentation, because the idea that comes next is the cooler part of this paper. So, in order to find a counterbalance, the first thing we need is a relevant pattern, which forms the basis of the counterbalance, because not all patterns are relevant for a question.
10:01
For example, a pattern about some other author is not going to help us answer a question about this author. So the first condition for a pattern to be relevant is that it needs to hold on the question, which means the partition attribute values must come
10:22
from the user question tuple. So the pattern, for each author and venue the total publication count is constant over the years, is only relevant if it holds on author X and KDD. And it does hold in this case, but the problem is that here
10:42
we fixed both author X and KDD, so the only variable left is the year. That means we can only look for counterbalances in other years, and you can see that the adjacent years are not much of an outlier. So in order to find counterbalances in other conferences,
11:01
we need to allow our relevant pattern to be more general. That is the second condition: the pattern must generalize the user question. Here is an example: for each author, the total publication count is linear over the years. Here, author and year
11:21
are the attributes in this pattern, and they are a subset of author, year, and venue, so we say it generalizes the user question. Of course, it still needs to hold on author X, and if it holds, only author X is fixed, which leaves room for us to find counterbalances in other conferences.
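Putting the two relevance conditions together, a sketch of the check could look like this (my own paraphrase; the holds_on lookup is assumed to consult the offline mining result):

```python
from dataclasses import dataclass

@dataclass
class ARP:                 # same shape as the pattern sketch above
    partition: tuple
    predictor: tuple
    aggregate: str
    model: str

def is_relevant(pattern, question_groupby, question_values, holds_on):
    """A mined pattern is relevant to a user question if
    (1) its attributes are a subset of the question's group-by attributes
        (it generalizes the question), and
    (2) it holds on the partition values taken from the question tuple."""
    pattern_attrs = set(pattern.partition) | set(pattern.predictor)
    if not pattern_attrs <= set(question_groupby):
        return False
    fixed = {a: question_values[a] for a in pattern.partition}
    return holds_on(pattern, fixed)

# Example: "for each author, total publications are linear over the years"
# is relevant to the question (author=X, venue=KDD, year=2007, count=1),
# assuming it holds on author X (stubbed out here).
p = ARP(("author",), ("year",), "count(*)", "linear")
print(is_relevant(p, ("author", "venue", "year"),
                  {"author": "X", "venue": "KDD", "year": 2007},
                  holds_on=lambda pat, fixed: True))
```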
11:42
But how do we do that? This pattern does not mention the conference itself, so in order to find a counterbalance on a conference, we need to refine the pattern. That leads us to refinement. Take the pattern from before: for author X, the total publication count is linear over the years.
12:01
We can refine it by adding more partition attributes. At the same time, we allow the model type to change; it does not have to stay the same, all we need is for some model to hold. In this case, as long as the refined pattern holds, for author X and ICDE the total publication count
12:20
is constant over the years, we can start looking for counterbalancing points on this pattern. We should point out that in this example the refinement happens to end up with the same attributes as the user question, author, venue, and year, but it does not have to, because this is a very simple example with only these three attributes. If there are more attributes, like publication type,
12:42
either research paper or demo paper, then we can add those too, and if that refined pattern holds, it can provide the user with even more information. As it turns out, this pattern does hold.
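A minimal sketch of that refinement step, assuming we refine by adding one partition attribute at a time and trying a small set of model types; which refinements actually hold still has to be verified against the data:

```python
from itertools import product

def refinements(partition, predictor, all_attrs, models=("constant", "linear")):
    """Generate refinement candidates of a relevant pattern by adding one more
    partition attribute and allowing the model type to change; whether a
    candidate actually holds must still be checked against the data."""
    unused = [a for a in all_attrs if a not in partition and a not in predictor]
    for attr, model in product(unused, models):
        yield tuple(partition) + (attr,), tuple(predictor), model

# Refining "for author X, total publications are linear over the years" by
# adding 'venue' as a partition attribute; with more attributes (publication
# type, etc.) each of them would be tried as well.
for part, pred, model in refinements(("author",), ("year",),
                                     ("author", "venue", "year")):
    print(part, pred, model)
```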
13:02
So the only thing we need to look for is a high outlier, which is this author's publication count of six at ICDE in 2007. That counterbalances the low publication count at KDD. So let's look at the explanations returned by CAPE
13:21
for this question. They contain this author's publication counts at other venues or in other years; for example, his ICDE publications in 2006 and his VLDB publications in 2007. An explanation does not even have to have the same schema as the user question, like this one, for example.
13:43
But you can probably already see that not all counterbalances are good; we need to score them and return the top ones. The scoring is first based on the distance between the user question tuple and the explanation tuple. For example, for the user question
14:01
of author X's single KDD paper in 2007, an explanation in 2007 would be better than one in 2006, and ICDE would be better than a conference in another area, like SIGCOMM, which is a networking venue. The other part of the score is based on the deviation of the explanation tuple from its expected value,
14:22
because we believe that higher deviation means more unusual, and more unusual events are more likely to cause other unusual events. That is why, in this example, we do not just directly look for a counterbalance among his KDD publications: in the adjacent years there are not that many publications,
14:43
so they are not much of an outlier. However, at ICDE we can see that in 2007 his publication count is a lot higher than its expected value, so that would be a good counterbalance.
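A rough sketch of how those two scoring ingredients might be combined; the similarity measure, the weighting, and all numbers here are assumptions for illustration, not the paper's exact scoring function:

```python
def score(question, candidate, expected, similarity, weight=0.5):
    """Score a counterbalance candidate by (1) how close it is to the question
    tuple and (2) how far its value deviates from the value the pattern predicts.
    similarity(question, candidate) in [0, 1] and the weight are assumptions."""
    deviation = abs(candidate["value"] - expected)
    return weight * similarity(question, candidate) + (1 - weight) * deviation

# Invented example: ICDE 2007 (6 papers vs. ~2 expected) should outrank a
# far-away venue and year with only a small deviation.
question = {"author": "X", "venue": "KDD", "year": 2007, "value": 1}

def similarity(q, c):
    # Toy closeness: same year and a data-management venue count for more
    # than, say, SIGCOMM in 2004; a real measure would use attribute distances.
    return (q["year"] == c["year"]) * 0.5 + (c["venue"] in {"ICDE", "VLDB", "KDD"}) * 0.5

cand_a = {"venue": "ICDE", "year": 2007, "value": 6}
cand_b = {"venue": "SIGCOMM", "year": 2004, "value": 3}
print(score(question, cand_a, expected=2, similarity=similarity))
print(score(question, cand_b, expected=2.5, similarity=similarity))
```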
15:01
actually does generate good explanations, we provide one more example. This is a simplified version of the Chicago crime data; Chicago is where Illinois Tech is located. It is simplified down to the crime type, the community area the crime occurred in, and the year the crime happened.
15:22
The user may wonder: why is the number of battery crimes in community area 26 in 2011 so low, only 16? I'm going to go through the top explanations generated by our system one by one. The first one is a more generalized explanation. It says the total crime count in this area
15:44
in the following year is high, which indicates that maybe 2011 is just a year in which the criminals are hibernating, or maybe it is a police-intensive year. The next one shows that in the same year, battery crime in the adjacent community area
16:00
is higher than expected, which indicates that maybe the criminals are slowly migrating. The next one shows that the total battery crime in Chicago in the previous year is high, which is probably why the police department put more police power into the area the next year.
16:21
And in the last one, we see that a similar type of crime, assault, is high in that area in that year. The difference between battery and assault is that battery is when some harm has actually been done, while assault is when someone is threatened with harm. So a lot of batteries turning into assaults
16:40
suggests that the police were actually working that year. So you see, each explanation tuple gives you a little more information, and together they give you the whole picture: maybe because there was too much crime in the previous year, the Chicago Police Department decided to put in more police power, and that year the criminals had to migrate
17:03
to the adjacent neighborhood, where maybe police patrols are not as intensive; and then maybe the police department ran out of money, or the criminals got too desperate, and they just came back worse the year after. So here comes our conclusion, or rather a summary.
17:22
Provenance can be insufficient for explaining surprising query results, and reasonable explanations can be given by counterbalances. What we did: we mined patterns offline, then looked for counterbalances and ranked them online, and presented the top-ranked ones to the users
17:41
to help them understand the unexpected results. In the future, we would like to extend this to a larger class of queries, for example to support joins. And that's it for today. Here is a link to the GitHub repository,
18:01
and we will also have a demo of CAPE at VLDB this year. You're welcome to come.