Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances

Video in TIB AV-Portal: Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances

Formal Metadata

CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.
So, for those of you who came here expecting to hear about provenance research: sorry, because this talk starts with what provenance cannot do. The title is "Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances", and this is a collaboration between IIT and Duke University. A lot of data is being collected today; people query and analyze this data, trying to understand it to inform their actions. But understanding data is hard: when you explore data, there are always going to be surprising results, and so there is a need in databases to develop tools that help users understand those surprising results — preferably in a non-technical way, so that a wide range of users can benefit.
Explaining query answers has been studied based on provenance. There is a line of similar work, including generalization- and causality-based explanations, and also provenance systems that compute the provenance by rewriting queries. Some researchers explain surprising results by intervention: if you remove a subset of the provenance and the query result goes in the opposite direction, then that subset is a good explanation for the surprising result. To summarize, all previous work is based on provenance, and this fact inspired the question behind our work.
Do good explanations only come from provenance? Of course not — non-provenance data can be useful too. For those of you who saw the teaser, sorry, I am going to tell the same joke again. On the left is my advisor Boris, and on the right is myself. Suppose Boris is monitoring my work and finds that I worked only two hours yesterday, and he asks why. If I answer him based only on provenance, my answer can only go as far as "I worked from 9 to 11", and he is obviously not happy about that, since he expects my working time to be double. However, if I go beyond provenance and tell him that I traveled 48 hours to come to SIGMOD, the answer will be satisfactory to him, and my life will be safe. So, let's look at a real dataset example.
This is a simplified publication dataset: the schema has author, venue, and year. The user issues a query to compute the total publications per author, venue, and year, and part of the result is shown here. The user sees that this author's KDD publication count in 2007 is just 1, compared to the adjacent years, and asks why. This forms the kind of question we are dealing with: a why-high/why-low question about an aggregate query result. Provenance-based approaches, as I mentioned, answer such questions by intervention: if by removing part of the provenance you can make the result go in the opposite direction — here, make this author's KDD 2007 count go up — then that part would be a good explanation. But in this case it is impossible, because removing tuples can never make a count go up; count is monotone.
Our approach is counterbalancing: just like my low working time can be explained by my high traveling time, a low publication number in one venue and year can be explained by a high publication number in another venue or another year.
When a user asks why this author's publication count is low, they are making two assumptions. One is that there exists a pattern in the data that describes the data — we call such a pattern an aggregate regression pattern, or ARP for short. The other is that the question tuple is an outlier with respect to that pattern. So to answer the question, we must first mine all the patterns in the database, and then look for high outliers that counterbalance the low outlier in the related patterns. Of course, there might be a lot of those counterbalances, and not all of them make sense, so we need to rank the results before presenting them to the user. The first part is done offline, so that explanation generation can be done online, interactively with the user's question. This whole thing forms our framework CAPE, which stands for CounterbAlancing with aggregate Patterns for Explanation. Now I am going to explain these three parts one by one, starting with pattern mining.
Before mining patterns, we must define what a pattern is. An example pattern looks like this: for each author, the total number of publications is linear over the years. It has several parts: the attributes we call partition attributes (here, author), a set of attributes we call predictor attributes (here, year), an aggregate function, and a regression model. The pattern indicates that if we fix the values of the partition attributes, the predictor attributes can predict the aggregate result through the model. A pattern can hold locally, for a fixed value of the partition attributes — in this case it holds for author X — or it can hold globally, if it holds for sufficiently many values of the partition attributes — in this case, for many authors. With patterns defined, we can talk about mining them.
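As a rough illustration, the local/global pattern check described here could be sketched as follows; the function names, the R²-style goodness-of-fit test, and the thresholds are illustrative assumptions, not CAPE's actual implementation:

```python
# Hypothetical sketch: check an ARP such as "for each author,
# SUM(pubs) is linear in year" on pre-aggregated rows.
from collections import defaultdict

def fit_linear(points):
    """Ordinary least squares for y = a*x + b; returns (a, b, r2) or None."""
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        return None
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    ss_tot = sum((y - sy / n) ** 2 for _, y in points)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in points)
    r2 = 1.0 if ss_tot == 0 else 1 - ss_res / ss_tot
    return a, b, r2

def pattern_holds(rows, theta_local=0.75, theta_global=0.5):
    """rows: (author, year, pub_count) aggregates.
    Returns (authors where the linear pattern holds locally,
    whether it holds globally, i.e. for enough authors)."""
    by_author = defaultdict(list)
    for author, year, cnt in rows:
        by_author[author].append((year, cnt))
    local = set()
    for author, pts in by_author.items():
        if len(pts) >= 3:
            fit = fit_linear(pts)
            if fit and fit[2] >= theta_local:  # good fit -> holds locally
                local.add(author)
    holds_globally = len(local) >= theta_global * len(by_author)
    return local, holds_globally
```

Here author X, whose counts grow linearly, would pass the local check, while an author with erratic counts would not.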
Even though we do this mining offline, naive enumeration is still not acceptable, because a pattern divides the attributes into three parts — partition attributes, predictor attributes, and attributes not in the pattern — and the number of such three-way divisions grows exponentially with the number of attributes, so the candidates explode very fast. We must apply some optimizations. The first is restricting size: we allow at most four attributes in a pattern, which alone reduces the number of candidates to polynomial. The reason is that patterns tend to appear at a general level: with too many attributes a pattern becomes too fine-grained, and fine-grained patterns tend to be not understandable and sometimes not even real patterns — you might wonder why, after fixing so many attributes, suddenly something can predict the aggregate. Second, we can share sort orders across candidate patterns: for example, to mine patterns over the aggregate grouped by A, B, C, D, we need the data sorted on the partition attributes, and if we have a group-by on A, B, C sorted as A, B, C, then the data is also sorted on A, B and on A — so we can mine all of those candidates with one sort order. Further, we can detect and apply functional dependencies: for example, if the pattern "for each A, the aggregate follows a linear model over C" holds, and A implies B as a functional dependency, then the pattern "for each A and B, the aggregate as a function of C" holds as well. The functional dependencies can either be given as input or be detected by mining.
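The functional-dependency pruning idea could be sketched like this; the candidate representation, the attribute names, and the single-attribute FD form are simplifying assumptions for illustration:

```python
# Illustrative sketch of FD-based pruning of candidate partitions:
# if A -> B holds, a partition containing both A and B is implied
# by the one without B, so the larger candidate can be skipped.
from itertools import combinations

def prune_with_fds(attributes, fds, max_size=4):
    """attributes: list of attribute names; fds: list of (lhs, rhs)
    pairs meaning lhs functionally determines rhs.
    Yields partition-attribute candidates up to max_size."""
    for k in range(1, max_size + 1):
        for combo in combinations(attributes, k):
            s = set(combo)
            # skip candidates containing both sides of some FD
            if any(lhs in s and rhs in s for lhs, rhs in fds):
                continue
            yield combo
```

On a geographic schema like the Chicago crime data, where e.g. a district might determine its ward, this would drop every candidate that lists both attributes.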
Here are some examples from our experiments comparing the running time after applying those optimizations. The naive approach grows very fast. The blue line is mining patterns via a data cube; the cube works very well when there are not many attributes, but as the number of attributes grows it starts wasting a lot of time, because we do not need most of those fine-grained aggregate results. The plot on the right-hand side compares mining with and without functional dependencies; we chose the Chicago crime dataset, where there are many functional dependencies, since many of the attributes are geographical and we know the FDs among them. The result shows a significant speed improvement. There is more technical detail than fits in this presentation; the big ideas are covered in the paper. Next, to find counterbalances, the first thing we need is a relevant pattern, which forms the basis of the counterbalance — not all patterns are relevant for a given question. For example, a pattern about some other author is not going to help us answer a question about this author. The first condition for a pattern to be relevant is to hold on the question, which means the values of its partition attributes must come from the user's question tuple. So the pattern "for each author and venue, the total publications is constant over the years" is only relevant if it holds for (author X, KDD). It does not hold in this case — but the deeper problem is that here we fixed both author X and KDD, so the only variable left is the year, meaning we could only look for counterbalances in other years — and as you can see, the adjacent years are not much of an outlier.
So to find counterbalances, we need to allow relevant patterns to be more general. That is the second condition: a relevant pattern may generalize the user question. Here is an example: for each author, the total publications is linear over the years. Its attributes — author and year — are a subset of author, year, and venue, so we say it generalizes the user question. Of course it still needs to hold, and if it holds, then only author X is fixed, which leaves room for us to find values in other conferences. But how do we do that, given that this pattern does not have the venue attribute itself? To find a similar conference, we need to refine the pattern. Starting from "for author X, the total publications is linear over the years", we refine by adding more partition attributes, and at the same time we allow the model to change — it does not have to, but all we need is for there to be some model. In this case, if "for (author X, ICDE) the total publications is constant over the years" holds, we can start looking for counterbalancing points in this refined pattern. We should point out that in this example the refined pattern happens to have the same attributes as the user question, but that need not be the case — this is a very simple example with only three attributes. If there are more attributes — say, publication type is "research paper" — then we add those in as well, and if such a refined pattern holds, it provides the user with even more information. As it turns out, this refined pattern does hold, so the only thing we need to look for is a high outlier — which is this year: he has six ICDE publications in 2007 — and that counterbalances his low publication count of 1 at KDD.
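The counterbalance search over a refined pattern could be sketched as follows; the `predict` callback stands in for the refined pattern's regression model, and the deviation threshold is a hypothetical parameter, not taken from the paper:

```python
# Hedged sketch of the counterbalance search: find tuples that
# deviate from the pattern's prediction in the direction OPPOSITE
# to the user's question tuple.
def find_counterbalances(observations, predict, question_key, min_dev=1.0):
    """observations: {key: actual aggregate value};
    predict(key) -> value expected under the refined pattern.
    Returns (key, deviation) pairs sorted by |deviation|, largest first."""
    q_dev = observations[question_key] - predict(question_key)
    out = []
    for key, actual in observations.items():
        if key == question_key:
            continue
        dev = actual - predict(key)
        # opposite sign to the question's deviation, and large enough
        if dev * q_dev < 0 and abs(dev) >= min_dev:
            out.append((key, dev))
    return sorted(out, key=lambda kv: -abs(kv[1]))
```

For the running example, a low KDD 2007 count paired with an ICDE 2007 count well above prediction would surface ICDE 2007 as the counterbalance.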
Here is the explanation returned by CAPE for this question. It contains this author's publication counts at other venues and in other years — for example, it also contains his ICDE publications in 2006 — and an explanation does not have to have the same schema as the question tuple, as this one shows. You probably already see that not all counterbalances are good: we need to score them and return the top ones. The first part of the score comes from the distance between the user's question tuple and the explanation tuple: for the question (author X, KDD, 2007), an explanation in 2007 is better than one in 2006, and ICDE is better than a conference from another area, like SIGCOMM, which is networking. The other part is based on the deviation of the explanation tuple from its expected value under the pattern, because we believe that higher deviation means a more unusual event, and more unusual events are more likely to cause other unusual events. This example also shows why we do not just directly look for counterbalances among his KDD publications: in the adjacent years he does not have many KDD publications, so they are not much of an outlier. At ICDE, however, we can see that in 2007 his publications are a lot higher than the expected value, so this is a good counterbalance.
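The two-part scoring could be sketched like this; the attribute-wise distance, the linear weighting, and the parameter names are illustrative choices, not the paper's exact formula:

```python
# Illustrative scoring sketch combining the two ingredients from the
# talk: closeness of the explanation tuple to the question tuple,
# and how strongly the explanation deviates from its expected value.
def score(question, candidate, deviation, weight=1.0):
    """question/candidate: dicts of attribute -> value.
    deviation: the candidate's gap from its pattern-predicted value.
    Higher score = better explanation."""
    # crude attribute-wise distance: 0 if equal, 1 otherwise
    # (a numeric attribute such as year could use |q - c| instead)
    dist = sum(0 if question.get(a) == candidate.get(a) else 1
               for a in question)
    return abs(deviation) - weight * dist
```

Under this sketch, an ICDE 2007 explanation outranks an ICDE 2006 one with the same deviation, because it matches the question's year.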
To convince you that our system actually generates ranked explanations, here is one more example. This is a simplified version of the Chicago crime dataset; the schema includes the crime type, the community area the crime occurred in, the year, and the count. The user wonders why the battery crime count in 2011 in community area 26 is so low — only 16. I am going to go through the top explanations generated by our system one by one. The first is a more generalized explanation: it says the total crime number in this area in the year after is higher, which indicates that maybe 2011 is just a year when the criminals are hibernating, or maybe it is a year of intense policing. The next one shows that in this year the battery crime count in the adjacent community is higher than expected, which indicates that maybe the criminals are slowly migrating. The next one shows that battery crime in the previous year, as a total over Chicago, was high — which is probably why the police department put in more police power the next year. And in the last one we see that a similar type of crime, assault, in that year and area is high. The difference is that battery is when some actual harm has been done, while assault is when someone threatens harm; so a lot of batteries turning into assaults may mean the police were actually working that year. You can see that each explanation gives you a little more information, and together they give you the whole picture: maybe because there was too much crime in the previous year, the Chicago Police Department decided to put in more police power; then the criminals had to migrate to the adjacent neighborhoods, where police patrolling was perhaps less intensive; and maybe the criminals who stayed got desperate and staged a comeback the year after.
So here comes our conclusion — more of a summary. Provenance can be insufficient for explaining surprising query results, and reasonable explanations can be given by counterbalances. What we did is mine patterns offline, then look for counterbalances and rank them online, and present them to the users to help them understand unexpected results. In the future, we would like to extend this to a larger class of queries, for example to support joins. That is it for today; here is a link to our GitHub, and we will also have a demo of CAPE this year — you are welcome to come.