Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances
Video in TIB AV-Portal:
Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances
Formal Metadata
Title 
Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances

Title of Series  
Author 

License 
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2019

Language 
English

Content Metadata
Subject Area  
Abstract 
Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.

00:00
So, for those of you who came here expecting to hear provenance research: sorry, this is a talk that
00:06
starts by talking about what provenance cannot do. So here we see the title: Going Beyond
00:17
Provenance: Explaining Query Answers with Pattern-based Counterbalances. I'm Qitian, and this is a collaboration between two universities. So, a lot of data is being
00:28
collected today. People are eager to analyze this data and try to understand it to inform their actions. But understanding data is hard: when you explore it, there are always going to be surprising results. So there is a need in the database community to develop tools that help users understand those surprising results, and preferably it should be done in a non-technical way so that a wide range of users can benefit from it.
01:06
Explaining query answers has traditionally been based on provenance. A lot of work exists in this area, for example summarization, generalization, and causality-based explanations, and also provenance systems that compute the provenance by rewriting queries. And also,
01:28
some researchers explain surprising results by
01:35
intervention, which means: if you remove a subset of the provenance and the query result moves in the opposite direction, then that subset is a good explanation for the surprising result. So to summarize, all previous work is based on provenance. This fact inspired our
01:59
question: do good explanations only come from provenance? Of course our answer is no, non-provenance data can be useful too. For those of you who have heard this before, sorry that I'm going to tell the same joke again. So, on the left
02:16
is my advisor Boris, and on the right is myself. Suppose Boris is monitoring my work and finds that suddenly I worked only two hours yesterday, and he asks why. If I answered him only based on provenance, my answer could only go as far as "yeah, I worked from nine to eleven", and he is obviously not happy with that, since he expects my working time to be much longer. However, if my answer goes beyond provenance and I tell him that I traveled 48 hours to come to SIGMOD, then this answer will be satisfactory to him and my life will be safe. So, let's look at a real
02:59
dataset example. This is a simplified publication dataset; the schema has author ID, venue, and year. A user writes a query to compute each author's publication count per venue per year, and part of the result is shown here. Now the user sees in the middle that this author's KDD publication count in 2007 is just 1, compared to the adjacent years. So the user asks why, and this forms the question that we are dealing with:
03:32
the "why high" or "why low" question over an aggregate result. The provenance-based
03:40
approach, as I mentioned before, answers such questions by intervention: if you remove part of the provenance, you can make the result go in the opposite direction. That means if removing some tuples could make this author's KDD 2007 count go up, those tuples would be a good explanation. But in this case it is impossible, because removing tuples can never make a count go up; count is monotone.
04:07
Our approach is counterbalancing: just like my low working time can be explained by my high traveling time, this author's low publication number in one venue and year can be explained by a high publication number in another venue or another year. So at a high level,
04:26
when a user asks the question "why is this author's publication count low", they are making two assumptions. One is that there exists a pattern in the data that describes the data; we call such a pattern an aggregate regression pattern, or ARP for short. The other assumption is that the tuple in question is an outlier with respect to that pattern. So to answer the question, we must first mine all the patterns in the database, then look for high outliers that counterbalance the low outlier in the related patterns. Of course, there might be a lot of those counterbalances and not all of them make sense, so we need to rank the results before presenting them to the user. The first part is done offline, so that explanation generation can be done online, interactively with the user's questions. This whole thing forms our framework CAPE, which stands for CounterbAlancing with aggregate Patterns for Explanation. Now I'm going to explain those three parts one by one, starting with
05:32
pattern mining. Before mining, we must define what a pattern is. An example pattern looks like this: for each author, the total publication count is linear over the years. Here we have a set of attributes we call partition attributes, a set of attributes we call predictor attributes, an aggregate function, and a regression model. The pattern indicates that if we fix the values of the partition attributes, the predictor attributes can predict the aggregate result through this model. A pattern can hold locally, for one fixed value of the partition attributes (for example, in this case it holds for author X), or it can hold globally, if it holds for sufficiently many values of the partition attributes, in this case for many authors.
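As a concrete illustration of the local-hold check just described, here is a minimal Python sketch. The goodness-of-fit measure (R²) and the 0.75 threshold are my own illustrative choices, not necessarily what CAPE uses:

```python
import numpy as np

def holds_locally(years, counts, model="linear", min_r2=0.75):
    """Fit the given regression model for one fixed value of the
    partition attributes (e.g. one author) and report whether the
    fit is good enough for the pattern to hold locally."""
    x = np.asarray(years, dtype=float)
    y = np.asarray(counts, dtype=float)
    if model == "constant":
        pred = np.full_like(y, y.mean())
    else:  # linear: count ~ a * year + b
        a, b = np.polyfit(x, y, 1)
        pred = a * x + b
    ss_res = ((y - pred) ** 2).sum()      # residual sum of squares
    ss_tot = ((y - y.mean()) ** 2).sum()  # total sum of squares
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return bool(r2 >= min_r2)

# "For author X, total publications are linear over the years":
print(holds_locally([2004, 2005, 2006, 2007], [2, 4, 6, 8]))  # a perfect linear fit
print(holds_locally([2004, 2005, 2006, 2007], [1, 9, 2, 8]))  # no clear linear trend
```

A global hold would simply require the same check to succeed for sufficiently many partition values.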
06:25
Once we know what a pattern is, we can start talking about mining them. Even though we do this mining offline, a naive approach is still not acceptable, because a pattern basically divides the attributes into three parts: partition attributes, predictor attributes, and attributes not in the pattern. The number of such three-way divisions grows exponentially with the number of attributes, so we must apply some optimizations. The first one is restricting size: we only allow a maximum of four attributes in a pattern, and this alone reduces the number of candidates to a polynomial. The reason we do this is that patterns tend to be general; when there are too many attributes, the pattern becomes too fine-grained, and fine-grained patterns tend to be hard to understand and sometimes not even real patterns. You might wonder why, after fixing so many attributes, the remaining ones can suddenly predict the aggregate; that is not understandable. Second, we can share sort orders while mining: to mine patterns whose aggregate is grouped on some attributes, we need the data sorted on the partition attributes, and if the data is sorted on A, B, C, then it is also sorted on A, B and on A, so in this way we can mine a whole list of patterns with one sort order. Further, we can detect and apply functional dependencies: if a pattern holds and one of its attributes functionally determines another, a related pattern is implied and does not need to be mined separately. Functional dependencies can either be given as input or detected during mining.
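The candidate space and the size restriction can be sketched as follows. The attribute names, the FD encoding, and the exact pruning rule are illustrative assumptions, not the paper's algorithm:

```python
from itertools import combinations

def candidates(attrs, max_size=4, fds=()):
    """Yield (partition, predictor) attribute splits with at most
    max_size attributes in total. Candidates whose predictor attribute
    is functionally determined by the partition are skipped, since the
    predictor would then be constant within each partition."""
    out = []
    for k in range(2, max_size + 1):  # need >= 1 partition and >= 1 predictor
        for subset in combinations(attrs, k):
            for p in range(1, k):
                for part in combinations(subset, p):
                    pred = tuple(a for a in subset if a not in part)
                    # fds is a list of (lhs_attrs, rhs_attr) pairs
                    if any(set(lhs) <= set(part) and rhs in pred
                           for lhs, rhs in fds):
                        continue  # pruned via functional dependency
                    out.append((part, pred))
    return out

# With 3 attributes and max_size=4, the candidate count stays small:
print(len(candidates(["author", "venue", "year"])))
```

Without the `max_size` cap, the number of splits would grow exponentially in `len(attrs)`; with it, the count is polynomial, which is the point of the restriction.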
08:33
Here are some results from our experiments comparing mining times after applying those optimizations. The naive approach grows very fast. The blue line mines patterns using a data cube; the data cube works very well when there are not many attributes, but as the number of attributes grows it starts to waste a lot of time, because we do not need most of those fine-grained aggregate results. In the plot on the right-hand side we compare mining with and without functional dependencies. We chose the Chicago crime data, where there are a lot of functional dependencies because many attributes are geographical, and we gave the miner three functional dependencies; the results on the right show a noticeable speed improvement. There is much more technical detail than fits in this presentation; the ideas are covered in the paper.

Now, to find counterbalances, the first thing we need is a relevant pattern, which forms the basis of the counterbalance search. Not all patterns are relevant to the user's question; for example, a pattern about some other author is not going to help us answer a question about this author. So the first condition for a pattern to be relevant is that it holds on the user question, meaning the values of its partition attributes must come from the user's question tuple. The pattern "for each author and venue, the total publication count is constant over the years" is only relevant if it holds for author X and KDD. It does hold in this case, but the problem is that once we fix author X and KDD, the only variable left is the year, which means we can only look for counterbalances in other years, and as you can see here, the adjacent
years are not much of an outlier. So to find counterbalances, we need to allow the relevant patterns to be more general, which is the second condition: a pattern is also relevant if it generalizes the user question. Here is an example: for each author, the total publication count is linear over the years. The attributes of this pattern, author and year, are a subset of author, year, and venue, so we say it generalizes the user question; of course, it still needs to hold on the question. If it holds, then only author X is fixed, which leaves room for us to find values in other conferences. But how do we do that, given that this pattern does not have the venue attribute itself? To find a similar conference, we need to refine the pattern. Refinement works like this: starting from the pattern "for author X, the total publication count is linear over the years", we refine it by adding more partition attributes, and at the same time we allow the model to change (it does not have to change; all we need is for there to be some model). In this case, as long as "for author X and ICDE, the total publication count is constant over the years" holds, we can start looking for counterbalancing points under this refined pattern. We need to point out that in this example the refinement happens to have the same attributes as the user question, but it does not have to, because this is a very simple example with only three attributes. If there were more attributes, like a publication type of "research paper", we could add those into the refinement too, and if that refined pattern holds, it provides the user with even more information. As it turns out, this refined pattern does hold, so the only thing we need to look for is a high outlier, which is this year: his ICDE publication count is 6 in 2007. So his high ICDE count counterbalances his low KDD publication count of 1.
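The refine-then-find-outliers step might look roughly like this on the running example. The toy data, the constant model fitted as a mean, and the z-score threshold are all invented for illustration:

```python
import numpy as np

rows = [  # (author, venue, year, pub_count) -- invented toy data
    ("X", "KDD",  2006, 4), ("X", "KDD",  2007, 1), ("X", "KDD",  2008, 4),
    ("X", "ICDE", 2006, 2), ("X", "ICDE", 2007, 6), ("X", "ICDE", 2008, 2),
]

def high_outliers(rows, author, z_thresh=1.0):
    """Refine the per-author pattern by adding venue as a partition
    attribute, fit a constant model (the mean) over the years for each
    venue, and return (venue, year, count) tuples far ABOVE the fit."""
    out = []
    venues = {v for a, v, _, _ in rows if a == author}
    for venue in venues:
        pts = [(yr, c) for a, v, yr, c in rows if a == author and v == venue]
        counts = np.array([c for _, c in pts], dtype=float)
        mean, std = counts.mean(), counts.std()
        for yr, c in pts:
            if std > 0 and (c - mean) / std > z_thresh:
                out.append((venue, yr, c))
    return out

print(high_outliers(rows, "X"))  # the high ICDE 2007 count counterbalances low KDD 2007
```

On this toy data, only ICDE 2007 is flagged: the KDD counts never rise far enough above their own mean.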
Here we see the explanations returned by CAPE for this question. They contain this author's publication numbers at other venues and other years; for example, they also contain his ICDE publication count in 2006. But an explanation does not have to have the same schema as the question, for example this one. You probably already see that not all counterbalancing events are good explanations; we need to score them and return the top ones. The score has two parts. The first part comes from the distance between the user's question tuple and the explanation tuple: for the user question (author X, KDD, 2007), an explanation in 2007 is better than one in 2006, and ICDE is better than some conference in another area, like SIGCOMM, which is networking. The second part is based on the deviation of the explanation tuple from its expected value under the pattern, because we believe that higher deviation means more unusual, and more unusual events are more likely to cause other unusual events. That is why in this example we do not just directly use his other KDD publications as counterbalances: in adjacent years he does not have that many KDD publications, so they are not much of an outlier. However, at ICDE in 2007, his publication count is a lot higher than his expected value, so that is a good counterbalance.
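A minimal sketch of such a two-part score: the attribute-count distance, the linear combination, and the weights are assumptions for illustration, not CAPE's actual formula:

```python
def score(question, explanation, deviation, w_dist=1.0, w_dev=1.0):
    """question/explanation are dicts of attribute -> value; deviation
    is how far the explanation's aggregate lies above its model
    prediction. Higher scores rank higher."""
    # Toy distance: number of attributes on which the tuples differ.
    dist = sum(1 for k in question if explanation.get(k) != question[k])
    return w_dev * deviation - w_dist * dist

q  = {"author": "X", "venue": "KDD",  "year": 2007}
e1 = {"author": "X", "venue": "ICDE", "year": 2007}  # same year
e2 = {"author": "X", "venue": "ICDE", "year": 2006}  # different year
# With equal deviation, the closer (same-year) explanation ranks higher:
print(score(q, e1, deviation=2.7) > score(q, e2, deviation=2.7))  # True
```

In practice the distance would weight attributes by semantic similarity (ICDE is closer to KDD than SIGCOMM is), but the ranking idea is the same.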
14:58
To convince you that our system actually does generate meaningful explanations, we provide one more example, this time on a simplified version of the Chicago crime dataset. The schema includes the crime type, the community area the crime happened in, the year, and the count. The user wonders why the battery crime count in 2011 in community area 26 is low, only 16. I am going to go through the top explanations generated by our system one by one. The first one is a more generalized explanation: the total crime number in this area in the year after is higher, which indicates that maybe 2011 was just a year the criminals were hibernating, or maybe it was a year of intensive policing. The next one shows that in this year, battery crime in the adjacent community areas was higher than expected, which indicates that maybe the criminals were slowly migrating. The next one shows that battery crime in the previous year, as a total over Chicago, was high, which is probably why the police department put in more police power the next year. And in the last one, we see that a similar type of crime, assault, in this area in this year is high. The difference is that battery is when some actual harm has been done, while assault is when someone is threatened; so a lot of batteries changing into assaults means the police were actually working that year. You see, each of the explanations gives you a little more information, and as a whole they give you the whole picture: maybe because in the previous year there was too much crime, the Chicago Police Department decided to put in more police power; then in this year the criminals had to migrate to the adjacent neighborhoods, where maybe police patrolling was not as intensive; and maybe, once the pressure eased, the criminals came back the year after.
17:18
So here comes our conclusion, or more of a summarization. Provenance can be insufficient for explaining surprising query results, and reasonable explanations can be given by counterbalances. What we did: we mine patterns offline, then look for counterbalances and rank them online, and present the top ones to the users to help them understand unexpected results.
17:48
In the future, we would like to extend this to a larger class of queries, for example to support joins. And yeah, that's
17:57
it for today. Here is a link to our GitHub repository, and we will also have a demo of CAPE this year; you're welcome to come.