Ariadne: Online Provenance for Big Graph Analytics
Formal Metadata
Title 
Ariadne: Online Provenance for Big Graph Analytics

Title of Series  
Author 

License 
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2019

Language 
English

Content Metadata
Subject Area  
Abstract 
Data provenance is a powerful tool for debugging largescale analytics on batch processing systems. This paper presents Ariadne, a system for capturing and querying provenance from VertexCentric graph processing systems. While the size of provenance from mapreducestyle workflows is often a fraction of the input data size, graph algorithms iterate over the input graph many times, producing provenance much larger than the input graph. And though current provenance tracing procedures support explicit debugging scenarios, like crashculprit determination, developers are increasingly interested in the behavior of analytics when a crash or exception does not occur. To address this challenge, Ariadne offers developers a concise declarative query language to capture and query graph analytics provenance. Exploiting the formal semantics of this datalogbased language, we identify useful query classes that can run while an analytic computes. Experiments with various analytics and realworld datasets show the overhead of online querying is 1.3x over the baseline vs. 8x for the traditional approach. These experiments also illustrate how Ariadne's query language supports execution monitoring and performance optimization for graph analytics.

00:00
Graph (mathematics)
Graph (mathematics)
Magnetooptical drive
System programming
Analytic set
8 (number)
Identity management
00:15
Area
Graph theory
Software
Natural number
Interactive television
Representation (politics)
Computer network
Graph theory
00:34
Scale (map)
Programming paradigm
Graph (mathematics)
State of matter
Graph (mathematics)
Model theory
Software developer
Web page
Analytic set
Parallel port
Parallel computing
Process (computing)
Facebook
Googol
System programming
Computer programming
System programming
Medizinische Informatik
Vertex (graph theory)
Graph theory
01:15
Web page
Graph (mathematics)
Multiplication sign
Online help
Function (mathematics)
Entire function
Scalability
Order of magnitude
Crash (computing)
System programming
Ranking
output
Graph theory
Scale (map)
Graph (mathematics)
Scaling (geometry)
Software developer
Web page
Expression
Mathematical analysis
Analytic set
Motion capture
Line (geometry)
Scalability
Entire function
Graph theory
Process (computing)
System programming
output
Resultant
03:14
Web page
Model theory
Graph (mathematics)
Mereology
Declarative programming
Performance appraisal
Declarative programming
Computer programming
Query language
System programming
Ranking
Vertex (graph theory)
Message passing
Sequence diagram
Form (programming)
Graph (mathematics)
Information
Inheritance (objectoriented programming)
Software developer
Web page
Model theory
Analytic set
Scalability
Formal language
Performance appraisal
Message passing
Personal digital assistant
Query language
Shader <Informatik>
Vertex (graph theory)
Iteration
Ranking
Declarative programming
Sequence diagram
05:15
Expression
Graph (mathematics)
Scaling (geometry)
Multiplication sign
Traverse (surveying)
Number
Formal language
Query language
Performance appraisal
Query language
Medizinische Informatik
Vertex (graph theory)
Process (computing)
Recursion
Message passing
Graph (mathematics)
Inheritance (objectoriented programming)
Analytic set
Instance (computer science)
Evolute
Formal language
Message passing
Query language
Time evolution
System programming
Vertex (graph theory)
output
Recursion
Sequence diagram
Extension (kinesiology)
06:26
Expression
Graph (mathematics)
Graph (mathematics)
Scaling (geometry)
Parallel port
Analytic set
Instance (computer science)
Formal language
Query language
Performance appraisal
Message passing
Performance appraisal
Query language
Query language
System programming
Computer science
System programming
Process (computing)
Eccentricity (mathematics)
Recursion
Extension (kinesiology)
06:55
Graph (mathematics)
Link (knot theory)
Analytic set
Approximation
Approximation
Formal language
Mathematics
Message passing
Personal digital assistant
Query language
Query language
Vertex (graph theory)
Medizinische Informatik
Iteration
Message passing
Sequence diagram
Mathematical optimization
Annihilator (ring theory)
Sequence diagram
07:56
Building
Graph (mathematics)
Robot
Translation (relic)
Mereology
Approximation
Performance appraisal
Query language
System programming
Ranking
Vertex (graph theory)
Eccentricity (mathematics)
Mathematical optimization
Graph (mathematics)
Information
Block (periodic table)
Software developer
Web page
Motion capture
Approximation
Query language
Shader <Informatik>
System programming
Vertex (graph theory)
Mathematical optimization
Declarative programming
Resultant
09:01
Web page
Run time (program lifecycle phase)
Overhead (computing)
Identifiability
Structural load
State of matter
Graph (mathematics)
Multiplication sign
Motion capture
Approximation
Entire function
Order of magnitude
Order (biology)
Performance appraisal
Readonly memory
Semiconductor memory
Computer programming
Query language
System programming
Ranking
Medizinische Informatik
Vertex (graph theory)
output
Mathematical optimization
Eccentricity (mathematics)
Social class
Default (computer science)
Overhead (computing)
Graph (mathematics)
Run time (program lifecycle phase)
Software developer
Web page
Order of magnitude
Approximation
Entire function
Performance appraisal
Graph theory
Query language
Function (mathematics)
Series (mathematics)
System programming
output
Mathematical optimization
Resultant
Inverter (logic gate)
Reduction of order
Asynchronous Transfer Mode
10:28
Quark
Graph (mathematics)
Structural load
Graph (mathematics)
Multiplication sign
Direction (geometry)
Faulttolerant system
Entire function
Entire function
Performance appraisal
Order (biology)
Message passing
Performance appraisal
Query language
Inference
Query language
Vertex (graph theory)
Social class
11:50
Web page
Area
Mobile app
Graph (mathematics)
Table (information)
Inheritance (objectoriented programming)
Graph (mathematics)
Correspondence (mathematics)
Web page
Analytic set
Performance appraisal
Proof theory
Message passing
Performance appraisal
Query language
Query language
Vertex (graph theory)
Ranking
Medizinische Informatik
Table (information)
Message passing
Sequence diagram
13:16
Point (geometry)
Web page
Overhead (computing)
Run time (program lifecycle phase)
Graph (mathematics)
Multiplication sign
Range (statistics)
Mathematical analysis
Analytic set
Mereology
Diameter
Performance appraisal
Degree (graph theory)
Semiconductor memory
Query language
Energy level
Cuboid
Spacetime
Ranking
Mathematical optimization
Overhead (computing)
Graph (mathematics)
Web page
Mathematical analysis
Approximation
Entire function
Graph theory
Performance appraisal
Word
Query language
Reduction of order
Spacetime
14:52
Graph (mathematics)
Run time (program lifecycle phase)
Software developer
Graph (mathematics)
Run time (program lifecycle phase)
Multiplication sign
Software developer
Mathematical analysis
Analytic set
Mathematical analysis
Scalability
Declarative programming
Scalability
Performance appraisal
Performance appraisal
Personal digital assistant
Software framework
Software framework
Declarative programming
00:02
but in about system that adenine but which offers online provenance
00:08
for big graph analytics this is joint work with my advisor selling Dighton can yoke
00:15
so graphs are a natural representation in many areas to denote entities and relationships between them graphs are important we use them in our everyday activities when we connect with friends when we buy online we use them in a
00:29
research tool discovered jack interactions to map the human brain network the graphs that we're interested in this
00:37
work are big the social state has 1 comma decimal 4 billion Veradis's their webscale has 40 would be on the verge this again and the human brain has 100 trillion edges to assist
00:52
developers in a developing big graph analytics we have a big graph processing systems that offer distributed and parallel computation in their majority they follow the verdict centric programming model in which computations happens in parallel on Beate says and the systems are mature they object like by industry and are able to skate trillions of edges now we have
01:17
systems that assist developers and building the graph analytics but there's not too in help developers tuning their analytics so let's see an example we have Peichun that is running on a bird centric system on an input graph that the user provides and we get the output of page on now what developers would like question that we would like to have answered what if they're a malformed inputs in the input graph what if there are missing values or unexpected findings more over they would like to answer questions about whether there about in there elliptic or even more importantly what if the analytic is inefficient and how can we optimize it finally they have questions about the output of the analysis what is there correct values what if the output is of low quality how can we improve the quality or what if there are absent violence the current this is in very time consuming process but we have provenance which can help us answer these questions however
02:25
prominence on big graph analytics introduces some special challenges the first one is scalability the prominence for big graph analytics is orders of magnitude larger than the input graph and the traditional approach of capturing everything and tracing of line is not scalable more over traditional tracing it does not scale because the graphs are interconnected and the result of a tribes trace might be the entire input graph the other challenges we have is on expressiveness traditional tracing is not customizable and we see that the developers of the graph analytics are interested in scenarios that go way beyond dividing so they're interested in tuning the analytics in scenarios that have not involved the crash that's why in this paper we
03:16
present body at me we introduce a former provenance model for big graph analytics we propose a declarative language that allows developers to customize capturing and querying to their needs we offer a standard provenance squaring in the form of layered offline evaluation and we introduce a novel evaluation methodology for online aquarium that does not require capturing any provenance and we show how provenance can be used in novel use cases that go beyond debugging
03:51
so as we said we are gonna be we're interested on big graph analytics that are running on verdict centric systems so that that's see the verdict centric model through the example of staterun so and avert extended model that we have the vertices that compute the verdicts program each vertex computes the same vertex program at Super steps or iterations so we can divide the verdicts program in the basic steps and the 1st step a verdicts receives messages from its neighbors it computes its new page on having received the rank of its neighbors and then sends messages to its neighbors and this happens across iterations so we have superstep I I plus one I passed to the same bird it's program so in such a setting what is the information that we would be we would like to have as part of provenance so I would like to have the information of when a verdicts computed what was the verdicts value and what were the messages that sent and received now in order to represent this provenance the requirement that we have is we want it to be able to want to be able to use it in a distributed and parallel fashion and we wanted to be a to declarative query so that's why we
05:17
introduce the provenance graph we have nodes and the Provenance graph that the presented vertex computations are the super stepped so we have the notes for instance that the note vertex a computing answer burst that I I plus 1 and I passed to we have evolution edges that connect nodes that the presented subsequent computations of the same vertex and we have many such edges that are annotated with the message values and our connecting noted that of presented neighbors in the input graph that communicated between each other and send or receive messages between each other now the size of the provenance graph is the input graph times the number of super steps of the job analytic computed now
06:09
what we would like to query this provenance graph we model problems is a graph the queries that we want to run on it unbounded graph traversal hence our language must you must support recursion moreover since the problems graph is a big graph we need distributed and
06:26
parallel evaluation that's why we're going to use vertex centric Datalog which we introduced in a prior work essentially what verdicts 108 data that does it take stock queries and compiles them into graph analytics that can run on bird eccentric systems and we extended bird extended Datalog with the provenance vocabulary so that it was can be fair to for instance messages sent verdict survives
06:52
the super steps etc. so now that we
06:57
have the the graph and the language it's see an interesting link or we use case that continues provenance for so the idea is approximate optimization and essentially what we want to see is if we
07:09
can suggest for fixedpoint graph analytics if it's possible for them to sacrifice accuracy for speed so essentially iterate were less iterations by introducing a small amount of air in the finer design and how this is achieved vertices essentially message their neighbors only if they experience a lot of faith in their values and if a verdict that receive any message from its neighbors it's not going to compute in a superstep so here we see an example of the death of query that does this approximate optimization I'm not gonna go over it you can see the details in the paper but basically what the query does it identifies Myrtice's vertex computations that are safe or unsafe to be skipped so now I have
07:57
presented the basic building blocks of now let's see how we can put them to use so we have paid trunk again running on the bot eccentric system and we have our developer that provides a query where she specifies declaratively what information she wants to have a part of the province she gives this query to Adi Avenue and Avenue compiles sent into a pick a vertex program that runs alongside a child and results in capturing the provenance graph that in the next step
08:33
and she said the developer can load the provenance graph using only avenue into the verdict centric system an issue the approximate optimization query to analyze this provenance graph again not adding translates as query into a vertex problem evaluates this on the verdicts on the provenance graph and the result we have typically results and now this is all flank wearing we capture 1st and carry offline however as we said earlier on introduces a
09:03
novel evaluation methodology that does not require capturing which we coined online querying so we have pay child and the developer now issues immediately the the approximate optimization query she wants to run she doesn't capture any provenance she gives this query to Adi Abney of me comprises into the buttocks program and at the end of the computations we have both the result of page on end the stick query results so both provenance wearing and Peichun are executed simultaneously now online provenance evaluation actually offers great performance benefits as it reduces the runtime overhead from 3 to 8 times to 1 comma decimal 3 times so let's see
09:55
how we achieve and how party and the state offline Provence offline provenance querying and how it achieves online wearing so the default mode of evaluating invert eccentric systems is to load the entire graph in memory however for provenance graphs so this is not scalable and because the provenance graph orders of magnitude larger than the input graph so the question is can we do better and the answer is yes we identify classes of queries that can be evaluated layer at a time so what we do
10:30
essentially is we take the province drafting and divided into layers and then we have query evaluation assigned bordering on these layers so that all nodes in 1 layer of the problems that evaluate 1st the nodes in the next layer evaluate and so on and this allows us to avoid materializing the entire provenance graph but rather evaluate only on 1 layer at a time now the question is can we infer syntactic i.e. but just by looking at the queries for which query such a layer at a time evaluation is possible and this
11:03
work we identified 2 classes of queries we have the Parker dated queries in which the layers of the provenance Draft are loaded in the opposite direction then and computed so 1st we load the layer layer layer I and evaluate the query on these notes then they send messages to layer I minus 1 and we have the nodes in layer I minus 1 evaluating the query we also have the quark forward were directed queries query grace in which we load the provenance graph layers in the same direction the analytic computed so 1st we have the notes and later I minus 1 evaluating the sending messages then to the nodes and later I and then the notes and later I evaluate now for forward direction
11:52
queries actually we see that there's a correspondence between the apical evaluation that happens on the layer of a provenance graph and to the vertices that compute at that corresponding so as we said in the forward directed queries we have the nodes and later I minus 1 value and sending a message to the notes and later I in pay chance we have vertices B C and b that correspond to the notes in layer II minus 1 computing Peichun and sending their messages you know date which corresponds to the northern layer II and
12:32
this allows us to actually evaluate simultaneously both pickle and analytic now the proofs are in the paper and you can look into how we achieve this and prove that it is correct but basically what area the dust is it depends making evaluation to pay chance so that at every supers that every vertex evaluates simultaneously both the patient and take a more over I have any app pens to the messages that the vertices would normally send with the wrong it upends also picnic tables that are there is either of the query evaluation and the messaging happens as it would normally happen in page now let's move
13:18
on to the experiment so we experimented with 3 other word graphs whose size ranges from 200 million to 1 billion edges we experimented with for graph Analytix and you can find an extensive experimentation sued in the paper at the height of the outcomes of experiments are that Adi Avenue reduces space and time overhead of capturing IT and improve squaring through layer at evaluation offline evaluation and online evaluation an idea and supports the traditional forward and backward in the early nach tracing but also of novel analysis queries such it will as always on monitoring and approximate optimization here I'm going
14:07
to show and the runtime of evaluating the approximate optimization query on page on so we had the blue box that shows Peichun without any provenance we have that pink part which is Peichun running alongside online thickly evaluation in which the overhead is 1 point 3 times we have the green part in which we have offline layer at a time evaluation which introduces an overhead of 4 times and we have offline querying in which we use the straightforward approach of loading the entire provenance graph in memory and we see that the red bar is not able to skate to the level 2 largest datasets so and with this I
14:54
conclude my talk so I presented to you a declarative framework for prominence for big graph analytics we introduce we show the provenance can be used in Nobel use cases that go beyond traditional dividing and I show to you that we offer scalable offline evaluation through layer at a time evaluation an online evaluation with that that most that does not acquire any capturing and that enables developers to use provenance for runtime checks and not only retrospective analysis thank you