Ariadne: Online Provenance for Big Graph Analytics

Video in TIB AV-Portal: Ariadne: Online Provenance for Big Graph Analytics

Formal Metadata

Ariadne: Online Provenance for Big Graph Analytics
Title of Series
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
Data provenance is a powerful tool for debugging large-scale analytics on batch processing systems. This paper presents Ariadne, a system for capturing and querying provenance from Vertex-Centric graph processing systems. While the size of provenance from map-reduce-style workflows is often a fraction of the input data size, graph algorithms iterate over the input graph many times, producing provenance much larger than the input graph. And though current provenance tracing procedures support explicit debugging scenarios, like crash-culprit determination, developers are increasingly interested in the behavior of analytics when a crash or exception does not occur. To address this challenge, Ariadne offers developers a concise declarative query language to capture and query graph analytics provenance. Exploiting the formal semantics of this datalog-based language, we identify useful query classes that can run while an analytic computes. Experiments with various analytics and real-world datasets show the overhead of online querying is 1.3x over the baseline vs. 8x for the traditional approach. These experiments also illustrate how Ariadne's query language supports execution monitoring and performance optimization for graph analytics.
Graph (mathematics) Graph (mathematics) Magneto-optical drive System programming Analytic set 8 (number) Identity management
Area Graph theory Software Natural number Interactive television Representation (politics) Computer network Graph theory
Scale (map) Programming paradigm Graph (mathematics) State of matter Graph (mathematics) Model theory Software developer Web page Analytic set Parallel port Parallel computing Process (computing) Facebook Googol System programming Computer programming System programming Medizinische Informatik Vertex (graph theory) Graph theory
Web page Graph (mathematics) Multiplication sign Online help Function (mathematics) Entire function Scalability Order of magnitude Crash (computing) System programming Ranking output Graph theory Scale (map) Graph (mathematics) Scaling (geometry) Software developer Web page Expression Mathematical analysis Analytic set Motion capture Line (geometry) Scalability Entire function Graph theory Process (computing) System programming output Resultant
Web page Model theory Graph (mathematics) Mereology Declarative programming Performance appraisal Declarative programming Computer programming Query language System programming Ranking Vertex (graph theory) Message passing Sequence diagram Form (programming) Graph (mathematics) Information Inheritance (object-oriented programming) Software developer Web page Model theory Analytic set Scalability Formal language Performance appraisal Message passing Personal digital assistant Query language Shader <Informatik> Vertex (graph theory) Iteration Ranking Declarative programming Sequence diagram
Expression Graph (mathematics) Scaling (geometry) Multiplication sign Traverse (surveying) Number Formal language Query language Performance appraisal Query language Medizinische Informatik Vertex (graph theory) Process (computing) Recursion Message passing Graph (mathematics) Inheritance (object-oriented programming) Analytic set Instance (computer science) Evolute Formal language Message passing Query language Time evolution System programming Vertex (graph theory) output Recursion Sequence diagram Extension (kinesiology)
Expression Graph (mathematics) Graph (mathematics) Scaling (geometry) Parallel port Analytic set Instance (computer science) Formal language Query language Performance appraisal Message passing Performance appraisal Query language Query language System programming Computer science System programming Process (computing) Eccentricity (mathematics) Recursion Extension (kinesiology)
Graph (mathematics) Link (knot theory) Analytic set Approximation Approximation Formal language Mathematics Message passing Personal digital assistant Query language Query language Vertex (graph theory) Medizinische Informatik Iteration Message passing Sequence diagram Mathematical optimization Annihilator (ring theory) Sequence diagram
Building Graph (mathematics) Robot Translation (relic) Mereology Approximation Performance appraisal Query language System programming Ranking Vertex (graph theory) Eccentricity (mathematics) Mathematical optimization Graph (mathematics) Information Block (periodic table) Software developer Web page Motion capture Approximation Query language Shader <Informatik> System programming Vertex (graph theory) Mathematical optimization Declarative programming Resultant
Web page Run time (program lifecycle phase) Overhead (computing) Identifiability Structural load State of matter Graph (mathematics) Multiplication sign Motion capture Approximation Entire function Order of magnitude Order (biology) Performance appraisal Read-only memory Semiconductor memory Computer programming Query language System programming Ranking Medizinische Informatik Vertex (graph theory) output Mathematical optimization Eccentricity (mathematics) Social class Default (computer science) Overhead (computing) Graph (mathematics) Run time (program lifecycle phase) Software developer Web page Order of magnitude Approximation Entire function Performance appraisal Graph theory Query language Function (mathematics) Series (mathematics) System programming output Mathematical optimization Resultant Inverter (logic gate) Reduction of order Asynchronous Transfer Mode
Quark Graph (mathematics) Structural load Graph (mathematics) Multiplication sign Direction (geometry) Fault-tolerant system Entire function Entire function Performance appraisal Order (biology) Message passing Performance appraisal Query language Inference Query language Vertex (graph theory) Social class
Web page Area Mobile app Graph (mathematics) Table (information) Inheritance (object-oriented programming) Graph (mathematics) Correspondence (mathematics) Web page Analytic set Performance appraisal Proof theory Message passing Performance appraisal Query language Query language Vertex (graph theory) Ranking Medizinische Informatik Table (information) Message passing Sequence diagram
Point (geometry) Web page Overhead (computing) Run time (program lifecycle phase) Graph (mathematics) Multiplication sign Range (statistics) Mathematical analysis Analytic set Mereology Diameter Performance appraisal Degree (graph theory) Semiconductor memory Query language Energy level Cuboid Spacetime Ranking Mathematical optimization Overhead (computing) Graph (mathematics) Web page Mathematical analysis Approximation Entire function Graph theory Performance appraisal Word Query language Reduction of order Spacetime
Graph (mathematics) Run time (program lifecycle phase) Software developer Graph (mathematics) Run time (program lifecycle phase) Multiplication sign Software developer Mathematical analysis Analytic set Mathematical analysis Scalability Declarative programming Scalability Performance appraisal Performance appraisal Personal digital assistant Software framework Software framework Declarative programming
but in about system that adenine but which offers online provenance
for big graph analytics this is joint work with my advisor selling Dighton can yoke
so graphs are a natural representation in many areas to denote entities and relationships between them graphs are important we use them in our everyday activities when we connect with friends when we buy online we use them in a
research tool discovered jack interactions to map the human brain network the graphs that we're interested in this
work are big the social state has 1 comma decimal 4 billion Veradis's their web-scale has 40 would be on the verge this again and the human brain has 100 trillion edges to assist
developers in a developing big graph analytics we have a big graph processing systems that offer distributed and parallel computation in their majority they follow the verdict centric programming model in which computations happens in parallel on Beate says and the systems are mature they object like by industry and are able to skate trillions of edges now we have
systems that assist developers and building the graph analytics but there's not too in help developers tuning their analytics so let's see an example we have Pei-chun that is running on a bird centric system on an input graph that the user provides and we get the output of page on now what developers would like question that we would like to have answered what if they're a malformed inputs in the input graph what if there are missing values or unexpected findings more over they would like to answer questions about whether there about in there elliptic or even more importantly what if the analytic is inefficient and how can we optimize it finally they have questions about the output of the analysis what is there correct values what if the output is of low quality how can we improve the quality or what if there are absent violence the current this is in very time consuming process but we have provenance which can help us answer these questions however
prominence on big graph analytics introduces some special challenges the first one is scalability the prominence for big graph analytics is orders of magnitude larger than the input graph and the traditional approach of capturing everything and tracing of line is not scalable more over traditional tracing it does not scale because the graphs are interconnected and the result of a tribes trace might be the entire input graph the other challenges we have is on expressiveness traditional tracing is not customizable and we see that the developers of the graph analytics are interested in scenarios that go way beyond dividing so they're interested in tuning the analytics in scenarios that have not involved the crash that's why in this paper we
present body at me we introduce a former provenance model for big graph analytics we propose a declarative language that allows developers to customize capturing and querying to their needs we offer a standard provenance squaring in the form of layered offline evaluation and we introduce a novel evaluation methodology for online aquarium that does not require capturing any provenance and we show how provenance can be used in novel use cases that go beyond debugging
so as we said we are gonna be we're interested on big graph analytics that are running on verdict centric systems so that that's see the verdict centric model through the example of state-run so and avert extended model that we have the vertices that compute the verdicts program each vertex computes the same vertex program at Super steps or iterations so we can divide the verdicts program in the basic steps and the 1st step a verdicts receives messages from its neighbors it computes its new page on having received the rank of its neighbors and then sends messages to its neighbors and this happens across iterations so we have superstep I I plus one I passed to the same bird it's program so in such a setting what is the information that we would be we would like to have as part of provenance so I would like to have the information of when a verdicts computed what was the verdicts value and what were the messages that sent and received now in order to represent this provenance the requirement that we have is we want it to be able to want to be able to use it in a distributed and parallel fashion and we wanted to be a to declarative query so that's why we
introduce the provenance graph we have nodes and the Provenance graph that the presented vertex computations are the super stepped so we have the notes for instance that the note vertex a computing answer burst that I I plus 1 and I passed to we have evolution edges that connect nodes that the presented subsequent computations of the same vertex and we have many such edges that are annotated with the message values and our connecting noted that of presented neighbors in the input graph that communicated between each other and send or receive messages between each other now the size of the provenance graph is the input graph times the number of super steps of the job analytic computed now
what we would like to query this provenance graph we model problems is a graph the queries that we want to run on it unbounded graph traversal hence our language must you must support recursion moreover since the problems graph is a big graph we need distributed and
parallel evaluation that's why we're going to use vertex centric Datalog which we introduced in a prior work essentially what verdicts 108 data that does it take stock queries and compiles them into graph analytics that can run on bird eccentric systems and we extended bird extended Datalog with the provenance vocabulary so that it was can be fair to for instance messages sent verdict survives
the super steps etc. so now that we
have the the graph and the language it's see an interesting link or we use case that continues provenance for so the idea is approximate optimization and essentially what we want to see is if we
can suggest for fixed-point graph analytics if it's possible for them to sacrifice accuracy for speed so essentially iterate were less iterations by introducing a small amount of air in the finer design and how this is achieved vertices essentially message their neighbors only if they experience a lot of faith in their values and if a verdict that receive any message from its neighbors it's not going to compute in a superstep so here we see an example of the death of query that does this approximate optimization I'm not gonna go over it you can see the details in the paper but basically what the query does it identifies Myrtice's vertex computations that are safe or unsafe to be skipped so now I have
presented the basic building blocks of now let's see how we can put them to use so we have paid trunk again running on the bot eccentric system and we have our developer that provides a query where she specifies declaratively what information she wants to have a part of the province she gives this query to Adi Avenue and Avenue compiles sent into a pick a vertex program that runs alongside a child and results in capturing the provenance graph that in the next step
and she said the developer can load the provenance graph using only avenue into the verdict centric system an issue the approximate optimization query to analyze this provenance graph again not adding translates as query into a vertex problem evaluates this on the verdicts on the provenance graph and the result we have typically results and now this is all flank wearing we capture 1st and carry offline however as we said earlier on introduces a
novel evaluation methodology that does not require capturing which we coined online querying so we have pay child and the developer now issues immediately the the approximate optimization query she wants to run she doesn't capture any provenance she gives this query to Adi Abney of me comprises into the buttocks program and at the end of the computations we have both the result of page on end the stick query results so both provenance wearing and Pei-chun are executed simultaneously now online provenance evaluation actually offers great performance benefits as it reduces the runtime overhead from 3 to 8 times to 1 comma decimal 3 times so let's see
how we achieve and how party and the state offline Provence offline provenance querying and how it achieves online wearing so the default mode of evaluating invert eccentric systems is to load the entire graph in memory however for provenance graphs so this is not scalable and because the provenance graph orders of magnitude larger than the input graph so the question is can we do better and the answer is yes we identify classes of queries that can be evaluated layer at a time so what we do
essentially is we take the province drafting and divided into layers and then we have query evaluation assigned bordering on these layers so that all nodes in 1 layer of the problems that evaluate 1st the nodes in the next layer evaluate and so on and this allows us to avoid materializing the entire provenance graph but rather evaluate only on 1 layer at a time now the question is can we infer syntactic i.e. but just by looking at the queries for which query such a layer at a time evaluation is possible and this
work we identified 2 classes of queries we have the Parker dated queries in which the layers of the provenance Draft are loaded in the opposite direction then and computed so 1st we load the layer layer layer I and evaluate the query on these notes then they send messages to layer I minus 1 and we have the nodes in layer I minus 1 evaluating the query we also have the quark forward were directed queries query grace in which we load the provenance graph layers in the same direction the analytic computed so 1st we have the notes and later I minus 1 evaluating the sending messages then to the nodes and later I and then the notes and later I evaluate now for forward direction
queries actually we see that there's a correspondence between the apical evaluation that happens on the layer of a provenance graph and to the vertices that compute at that corresponding so as we said in the forward directed queries we have the nodes and later I minus 1 value and sending a message to the notes and later I in pay chance we have vertices B C and b that correspond to the notes in layer II minus 1 computing Pei-chun and sending their messages you know date which corresponds to the northern layer II and
this allows us to actually evaluate simultaneously both pickle and analytic now the proofs are in the paper and you can look into how we achieve this and prove that it is correct but basically what area the dust is it depends making evaluation to pay chance so that at every supers that every vertex evaluates simultaneously both the patient and take a more over I have any app pens to the messages that the vertices would normally send with the wrong it upends also picnic tables that are there is either of the query evaluation and the messaging happens as it would normally happen in page now let's move
on to the experiment so we experimented with 3 other word graphs whose size ranges from 200 million to 1 billion edges we experimented with for graph Analytix and you can find an extensive experimentation sued in the paper at the height of the outcomes of experiments are that Adi Avenue reduces space and time overhead of capturing IT and improve squaring through layer at evaluation offline evaluation and online evaluation an idea and supports the traditional forward and backward in the early nach tracing but also of novel analysis queries such it will as always on monitoring and approximate optimization here I'm going
to show and the runtime of evaluating the approximate optimization query on page on so we had the blue box that shows Pei-chun without any provenance we have that pink part which is Pei-chun running alongside online thickly evaluation in which the overhead is 1 point 3 times we have the green part in which we have offline layer at a time evaluation which introduces an overhead of 4 times and we have offline querying in which we use the straightforward approach of loading the entire provenance graph in memory and we see that the red bar is not able to skate to the level 2 largest datasets so and with this I
conclude my talk so I presented to you a declarative framework for prominence for big graph analytics we introduce we show the provenance can be used in Nobel use cases that go beyond traditional dividing and I show to you that we offer scalable offline evaluation through layer at a time evaluation an online evaluation with that that most that does not acquire any capturing and that enables developers to use provenance for runtime checks and not only retrospective analysis thank you


  420 ms - page object


AV-Portal 3.21.3 (19e43a18c8aa08bcbdf3e35b975c18acb737c630)