
Efficiently Answering Regular Simple Path Queries on Large Labeled Networks


Formal Metadata

Title
Efficiently Answering Regular Simple Path Queries on Large Labeled Networks
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
A fundamental query on labeled graphs is to determine whether there exists a path between a given source and a given target vertex such that the path satisfies a given label constraint. One of the most powerful ways of specifying label constraints is through regular expressions, and the resulting problem of reachability queries under regular simple paths (RSP) forms the core of many practical graph query languages, such as SPARQL from W3C, Cypher of Neo4j, Oracle's PGQL, and LDBC's G-CORE. Despite its importance, since answering RSP queries is known to be NP-hard, there are no scalable and practical solutions for answering reachability with the full range of regular expressions as constraints. In this paper, we circumvent this computational bottleneck by designing a random-walk-based sampling algorithm called ARRIVAL, which is backed by theoretical guarantees on its expected quality. Extensive experiments on billion-sized real graph datasets with thousands of labels show ARRIVAL to be 100 times faster than baseline strategies, with an average accuracy of 95%.
Transcript: English (auto-generated)
Good afternoon, everyone. I'm Sayan Ranu, and I'll present our work on answering reachability queries on large networks under constraints expressed by regular expressions, focusing in particular on labeled networks. This work was done with four other authors,
as you can see on this slide. So let us first understand what a reachability query is. The problem is extremely simple. We are given a graph, a source vertex, and a destination vertex. Our goal is to answer whether there is a path
from the source to the destination. It's a very simple query. So the answer is always either yes or no. If you look at this particular graph over here, you can clearly see that there is a path from the given source vertex V1 to V5, and therefore the answer is yes. But not all networks are as boring as this one.
There are often networks that have labels or attributes characterizing the nodes or the edges. For example, in a social network, you may have labels characterizing the users, such as their age, their country, and so on. So there are many problems where you not only want to identify whether there is a path
from the source to the destination, but also to check whether that path satisfies some constraint on the labels along it. You can specify this label constraint in multiple forms, but the most generic way of specifying it
is through regular expressions. Now, looking at past work in this area, reachability queries with label constraints, a lot of work has been done. The first work appeared in 1995, and work has continued until 2017
and beyond, not only in academia; we also see three prominent graph querying engines from industry that address this problem. So, with so much work done in this space, why revisit the problem? What is left to be done? Let us look at some of the issues
that still need to be addressed. The first issue is the non-scalability of many of these techniques when the number of unique labels in the network is very large. Why? Because most of these techniques employ an index structure to ensure
that querying times are fast. The memory cost of these index structures often grows exponentially with the number of unique labels in the network. What we show here is the size of the index structure against the number of unique labels
for the state-of-the-art technique published in SIGMOD 2017. What you can see very clearly is that, first, it grows exponentially, and second, even with just 10 labels the index size reaches around 12 GB. In real networks, for example social networks such as Google+, the number of labels is 17,000,
and in Freebase, which is a knowledge graph, the number of unique labels is approximately 7,500. So clearly, for these kinds of networks, the majority of existing techniques are not good enough. Let's look at the next limitation.
Here, we look at what kinds of label constraints can be specified using most of the existing techniques. The majority of them handle one particular kind of regular expression, which is essentially whitelisting or blacklisting. Let me elaborate. You are given a source vertex, a destination vertex,
and a set of allowed labels, and you need to search for paths such that all labels on the path come from that given input set. These are the whitelisted labels. Now, is this a realistic assumption? To answer that question, we looked into the Wikidata query log
and examined the distribution of the different regular expression constraints posed there. As you can see, the constraint I just described is the most popular one, no doubt about it, but it constitutes only about 65% of all the regular expression constraints
identified in the Wikidata query log. That means we are still unable to answer 35% of the queries. Therefore, our second goal is to come up with a technique that generalizes across all these different regular expressions, so that we are not limited by what current algorithms
give us the power to answer. And finally, most of the existing techniques depend on an index, and therefore, if the network changes, which is not uncommon for, say, social networks or even knowledge graphs,
you need to rebuild the index from scratch. It would be much better if we did not need to rebuild everything from scratch and were able to adapt to changes in the network. So how do we solve all three of these problems? This is exactly where we come in, with a technique called ARRIVAL,
and the key idea we exploit in ARRIVAL to address these bottlenecks is to go index-free. That is, we do not employ any index structure, and still we are able to answer queries quickly and accurately. So let us go back to the problem formulation
in the presence of a regular expression, which is the constraint we need to satisfy. Just as we mentioned earlier, we have a source vertex and a destination vertex, and we are also given a regular expression, which specifies whether a path is valid or not.
We'll look at this particular graph to understand the problem a little better, and we'll study it with the regular expression A*BA*. Let us first consider the path highlighted in red, from V1 to V5. Is this a valid path?
Well, it is not. Why? Although this path connects V1 to V5, it does not follow the regular expression A*BA*. So with this path, we cannot say that V5 is reachable from V1. Let us look at another path: V1, then V3, then V2, back to V1, and then V5.
First question: does this satisfy the regular expression? Yes, in this case it does satisfy A*BA*. But is it therefore a valid path? No, because we are specifically looking for simple paths, that is, paths that do not
contain a cycle or repeated vertices. So again, this path is not enough to conclude that V5 is reachable from V1. Now we look at the third path, highlighted in green: V1, V3, V2, V4, V5. This path is simple,
with no repeated vertices, and it does satisfy the given regular expression. So with this path we can now conclude that V5 is indeed reachable from V1. This is the simple problem we are trying to solve.
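To make this validity check concrete, here is a minimal sketch in Python. The slide's actual edge labels are not spelled out in the talk, so the toy graph below is an assumption chosen to reproduce the three outcomes just described; `is_valid_rsp` and the vertex names are purely illustrative.

```python
import re

# Hypothetical edge labels, chosen only to match the three outcomes above;
# the talk's slide does not spell them out. Maps (u, v) -> label.
EDGE_LABEL = {
    ("v1", "v5"): "A",                      # red path: no B, so A*BA* fails
    ("v1", "v3"): "A", ("v3", "v2"): "B",
    ("v2", "v1"): "A",
    ("v2", "v4"): "A", ("v4", "v5"): "A",
}

def is_valid_rsp(path, pattern):
    """A path is a valid regular simple path iff it is simple (no repeated
    vertices) and its concatenated edge labels match the regular expression."""
    if len(set(path)) != len(path):
        return False                        # repeated vertex: not simple
    labels = []
    for u, v in zip(path, path[1:]):
        if (u, v) not in EDGE_LABEL:
            return False                    # not even an edge in the graph
        labels.append(EDGE_LABEL[(u, v)])
    return re.fullmatch(pattern, "".join(labels)) is not None

print(is_valid_rsp(["v1", "v5"], "A*BA*"))                    # False: no B
print(is_valid_rsp(["v1", "v3", "v2", "v1", "v5"], "A*BA*"))  # False: v1 repeats
print(is_valid_rsp(["v1", "v3", "v2", "v4", "v5"], "A*BA*"))  # True
```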
So what is difficult about it? Well, this problem is actually NP-complete, as shown by Mendelzon and Wood in 1995. Now, I want to be precise here: it is not NP-complete for every specific regular expression. The hardness depends on the regular expression for which you want to solve the problem, and there are some regular expressions for which it is not NP-complete.
But since we are targeting any possible regular expression, we do run into NP-completeness. So how do we solve this? We come up with a sampling-based approach. The algorithm is very simple, and it uses a bidirectional random walk. So what do we do in our algorithm?
We initiate two random walks simultaneously from two different directions: one from the source vertex, the other from the destination vertex. From the source vertex, the random walk jumps to any of the neighbors, while in the other random walk, from the destination,
the walk moves in the backward direction; that is, if there is an edge from U to V, we are allowed to jump from V to U. Which jumps are allowed? We have two constraints. First, any jump must ensure that the path being constructed iteratively in the random walk remains simple.
Second, a jump must ensure that the regular expression is not violated. For example, if you are working with the regular expression A*BA* and a jump from a particular node takes the label C, that violates the regular expression. We are allowed only jumps that go through a label
without violating the regular expression. These are the two constraints on the allowed jumps. How long should the random walk continue? It cannot continue indefinitely, right? There are three cases in which we stop the random walk. First, if we hit a dead end, by which I mean we reach a node from which there are no allowed jumps,
as just defined. Second, the walk exceeds a specified maximum length threshold, which we call the walk length; this is a hyperparameter that we need to set. And third, if we reach a stage where the forward path and the backward path meet to construct a path
that is simple and satisfies the regular expression, then we know there is a reachable path from the source to the destination, and we conclude. If we stop a random walk due to the first two conditions, that is, if we have not yet concluded reachability, then we repeat this procedure up to num-walks times, where num-walks is the second hyperparameter we need to set in this algorithm.
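Here is a minimal sketch of this sampling loop in Python. For readability it walks only in the forward direction, whereas ARRIVAL runs the forward and backward walks simultaneously; the hand-built DFA for A*BA*, the toy adjacency list, and all function names are illustrative assumptions, not the paper's implementation.

```python
import random

# Hand-built DFA for A*BA* (an illustrative assumption):
# state 0 = "no B seen yet", state 1 = "exactly one B seen" (accepting).
DFA = {(0, "A"): 0, (0, "B"): 1, (1, "A"): 1}
ACCEPTING = {1}

# Toy labeled adjacency list: u -> [(neighbor, edge label), ...].
ADJ = {
    "v1": [("v3", "A"), ("v5", "A")],
    "v3": [("v2", "B")],
    "v2": [("v4", "A"), ("v1", "A")],
    "v4": [("v5", "A")],
    "v5": [],
}

def one_walk(source, target, walk_length):
    """One random walk obeying both jump constraints: the path must stay
    simple, and the DFA must always have a transition for the edge label."""
    path, state = [source], 0
    for _ in range(walk_length):                     # walk-length budget
        u = path[-1]
        allowed = [(v, DFA[(state, lab)]) for v, lab in ADJ.get(u, [])
                   if v not in path and (state, lab) in DFA]
        if not allowed:
            return None                              # dead end: stop this walk
        v, state = random.choice(allowed)
        path.append(v)
        if v == target and state in ACCEPTING:
            return path                              # regular simple path found
    return None                                      # budget exhausted: stop

def is_reachable(source, target, walk_length=4, num_walks=50):
    """Repeat independent walks; any success implies reachability."""
    return any(one_walk(source, target, walk_length) for _ in range(num_walks))

print(is_reachable("v1", "v5"))  # almost surely True via v1-v3-v2-v4-v5
```

Note that both jump constraints are enforced in one place: a neighbor is eligible only if it keeps the path simple and the DFA has a transition for its edge label.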
I'll now explain this with an illustrative example. We have this particular graph, and we also have a regular expression, whose automaton is shown here.
To maintain all the necessary information, we use two helper data structures. We have two hash maps, one for the forward walk and one for the backward walk, and we also store the paths sampled so far by the random walk procedure. Why? Because we need to ensure that the paths are simple.
Let us look at how these hash maps are used. Initially, we have just two entries. The forward hash map contains the entry (v1, null), meaning we are at the source vertex and have not yet reached any automaton state. The value records the walks
in which this particular tuple was observed, which here is zero, because this is the first walk we are performing. The same holds for the backward walk. Now let us start the procedure and take a jump: in the forward walk, we go from v1 to v2, and in the backward walk, we go from v5 to v4.
Why v5 to v4? Remember, the backward walk moves in the backward direction. Now look at the entries here. The moment we jump to v2, we move to the q1 automaton state in the forward walk, while in the backward walk
we move to the q2 automaton state. These are therefore the two entries in the hash maps, and the corresponding value is, of course, the index of the walk that created each tuple. We continue the procedure and take one more jump:
we go from v2 to v4 in the forward walk and from v4 to v2 in the backward walk. Notice that the two walks now meet at a particular vertex, v2. But does that mean we are done? Actually, no. Why? Because in the forward walk
we are at the q1 state in the automaton, whereas in the backward walk we are at the q2 state, which means that although the walks meet, together they do not satisfy the regular expression. So we are not done yet, and we need to continue. Can we extend this random walk further? No, because the walk length allows only three nodes in a particular walk.
So this walk has terminated, and we initiate a new walk. Notice that this entry now has 1 added to it, because this tuple was also found in the second walk. We go from v1 to v3 in the forward walk, and from v5 to v6 in the backward walk.
When we reach v6 in the backward walk, we hit a dead end: there is no incoming edge to v6, so we cannot proceed any further. However, the forward walk can continue. So the process continues: in the backward walk we start another new walk from v5, while in the forward walk we go from v3 to v2.
Now notice the following scenario. The moment we go from v3 to v2, we reach v2 at the q2 automaton state, and this tuple was already visited in the backward walk, in the very first walk. So the two walks do not need to meet in the current walk; they can meet across walks,
mixing and matching. That is what happens here, which means that if we join these two paths, they will satisfy the regular expression constraint given to us. We therefore join the forward path v1, v3, v2 with the first backward walk, v2, v4, v5, taken in the opposite direction.
Then we check: are they simple? If they are, we are done. If not, we keep doing this up to num-walks times. Essentially, this is a sampling algorithm: we sample some paths from both directions, and if we are able to connect them within the sample set, we conclude that the target is reachable.
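This hash-map bookkeeping can be sketched as follows. Consistent with the example, I assume a meet is detected when both directions have recorded the same (vertex, automaton state) tuple, possibly in different walks; the names here are illustrative, and the exact state-compatibility rule in ARRIVAL is more involved than this simplification.

```python
from collections import defaultdict

# (vertex, automaton state) -> {walk index: partial path that reached it}.
forward_seen = defaultdict(dict)
backward_seen = defaultdict(dict)

def record_and_check(vertex, state, walk_id, path, is_forward):
    """Record this visit, then test every opposite-direction visit of the
    same (vertex, state) tuple; a joined path that is simple proves
    reachability, even when the two halves come from different walks."""
    own = forward_seen if is_forward else backward_seen
    other = backward_seen if is_forward else forward_seen
    own[(vertex, state)][walk_id] = list(path)
    for other_path in other.get((vertex, state), {}).values():
        fwd, bwd = (path, other_path) if is_forward else (other_path, path)
        joined = fwd + bwd[::-1][1:]        # count the shared vertex once
        if len(set(joined)) == len(joined):
            return joined                   # simple: target is reachable
    return None

# Mirroring the example: backward walk 0 reached v2 in state q2 (via v5, v4);
# later, forward walk 2 reaches v2 in the same state q2 (via v1, v3).
record_and_check("v2", "q2", 0, ["v5", "v4", "v2"], is_forward=False)
print(record_and_check("v2", "q2", 2, ["v1", "v3", "v2"], is_forward=True))
# -> ['v1', 'v3', 'v2', 'v4', 'v5']
```

Because the backward path is stored in visit order from the target, reversing it yields the forward orientation, and dropping its first element avoids counting the meeting vertex twice.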
If we are unable to find such a path, we conclude that the target is not reachable. The first question that would come to your mind is: you have these two important hyperparameters, walk length and num-walks; how do I set them? We have certain guidelines for this, and I would stress that they are guidelines.
They do not give you a quality guarantee, but they tell you that by setting these particular values, you are likely to get good accuracy. So we are not giving any worst-case guarantees as such, but we do guide you on how to set them. For the details, please come to our poster. It is in the very last session of the conference, so we entice you to come to our poster
and not leave the conference early. For an even more in-depth understanding, please read our paper. We'll now move on to the experiments. Here are the different datasets that we used; you can see that they come from different domains.
Google+ and Twitter are from the social network domain, DBLP is a collaboration network, Freebase is a knowledge graph, and Stack Overflow is a communication forum, which is a dynamic network. The second point I want to highlight is that Twitter is really large: it contains about two billion edges, so we actually want to show that we scale.
And finally, this particular column is very important, because most of these networks contain thousands of labels. Stack Overflow is an anomaly here, but the other networks all contain a high number of labels, at least 500. We have labels on nodes as well as on edges,
and you can specify the constraint by mixing and matching both node labels and edge labels. What are the baselines? We consider three different baselines. One is exhaustive search, which we perform in a bidirectional manner using breadth-first exhaustive exploration (BBFS).
We also compare with two state-of-the-art techniques, by Valstar et al. and Koschmieder et al. In this talk, I will focus mostly on BBFS as the baseline, because it is the only technique that we could apply
to all the datasets; the other techniques have certain limitations in terms of scalability. But those experiments are in the paper. How did we come up with the regular expression queries? Again, we go back to the Wikidata query log and choose them from the same distribution of query types that we see
there. So the first question is accuracy: ours is an approximate algorithm, so how do we do in terms of quality? The first point I would like to highlight is
that precision is guaranteed to be one: there are no false positives. Why? Because when we say that the destination is reachable from the source, we have actually found a path that satisfies the regular expression constraint, so that path actually exists in the network. A "yes" means it is there.
But there could be false negatives, and therefore the accuracy we measure here is recall. What you see is that across all networks the recall is fairly high: 0.92 on average. The lowest was on Stack Overflow, where it was still close to 0.85,
that is, more than 85%. Now, is this only because we do well on some specific query type? Not really. We measured performance across multiple different query types, which together cover 97% of all queries posed in the Wikidata query log, and we still perform as well as on the previous slide.
What about speedup? We are 100 to 500 times faster than BBFS, the exhaustive search, depending on the dataset. This result also persists across different query types, not just on average.
You might be wondering how fast it is in terms of absolute running time, not just speedup. Well, we answer queries in milliseconds, and this is on Twitter, the largest network used in our experiments. So I want to leave you with the following key takeaways. The weaknesses we addressed in this work are: we want to scale to networks
with a large number of labels, we should be able to handle dynamic networks, and we want to support any regular expression, including all the regular expressions found in the Wikidata query log. How did we manage to do this? Using a sampling-based approach that is index-free, which allows us to address these weaknesses.
Finally, the code is available. We have optimized it further since we reported these results, so you may actually see better results than those reported in the paper. I welcome you to go and use the code, and hopefully you will get good results. Thank you.