Current Challenges in Graph Databases (Invited Talk)


Formal Metadata

License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Abstract
As graph databases grow in popularity, decades of work on graph query languages and models are materialising in industry standards and in the construction of new graph database systems. However, this surge in graph systems has in turn opened up a series of new, interesting research problems related to graph databases.

Our first set of problems has to do with more efficient ways of computing the answers of graph queries, specifically graph patterns, path queries, and combinations of them. Traditionally, researchers in graph databases have pointed out that relational systems are ill-equipped to process these types of queries, and if one looks at the performance of native graph database systems, there is clearly a lot of room for improvement. The talk focuses on two possible directions for improving the state of the art in graph query processing. The first is implementing worst-case optimal algorithms for processing graph patterns that translate into relational queries with several joins. Some advances are already in development (see e.g. Nguyen, Dung, et al. "Join processing for graph patterns: An old dog with new tricks." GRADES'15, or Hogan, Aidan, et al. "A Worst-Case Optimal Join Algorithm for SPARQL." ISWC'19), but we are still far from a full-fledged solution: most algorithms require complex data structures, or need further support in terms of heuristics to select an order in which joins are processed. Second, we need to understand the best way of evaluating path queries (that is, finding all pairs of nodes connected by a path) in such a way that these results can be further integrated with other query results in a graph system pipeline. We already have complexity results regarding path computation and enumeration for different semantics of path queries (see e.g. Martens, Wim, and Tina Trautner. "Evaluation and enumeration problems for regular path queries." ICDT'18, or Bagan, Guillaume, Angela Bonifati, and Benoit Groz. "A trichotomy for regular simple path queries on graphs." PODS'13), but still very little is known about the optimal processing of path queries inside a tractable fragment.

Our second set of problems is related to graph analytics, one of the current selling points of graph databases. Systems should be able to run more complex analytical queries involving tasks such as more complex path finding, centrality, or clustering. It is also important to be able to run these algorithms not only over the native graph, but also over a set of nodes or edges previously selected by a graph query, and one may also want to pose further queries over the result of the analytics task. Finally, all of this should be done efficiently, especially given the prospect that graph databases may contain a huge number of nodes. In this talk I will discuss possible approaches to performing these operations, covering aspects from the design of languages for graph analytics to efficient ways of processing them, and also comparing the expressive power of graph analytics solutions with other forms of graph computation.
Transcript: English (auto-generated)
Hi, everyone. I'm Juan. Thanks for joining my tutorial, and I hope everyone is okay given the current circumstances. I'm filming this from my flat, of course, but I'm here to tell you about research problems for graph databases. Let's get into this. Why graph databases, and why now? This is a topic that has been around for a long time, but it is not slowing down. There are more and more systems appearing all the time: we go from the older RDF and SPARQL systems, to Neo4j, which has been around for some years now, to new systems appearing all the time; the latest ones are Amazon Neptune, TigerGraph, and so on. There is a lot of movement, and it doesn't seem that this will slow down anytime soon. Really, we expect this to grow faster in the next years. There
has been a lot of research as well from our community and from other communities regarding graph databases, but my personal feeling is that research has focused on different parts of what you would need in a graph database system, but not so much on
putting everything together. Here in Chile, we did the exercise of thinking about how a graph database system would look if it actually used all this cool stuff that we have developed in our community over the last years. This was a nice exercise, but it was not that easy; it actually created a lot of new problems that we had to think about, that we are still thinking about, and that I want to communicate to you. Hopefully you will be as excited as I am about these new problems, and this will open up new lines of research. I don't have much time, of course, but I'm going to comment on three aspects. First, I'm going to talk a little bit about query languages and what is used in practice, in real life, what's out there. This will lead us to algorithms, which I think is one of the areas where we are lagging the most right now. I'm going to finish with analytics, which I also think is the future and for which we have reasonable ways of attacking the problem right now. Just a small disclaimer: of course, I cannot cover everything. This is not intended to be
like an encyclopedic tutorial. I don't know everything. My interest here in this tutorial is to try to engage you with new problems and hopefully try to produce fruitful discussions. This is my goal. Let's start with query languages. First of all, there are a lot of
graph models out there. The most common are RDF and property graphs. I'm not going to base the talk on any particular one. Just for simplicity, the graphs in this talk will simply be nodes with labeled edges connecting them together. The nodes represent our entities, and the labeled edges represent our relationships. The basic building block of graph languages is the pattern. This is present in pretty much every graph language that has been deployed. Patterns are simply small graphs that we want to match. In order to match them, we replace some nodes with variables; we could replace edges with variables as well. Then, if we evaluate the pattern over this example graph, the idea is that we look for all possible matches of the variables x and y such that this pattern is realized in the graph. Then this is a match, an answer of the pattern: Tom and Ann, the match for x and y. Of course, we can flip this the other way around and find the other match, mapping x to Ann and y to Tom. This is another answer to the pattern. So what is happening here is that we take this pattern, which is a small graph, and try to match it in several ways into the big graph, and the answer of this query is a table containing all possible matches.
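To make this concrete, here is a minimal Python sketch of this homomorphism-based pattern matching on a tiny hard-coded graph. The data, variable names, and the plain backtracking enumeration are illustrative only, not how any particular engine implements it.

```python
# Minimal sketch of homomorphism-based pattern matching over an
# edge-labeled graph.  Graph, labels, and variable names are illustrative.

GRAPH = [
    ("Ann", "friendOf", "Tom"),
    ("Tom", "friendOf", "Ann"),
    ("Tom", "follows", "Tim"),
    ("Ann", "follows", "Tim"),
]

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_pattern(pattern, graph, binding=None):
    """Enumerate all homomorphisms (variable bindings) of `pattern` into `graph`."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    (s, p, o), rest = pattern[0], pattern[1:]
    for gs, gp, go in graph:
        new, ok = dict(binding), True
        for term, value in ((s, gs), (p, gp), (o, go)):
            if is_var(term):
                if new.setdefault(term, value) != value:   # conflicting binding
                    ok = False
                    break
            elif term != value:                            # constant mismatch
                ok = False
                break
        if ok:
            yield from match_pattern(rest, graph, new)

# Pattern: ?x friendOf ?y  and  ?x follows Tim  ->  table of all matches
pattern = [("?x", "friendOf", "?y"), ("?x", "follows", "Tim")]
for row in match_pattern(pattern, GRAPH):
    print(row)   # {'?x': 'Ann', '?y': 'Tom'} and {'?x': 'Tom', '?y': 'Ann'}
```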
We could obtain other semantics if we restrict or relax the way that we're matching. Right now, I'm using standard homomorphism matching, but if you consider, for instance, injective
mappings, then you would have another semantics and so on. In real life, again, most query languages are based on patterns. I'm just going to give you two examples of how you would write this. This is Cypher. This is the query language of Neo4j. You can see that really the language is designed to specify patterns. This is truly the basic query mechanism of Cypher.
You just specify the edges of the pattern that you want to match. SPARQL here is not different. The syntax changes a little bit, but not the spirit. Again, you specify the edges of the pattern that we want to match. The pattern has four edges, and we're basically querying for four units of information here. The other important thing that I want to talk about is path queries. This is one of the key things that distinguishes a graph database from a relational database. The idea is that
in graphs, we are also interested in obtaining relationships that are not local to the neighborhood of a node but, in a sense, are global; they could be very far away. The simplest way to do this is to issue what is known as a regular path query: a pair of nodes connected via a path whose label is a word that belongs to this regular expression. Just to see some examples, below we have a path between Ann and Tim. The label of this path is friendOf follows; I'm just concatenating the edge labels together. Because friendOf follows belongs to the language of the regular expression, we say that Ann and Tim is an answer for this query. Likewise, Ann and Eve is also an answer for this query; the label of the path in this case is friendOf follows follows. It becomes more interesting when you merge these two things together: patterns on one hand and paths on the other. You can start querying for both local things and global things. This
is an example of such a pattern. It basically asks for some node that is followed by both Tom and Ann, but it relaxes this follows relationship and allows any path of follows edges. Tim would be an answer here because Tim is followed by both Tom and Ann, but Eve would also be an answer because there is a follows path reaching Eve from both Tom and Ann. This will be one of the key objects of my talk, because this is one of the parts where we can make more impact,
because not much is known about evaluating these things. I'm going to go into that later. So, again, most query languages in practice allow for complex patterns. Just going back to Cypher and SPARQL, the way that you write this pattern in Cypher is, as I'm showing, basically the same pattern as before. The only thing that changes is that now I'm putting a star on follows in the last two edges; it has the same meaning. In SPARQL, it works the same way.
You just specify the regular expression directly. Most query languages allow for complex patterns. This means that we need to take care of this, but there is one caveat: the semantics of paths differs between systems. So, let's imagine that now we have the expression friendOf+ follows, which means that we need at least one friendOf edge followed by a follows edge. There is a possible answer for this regular expression: a path that involves a repetition of Ann. You go from Ann to Tom via friendOf, then from Tom to Ann via friendOf, and then you go to Eve. What we defined before is known as the arbitrary semantics for path queries. In arbitrary semantics, you don't care about the path, so any path is good, but there are two other semantics which are interesting here. The no-repeated-node semantics disallows paths that repeat nodes. In this case, Ann and Eve would not be an answer to this regular expression: even though there is a path, there is no path that avoids repeating nodes and that satisfies the regular expression. There is also the no-repeated-edge semantics, which is the same thing but does not allow you to go through the same edge twice. In this particular path we do not repeat any edges, so Ann and Eve is again an answer to this regular expression.
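The difference between the three semantics can be made concrete with a small sketch. The graph below follows the example just described (Ann and Tom are mutual friends, Ann follows Eve), and the expression friendOf+ follows is checked naively on the path label; everything here is illustrative, not how Cypher or SPARQL engines implement it.

```python
import re

# Is there a path from Ann to Eve whose label matches (friendOf)+ follows,
# under each of the three path semantics?  Toy graph from the example.

EDGES = [
    ("Ann", "friendOf", "Tom"),
    ("Tom", "friendOf", "Ann"),
    ("Ann", "follows", "Eve"),
]
REGEX = re.compile(r"(friendOf;)+follows")   # labels joined with ';'

def has_matching_path(src, dst, mode, max_len=5):
    """Check for a matching path under `mode`; max_len only bounds the arbitrary case."""
    def rec(node, visited_nodes, used_edges, labels):
        if node == dst and REGEX.fullmatch(";".join(labels)):
            return True
        if len(labels) >= max_len:
            return False
        for edge in EDGES:
            s, lbl, o = edge
            if s != node:
                continue
            if mode == "no-repeated-node" and o in visited_nodes:
                continue
            if mode == "no-repeated-edge" and edge in used_edges:
                continue
            if rec(o, visited_nodes | {o}, used_edges | {edge}, labels + [lbl]):
                return True
        return False
    return rec(src, {src}, frozenset(), [])

for mode in ("arbitrary", "no-repeated-node", "no-repeated-edge"):
    print(mode, has_matching_path("Ann", "Eve", mode))
# arbitrary: True, no-repeated-node: False, no-repeated-edge: True
```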
So, Cypher allows for regular expressions on the edges, but it uses no-repeated-edge semantics, while SPARQL also allows for this but uses arbitrary semantics. So the way these systems behave is a bit different. This is also good news for us, of course, because it
means that we need to study different semantics and different algorithms, and things change between them. The last extension that I'm going to briefly talk about is the idea of manipulating paths. This is one more layer of complexity in terms of graph query languages. The rationale here is that these regular path queries, these regular expressions, don't always reflect what I want in a path. We can take the same pattern, but now ask, for instance, for adult followers. There is no way that we can check this relationship with this mix of patterns and paths; you can actually prove that. Basically, what happens is that you're looking for more complex paths: you're looking for paths, and you want to test properties along those paths. There is a lot of research in this area; I'm pointing here at a survey that has a lot of information. One can, of course, increase the expressivity of path languages, but they become much more involved. What users seem to end up preferring, and what happens in systems in the end, is that we just want to manipulate paths: we want to take these paths, extract them, and then produce a query that checks our conditions. We don't want more expressive path expressions, apparently, so it's not clear that stronger path languages would make up for this. This is just a short example of why systems
actually allow you to retrieve paths. So this is, again, a Cypher query, and it's matching a path. It's a strange functionality, but it basically says: the blue line says, well, I'm going to be interested in paths that go from x to Tim via follows edges. And the green part says, I'm going to force all x in this path, so all nodes participating in this path, to satisfy a particular property; it can be anything. And then I'm going to return this path. So the key thing here is that this creates an extra need: one must return paths. And there is a note here as well: possibly the reason why Cypher and Neo4j use this no-repeated-edge semantics is that, under this semantics, paths are finite, so you can actually talk about returning all paths. In arbitrary semantics, if you have a loop, then you have an infinite number of paths, so the task of retrieving all possible paths is a bit more complicated. There is also a standard, one of the two graph standard languages: this is G-CORE, from the Linked Data Benchmark Council. It also involves manipulating paths, and they are pushing for these standards to start being implemented in systems. Okay, so, algorithms. What have we learned, basically, with this very quick review
on graph query languages? So, there are three things that we need to be doing. We need to solve pattern matching fast, of course. This is the basic building block. We need to be able to match patterns and also use paths. This is not a basic building block, but it's something that's being used often, and where you can see that there is room for improvement,
and we need to return and manipulate paths. This is something which is a bit more obscure, but it's also something which is more challenging. So, I am not going to talk about this last item, because we already have a good understanding. So, you can start with this paper
by Martens and Trautner. Of course, there is still a lot to do, but we're already tackling these problems, and you can read this if you want more information. So, about improving algorithms for pattern-based queries. Again, this is the most important part,
and the key thing is that relational systems, they don't cope well with patterns, and the reason for this – the main reason for this is that when you translate the graph into a relational database and translate the pattern into a relational query, you end up with something
that has a lot of joins. This is not good for relational systems; they don't know how to deal with this. So, of course, if you want to build a graph database system, you have to get around this issue, and every system has its own option, but I'm going to talk today about something that has been studied in the community, though not so much for graphs. We have not seen it in real systems yet, and this is what we're actually trying: worst-case optimal joins. The basic idea behind these joins is that they can speed up queries
by opting out of the traditional left-deep plan, which is something along the lines of what we have here, opting instead for a different way of computing queries where they pick the pattern node by node. By doing this, you can shift from a quadratic algorithm, the left-deep plan, where at some point you have a quadratic intermediate result that you need to store in memory, to an algorithm that works, for this particular pattern, in the size of the graph to the power of 1.5, which is the AGM bound of the query: the maximum number of tuples that the query could produce. Just a very brief overview of how these algorithms work: it's
the idea of selecting the nodes of the pattern one by one. For instance, we can start with x, and we will iterate over each possible value a of x that we know could be a candidate answer for this query, so it has to satisfy the pattern below, which is like a sub-pattern of the complete thing. For each of these values, we put it into the pattern, and now we look for each node b that is a possible match for y in the resulting pattern. So now we look for all nodes that are connected to a and that can still be answers with respect to the other variables, and we continue like this until we run out of variables. In this case, at the end, we check for the values c that we can map to z in this query where we have already instantiated a and b. So, this is the basic idea.
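Here is a minimal sketch of this variable-at-a-time idea, in the spirit of worst-case optimal (generic) joins, for a triangle pattern x -> y, y -> z, x -> z. The toy edge set, the fixed order x, y, z, and the plain set intersections stand in for the index structures and ordering heuristics discussed next.

```python
# Variable-at-a-time enumeration of a triangle pattern x->y, y->z, x->z.
# The toy graph, the fixed order x, y, z, and the naive intersections are
# illustrative; a real system needs index support and an order heuristic.

EDGES = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("a", "d")}

out_nb = {}
for s, o in EDGES:
    out_nb.setdefault(s, set()).add(o)

def triangles():
    """Enumerate (x, y, z) with edges x->y, y->z and x->z, one variable at a time."""
    for x, x_nb in out_nb.items():            # candidate values for x
        for y in x_nb:                        # y must be an out-neighbor of x
            # z must be an out-neighbor of both y and x: intersect the two
            # neighbor sets instead of materializing an intermediate join.
            for z in out_nb.get(y, set()) & x_nb:
                yield (x, y, z)

print(sorted(triangles()))   # the two triangles ('a','b','c') and ('a','c','d')
```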
Of course, if you want to do this, you need to do it properly. You need some data structure support. In particular, if you want to implement this on graph databases, you're probably going to end up needing indexes on all relations with respect to both directions of the arrow: if I have a node, I need to be able to know all its neighbors, outgoing or incoming, under any possible edge label. And there are some more technical things: you need support for adaptive intersection, which is not necessarily something that systems can normally do. This is also important: you need to somehow decide that x, y, z was a good order for this query, because if we decide, for instance, to start looking for z first, we end up with a similar algorithm, but instead we look for all values of z, and for each of these values we produce a new pattern, and so on. This has been implemented, first for graph patterns and, last year, for RDF and SPARQL, and it produces quite good results, but those results were actually dependent on this good order
and on the particular index structure that the authors of this paper decided to implement, which means that there is a huge gap of unknowns: we have no idea yet what actually works best, what doesn't work, and what sort of heuristics we can use to decide good orders,
and these are also questions that we really need to answer in systems if we want to start implementing these things. Also, from a higher-level overview, when we want to issue a pattern over a graph database, there is basically the question of a query plan,
and now we have two options. We can live with both worst-case optimal and traditional algorithms and somehow manage to decide online which is the better option (I'm not aware of any work that has studied that), or we can just go with the worst-case optimal machinery and process everything in the system with this idea. Again, it is not clear how to do this for a graph database system, because you need to start putting more things on top and below,
so deciding these things is another interesting problem. Again, for the data structures that support all of this, we don't really know what works and what doesn't. Also, we need better benchmarks. This is important: most of the benchmarks are either relational-based or focused on certain particular experiments, patterns, and queries that people should run. Here are two pointers to good benchmarks, but this is something that is needed from the community as well. I'm going to stop here, but these are all
really exciting questions for systems development. We have a new sort of backbone for our algorithms, and now we have to manage and decide all the other components so that they play together with this backbone. This could be good. We don't really know, of course, whether
all of this will actually end up being worth it, but it is certainly something that we need to understand. Another challenge: when you start putting paths into this, you are really getting into unknown territory. In previous work, we have found that traditional systems don't work very well with these sorts of queries. Even the most modern ones suffer when you start using patterns, especially complex patterns, and you start putting paths everywhere. These are not easy queries to process. What is the roadmap for doing this? I think that the reason this is not very well implemented yet is that there are several things that we don't really understand so far.
You need to understand patterns and paths. First, of course, we need to understand how to compute the pairs connected by paths, which is called the all-pairs problem for these path queries: I want to find all pairs connected by the expression, or, if I have one node a, I want to find all b's that are connected to this node a via the expression. This is also related to the decision problem: now I have two nodes, and I want to decide whether node a is connected to node b. And there is also the problem of different semantics, because things really change for
different semantics. Just a small recap: what do we know? If we are talking about arbitrary semantics, the decision problem is very easy. It's just a reachability query: we start in this node, and we need to reach this other node in a way that respects the expression. But it actually gets harder depending on the semantics. If we start using a no-repeated-something semantics (no repeated edge, or no repeated node, which is what Neo4j uses), then this very simple problem is already NP-complete. Luckily for us, there is again interesting work here. We have dichotomies, and we have tractable fragments. There are also studies (I don't have the reference here, but Bonifati and co-authors have a study where they look into query logs and identify the expressions that people were actually using), and it seems interesting to match those with the tractable fragments from this work. More or less, we know what you can do and what you cannot. But now there is the problem of lifting this to the all-pairs computation.
So we know the decision problem. What can we say about computing everything? It's not that easy when we talk about paths. The naive approach, of course, is to check each pair of nodes: for each pair, we invoke the decision problem and check whether they are connected. If they are connected, this is an answer for the query, and if not, we just continue with the next pair. So the question is whether we can do better, because this seems to be a very slow algorithm. As it stands, it's already quadratic, not counting the cost that we need
to actually check the decision problem. So again, if we're talking about arbitrary semantics, this is already established; let me talk you through it. The problem of computing all pairs can be reduced to finding all possible b's that are connected to a node a, and then iterating over every possible node a. So you iterate over nodes of the graph and not over pairs of nodes. Let me be more precise. If you want to solve this for a particular node a, if you want to find all b's connected to a, you do a BFS over the product of the automaton of the expression and the graph. You build the automaton for the expression, and you start in the node (a, q0), where q0 is the initial state of the automaton. Then you do BFS, outputting all reachable nodes of the form (b, qf), where qf is a final state of the automaton. This way, you find all b's that are connected to a. What is the cost of this? It's vertices plus edges; this is just BFS, a standard thing that you can implement. And what is interesting is that this is basically the same time as it would take you to solve the decision problem, so asymptotically you don't really lose much by checking all nodes instead of a particular one. Now, going back to the all-pairs problem, you can do this for every possible node of the graph, and you end up, more or less, with a quadratic algorithm, which is actually much better than the naive approach, because in the naive approach you need a quadratic number of BFS checks, which is much slower.
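As a concrete illustration of this single-source step, here is a Python sketch that does one BFS over the product of a hand-built automaton (for the expression friendOf followed by follows*) and a tiny graph, and then repeats it for every node to get all pairs. The expression, the data, and the representations are all illustrative.

```python
from collections import deque

# All-pairs evaluation of a regular path query: one BFS over the product of
# the graph and the automaton of the expression, repeated per source node.
# Toy graph and hand-built NFA for  friendOf . follows*  are illustrative.

EDGES = [
    ("Ann", "friendOf", "Tom"),
    ("Tom", "follows", "Tim"),
    ("Tim", "follows", "Eve"),
]

# NFA: state 0 --friendOf--> 1,  state 1 --follows--> 1,  final state {1}
NFA_DELTA = {(0, "friendOf"): {1}, (1, "follows"): {1}}
NFA_START, NFA_FINAL = 0, {1}

adj = {}
for s, lbl, o in EDGES:
    adj.setdefault(s, []).append((lbl, o))

def reachable_from(a):
    """All nodes b such that some path from a to b matches the expression."""
    start = (a, NFA_START)
    seen, queue, answers = {start}, deque([start]), set()
    while queue:
        node, state = queue.popleft()
        if state in NFA_FINAL:           # reached (b, final state): b is an answer
            answers.add(node)
        for lbl, nxt in adj.get(node, []):
            for nstate in NFA_DELTA.get((state, lbl), ()):
                if (nxt, nstate) not in seen:
                    seen.add((nxt, nstate))
                    queue.append((nxt, nstate))
    return answers

all_nodes = {n for s, _, o in EDGES for n in (s, o)}
all_pairs = {(a, b) for a in all_nodes for b in reachable_from(a)}
print(sorted(all_pairs))   # ('Ann','Eve'), ('Ann','Tim'), ('Ann','Tom') on this toy graph
```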
So this is good. We have actually tried this BFS approach in systems, and it is quite promising: we were able to beat existing systems in answering path queries. Now, we don't know whether this will be suitable for feeding the results into patterns; I'm going to get there in a short while. But let me just say for now that there is a problem for which I have not seen a solution. We know that there are fragments for which the no-repeated-node semantics is polynomial time with respect to the decision problem. We don't know, or at least I don't know, whether there is an efficient way of actually computing all pairs connected via a path matching the expression under this no-repetition semantics. So this is
open, and I think it's an interesting question. But coming back: we have an interesting way of computing the all-pairs path-query problem, we just use BFS. The big problem comes when you go back into a system and want an optimal algorithm for a query that combines patterns and paths; I want to actually compute this query. Can we extend worst-case optimal algorithms? Just to tease you with the idea of the problem, you can think of several ways of computing such a query. One possible way is to start local: you compute all nodes that are friends of each other, and then, for each match a, b of this local pattern, the first pattern, you compute all nodes z that you can reach from both a and b. Again, because you can compute the answers for the path part by doing a single BFS, what you need at the end to get the answers of the whole pattern is two BFS searches per tuple a, b that is an answer to the friendOf pattern, the local pattern. This is one possible approach, but of course there is another approach where you start global. You can start by computing the path relation globally and then check the local part on that answer. So, for each node n in the graph, I find all a, b such that a and b actually reach n, and for each of these pairs a, b, I check whether they are an answer to the local pattern. What changes here? Well, I'm doing one BFS per node in the graph, which should be less efficient, but once I do this, I just have to check the local part among all answers for this node n, and of course you can cache it in case you reuse it for more n's, and so on. So there are some cases where this could actually be more efficient than the other approach. We don't know yet, but this is what we're looking for: algorithms for patterns and combinations of patterns with paths that can somehow decide when to compute paths right away, when to start with the global operation, and when to start with the local part and delay the computation of paths as much as possible. The intuition is that you should be delaying the computation of paths as much as you can, but it's not that clear once you start looking at examples. So again, this is another interesting question that we should be looking at.
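The two strategies can be sketched side by side. The toy pattern below asks for friends a, b that both reach a node z via follows edges (follows*, so zero or more steps); start_local matches the friendOf pattern first, while start_global starts from each candidate z. Both the data and the plain reachability helper are illustrative; which strategy is faster on real data is exactly the open question.

```python
# Two evaluation strategies for the pattern
#   a -friendOf-> b,  a -follows*-> z,  b -follows*-> z.
# Toy data; a real engine would use the product-automaton BFS from above.

FRIEND = {("Ann", "Tom"), ("Tom", "Ann")}
FOLLOWS = {("Ann", "Tim"), ("Tom", "Tim"), ("Tim", "Eve")}
NODES = {n for e in FRIEND | FOLLOWS for n in e}

def reach(src, edges):
    """Nodes reachable from src via zero or more `edges` steps."""
    seen, stack = {src}, [src]
    while stack:
        n = stack.pop()
        for s, o in edges:
            if s == n and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen

def start_local():
    # 1) match the local friendOf pattern, 2) two reachability searches per match.
    return {(a, b, z) for a, b in FRIEND for z in reach(a, FOLLOWS) & reach(b, FOLLOWS)}

def start_global():
    # 1) for every node z, find all nodes reaching it (search on reversed edges),
    # 2) keep the pairs among them that also match the local friendOf pattern.
    reversed_follows = {(o, s) for s, o in FOLLOWS}
    out = set()
    for z in NODES:
        reaching = reach(z, reversed_follows)
        out |= {(a, b, z) for a in reaching for b in reaching if (a, b) in FRIEND}
    return out

assert start_local() == start_global()   # same answers, different cost profiles
print(sorted(start_local()))
```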
Okay, so the last part of the talk is about graph analytics. One of the things that people are requiring from graph databases right now is the ability to deploy these so-called analytic queries on top, and this is quite common across the applications that are looking at graph databases. The term analytic query is not very well defined. For this talk, I'm going to think about graph analytics as measures or properties of nodes, edges, or graphs that can be obtained via a mixed approach of querying and iterating. So think of centrality measures like PageRank or betweenness centrality, community detection, clustering, weighted paths, node profiling: anything that involves querying and iteration. For me, this is going to be analytics. So what do we have currently?
Right now, you can either deploy a graph system that may have some built-in features: this is again a Cypher query, and the system has a built-in shortest-path feature. You can invoke it like this: you basically match a start node and an end node, then you call this algorithm with a particular cost, and you say, well, just give me the shortest path from the start node to the end node, and this is what the algorithm returns. Or the other possibility is that you invoke a completely different architecture. You leave your graph database system behind and do this on a distributed architecture, a bulk synchronous parallel or MapReduce sort of variant, or anything like this. Among the most well-known paradigms you have Galois, you have GraphLab, and these paradigms have been built upon and renewed, and there are more and
more works going on. But coming back to a graph database system, so if we really want to do query and iteration in a graph database and we want to be able to use the algorithms that we designed before, everything that is related to pattern matching, the obvious solution is not
ideal. The obvious solution is just to hook up a programming language to a graph database: you iterate at the programming-language level and just use the database to retrieve data. This is like SQL running stored procedures, right? Patterns retrieve relational tables, and this is what the programming language works with. So what you end up doing is reusing this idea of SQL stored procedures to compute these analytic queries. The performance of these procedures is typically not optimal
when compared to all these other frameworks for distributed computation. But I believe that this is for a very simple reason: analytic queries are much simpler than the full power that a programming language plus a database gives you. They usually just iterate over all nodes or all edges and do the same computation over and over again. But this framework of doing whatever you want with a programming language and a graph pattern is just too general, so you will never be able to optimize it. You get a lot more expressive power, but what you have to pay is efficiency. So what do we want instead? We need a procedural way of specifying analytic tasks in some way. If we want to say that we need to compute some PageRank variant, we cannot do it in a declarative way; the idea is not that we
have everything compiled before. We just need a procedural way of saying what we want to do. But we want this procedural way to be based on patterns to be able to use the graph database engine. And of course we also want it to be realistically optimized.
So the key insight here is that many of these analytic tasks involve actions over the neighbors of nodes, doing the same action over and over again. What do I mean by this? Imagine that we want to compute PageRank. We will compute it in a simplified way where we just need to pass weights, so I don't care about normalization or sink nodes or anything like that here. Instead of computing the task as a whole, I'm going to focus on one particular node and then repeat everything for all the other nodes. So we have this node, Ann. Ann has a current weight, and what we need to do is propagate this weight to the neighbors: we divide it by the degree, and each neighbor gets the same share of the weight. Then every node reduces all the weights that it receives from its own neighbors and recomputes its own weight like this. Again, this needs to be normalized and I'm simplifying lots of things, but the message here is that the way this is done is by every node first sending a message
and then every node grouping the messages from its neighbors together and doing some sort of reduction or aggregation. So it turns out that you can start thinking about how such a framework would translate into a graph database system which is based on patterns. It would look like this: if you want to do a graph analytic query, you have to decompose it into successive applications of local patterns, and by local patterns I mean that each node decides based only on a neighborhood of things, and after each iteration a new graph is constructed. We're working on this as well. It's work in progress, but we have a prototype language to do this in SPARQL, and there are many more options. Just to make it a bit more clear: again, we're going to be matching a pattern, and what we say, basically, is that for each connected node, this will be our pattern, I'm going to aggregate the weight of my neighbor divided by its degree, and I will try to match this pattern over and over again for each node and then aggregate all the solutions. So every match of this pattern corresponds to the message that is being sent by a node, and on top of this we need an aggregation operation to group the messages from the neighbors. It's the same idea, but now we can specify everything with patterns. And once we can specify everything with patterns, we can of course invoke the graph system and just tell it: evaluate this, sort of in parallel, over all nodes in the graph, and continue the iteration. It turns out that you can actually speed things up. We are not at a phase yet where we can compare to a parallel framework, but we are working on this.
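Here is a minimal sketch of the simplified propagation loop just described: a message phase where each node splits its weight among its out-neighbours, and a reduce phase where each node sums what it received. As in the talk, there is no damping, normalisation, or sink handling, and the toy graph is illustrative.

```python
# Simplified PageRank-style weight propagation as message passing.
# No damping, normalisation or sink handling, as in the talk's example.

EDGES = [("Ann", "Tom"), ("Ann", "Tim"), ("Tom", "Tim"), ("Tim", "Ann")]
NODES = sorted({n for e in EDGES for n in e})

weight = {n: 1.0 / len(NODES) for n in NODES}

for _ in range(10):                                   # fixed number of iterations
    # "message" phase: each node splits its weight among its out-neighbours
    inbox = {n: [] for n in NODES}
    for node in NODES:
        out = [o for s, o in EDGES if s == node]
        for o in out:
            inbox[o].append(weight[node] / len(out))
    # "reduce" phase: each node aggregates the messages it received
    weight = {n: sum(inbox[n]) for n in NODES}

print(weight)
```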
Perhaps the most important comment that I can make at this point is that there are several processing paradigms for graph analytics, and all of them look basically the same: they deal with this idea of message passing. If you look at Pregel or GraphLab, or the systems that implement these ideas, each node operates with a message function and a reduce function. You define these functions for each node; the message function passes a message to the neighbors, and the reduce function is an aggregation. There are also programming models, such as Galois, that are based on nodes and neighbors, where again you just define functions over nodes and over neighbors and then aggregate everything. The last paradigm to join this crowd is graph neural networks, but the only difference here is that now you allow these functions to be learned instead of defining them yourself. You deploy some neural network that learns the aggregation function and how messages are passed to your nodes, but if you look at it from behind, it's the same thing: every node passes a message, and then you reduce everything. So we're experimenting with the idea of putting this into a graph database system and comparing with everything else, and this is what seems to be working. It seems that if you want to do graph analytics in
a fast way, you need to transform it into a message-passing algorithm somehow. There are a lot of things that are missing; this is just the tip of the iceberg here. Of course, we need much more work on what is fast and what is not, especially regarding databases. We need languages for declaring analytic tasks; we have a prototype on top of SPARQL, but we're still working on it, and it's not clear how you would define this language. Again, it has to be less expressive than just hooking up a programming language to a database, because you really need to be able to optimize it somehow. I think one of the most promising directions here is understanding the relationship between all these other paradigms. We have a paper at a machine learning conference right now where we compare graph neural networks and a special form of graph pattern, and there is a lot of work that can be done in this direction as well. The fourth item, something that I would really like to do but have not even started yet, is to bring in GPU computation. The idea is that you're going to have a language for graph analytic tasks, but this language will be somewhat less expressive; the language is not going to be Turing complete.
So there should be some phase in which you can compile this analytic task into a database query that first builds a certain number of matrices; and then the local processing, this local-pattern thing, the update, the message passing and reducing, can most of the time also be expressed as matrix operations, which means that you can probably speed it up by using a GPU workflow. This is something that you should be able to do as well.
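As a sketch of that last point, the same simplified propagation step can be written as a single matrix-vector product; numpy is used here for illustration, and on a GPU the identical expression could run through an array framework. This assumes the simplified, sink-free toy graph from before.

```python
import numpy as np

# The simplified propagation step as a matrix operation: w <- M^T w,
# where M is the row-normalised adjacency matrix.  Toy graph, no sinks.

A = np.array([[0, 1, 1],    # Ann -> Tom, Tim
              [0, 0, 1],    # Tom -> Tim
              [1, 0, 0]],   # Tim -> Ann
             dtype=float)
M = A / A.sum(axis=1, keepdims=True)   # each row sums to 1 (assumes no sinks)

w = np.full(3, 1 / 3)
for _ in range(10):
    w = M.T @ w                        # one message-passing round as a matrix product

print(w)
```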
We have not looked at it yet, but of course this is very promising, and there are lots of directions in which to push further research. So this is it. Again, I'm just going to conclude with some final
words. I really think that there is still plenty to do in graph databases. I would really encourage you to start thinking about what the real needs of systems are: if you really needed to start building a system including everything that has been done, how would you orchestrate everything together? How would you actually make it work? Because in these intersections, I think there are a lot of new problems that we have to tackle if we want graph data to keep growing. So just focus on the needs of systems. And to finish, I hope that I was able to get you excited about all these new problems that I'm pointing you at. I'm happy, of course, to answer any questions, and we'll see you around online in a few minutes. Bye.