
Handling Billions Of Edges in a Graph Database


Formal Metadata

Title
Handling Billions Of Edges in a Graph Database
Number of Parts
95
License
CC Attribution 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
The complexity and the amount of data keep rising. Modern graph databases are designed to handle the complexity, but not yet the sheer amount of data.
Transcript: English(auto-generated)
The slide you missed is just the one about me, so I talked about me, so never mind. What are graph databases? Who of you has already worked with the graph database? Oh, a couple. Good.
The other ones, do you know what graph databases are? What they are supposed to store? Then I will just give the introduction because only half of the group knows it. What are graph databases? Graph databases store so-called schema-free objects, which they name vertices. You can imagine them as one row in a relational table
with a difference that you do not say which attributes you have on each of those rows. Each of the vertices can have different attributes, if you like. Then they store relations between those vertices, which are called edges.
You can compare them with these many-to-many relations in a relational database. Importantly, edges have a direction. One of the vertices is the start and the other side is the target, or source and target. One example is on the left-hand side.
We have a couple of vertices, for example this one. The attribute is name, which is Alice, age 32. We have a Bob over here and we have a hobby over there. Those are the vertices. Then we have the relation, the connection between those. For example, Alice has a hobby called dancing.
We could store all these things in a relational world. No problem at all. The difference now is in the mechanisms that we have to query the database. First of all, edges can be queried in both directions. I can say I start at Alice and I only want to follow edges that point out of Alice.
Or I start, for example, at the hobby and I want to follow only edges that point into the vertex where I am. Or I can even say I don't care in which direction the edges point to. Just go as long as there is one edge.
You can do the same thing in a relational world, no problem. But you can also easily query a range of edges. So pick a random starting point and say: please go two to five edges. One edge, two edges, three edges, five. It's one line in a graph database. For a classical relational database, you need to do a lot of joins: first a join over two steps, then a second statement with three join steps, then one with four join steps, one with five, and then a final one to put all of the results together. Even more, you can say: I don't know how many edges I have to take, please go until you find the first element that matches my filter conditions. Whether it needs one edge, two edges, or five hundred, I don't care, just do it. You can even say: please pick two of those vertices and give me the shortest connection. One, two edges. That would be the result.
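In AQL, the query language of ArangoDB that is introduced later in this talk, these two queries might look roughly like the following sketch (the collection names persons and knows are invented for the example):

    // go two to five edges from a starting vertex, in one statement
    FOR v IN 2..5 OUTBOUND 'persons/alice' knows
      RETURN v

    // one shortest connection between two picked vertices
    FOR v IN ANY SHORTEST_PATH 'persons/alice' TO 'persons/eve' knows
      RETURN v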
In many cases, the shortest path is not deterministic, because there may be a lot of paths of the same length to the same object. What are typical graph queries, queries that graph databases perform very, very well at? For example: give me all the friends of Alice.
I start at Alice and I want to go one step and find all the connected vertices. Give me all the friends of friends of Alice. So now, starting at Alice, go two steps. But, everything that is reachable within one step should not be part of the result.
What is the linking path between Alice and Eve? We just pick two random vertices and want to find the shortest path between them. As I said, this is only one possible result. Also possible would be Alice to Charlie to Eve.
We would have the same length. We need two edges in both cases. Or we can say, if we are standing at a train station, we buy a ticket. The ticket says, you can go up to six stations from now on and switch as many lines as you like. Where can I go? For a graph database, it is easy to find out that you can go up to here or over there
if you just follow those rules. The most commonly used query in a graph database is so-called pattern matching. Pattern matching means you describe a part of your graph, a subgraph, with some unknown components, and you match this pattern onto your entire graph database to find all the subgraphs that can fill in those unknown components such that the pattern is matched. For example: please give me all the users that share two hobbies with Alice. The pattern is: Alice has a relation to one hobby and a relation to a different hobby, so those two are not equal, and then we need to find a friend that has a relation to this hobby and to that hobby as well. The friend should be the result. Of course, we can make it extremely complicated. Give me all the products that at least one of my friends has bought together with a product that I already own. Here, Alice has bought a product. We need to find someone who is a direct friend of Alice, who has bought the same product, and who has bought something else that Alice does not have a relation to yet, because this product is most likely useful for Alice. We would like to recommend it to her.
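As a sketch, such a recommendation pattern can be phrased in AQL roughly like this; the collections users, is_friend_of and has_bought and the start id are invented for the example:

    // products Alice already owns
    LET owned = (FOR p IN 1..1 OUTBOUND 'users/alice' has_bought RETURN p._id)
    // direct friends of Alice
    FOR friend IN 1..1 ANY 'users/alice' is_friend_of
      // keep only friends who bought at least one product Alice also owns
      LET shared = (FOR sp IN 1..1 OUTBOUND friend has_bought FILTER sp._id IN owned RETURN 1)
      FILTER LENGTH(shared) > 0
      // recommend the friend's other products that Alice does not own yet
      FOR recommendation IN 1..1 OUTBOUND friend has_bought
        FILTER recommendation._id NOT IN owned
        RETURN recommendation   // duplicates across friends possible; de-duplicate afterwards if needed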
What are non-typical graph queries, queries that graph databases could probably find an answer to but are not optimized for? The short answer: everything that is based only on the attributes of the objects you have and does not care about the relations between them. Graph databases are good when you search along the relations and bad if you just need to look at the objects. A bad query would be: give me all the users with an age attribute between 21 and 35, because that means we need to do a full table scan. And as graph databases are organized differently than relational databases, this is even more costly than a full table scan in a relational world. Or: give me the age distribution of all users, for the same reason. Or: group all the users by their name, which is even worse, because I do the full table scan and then some grouping afterwards.
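Written as a query (again an AQL sketch with invented names), this is a plain document query with no traversal in it; without an index on age it forces a full collection scan:

    FOR user IN users
      FILTER user.age >= 21 AND user.age <= 35
      RETURN user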
The reason why the one thing is so expensive and the other thing is so cheap is the query mechanism behind it, which is called a traversal. How this works, I would like to explain with a simple example.
We are iterating down two edges, so two steps starting at the vertex, and apply some filters on each of the components that we find. What does the database do if you issue such a query? First of all, it picks a start vertex. Always needs one starting point for its query mechanism.
Then it takes a look at all the edges that are connected to this vertex. And of course finds out all the neighbors those edges are pointing to. Then we apply some filters on what we have just found and figure out, so C is not important, doesn't match the filters.
A and B are still okay. Now we have a result for depth one, but we said we want to go two edges. Next what happens is the graph database picks one of those by random
and does the same thing. For example, we pick A: we have to look at all the edges connected to A and all the vertices those edges point to. We apply the filters again and now we figure out that S to A to E
is actually a path of length two and all of those match the filters. So this thing would be returned. Can be some projection afterwards to just return the E element or whatever, whichever the user requested. But this is the thing the graph database has to compute to find the first result.
Now we have the first result. We have to check: oh, do we need to find the edges of E? No, because we do not want a third step in our path. So we go back to A and check: is there any edge we haven't processed yet? In this case there is no further edge
because this one was thrown out by the filtering. So the database goes back to the starting vertex and checks if there is any edge not processed. In this case we find B and now apply the same schema. Check the edges, apply the filter conditions
and if they match we find an additional result. And what we just did is a so-called depth-first traversal. I'm starting at a point and I first go as deep into the graph as necessary to find the correct result. And then I travel back up.
An alternative is a so-called breadth-first traversal where I say I start at the vertex, then I first compute all vertices I have in depth one. And after I finish this, I'll start to compute all the vertices I have in depth two, no matter if they start at A or at B or at C.
And then depth three and so on. That of course changes the ordering of elements and allows for special conditions. So for example, if you remember the Friends of Friends, I can tell the algorithm, whenever you find something in the first step, do not visit it in the second step again.
So the graph database can easily find those friends of friends while ignoring the first-level friends. Let's talk a bit about complexity. How expensive is that? First, we have to find the start vertex.
We have to pay this cost only once. And how fast we can find the start vertex depends on the indexes that we can use. Best case, we have an in-memory hash index with constant time lookup. Could also be a sorted index with log n lookup. If you don't have an index at all,
full table scan, O of n. Then, and the more important thing is, for every depth we have to do four steps. First of all, find all the connected edges. So we have a vertex in hand, what are the edges pointing out or pointing in into this vertex.
This is typically done with a so-called edge index, or index free adjacency. I'll come to the differences there later in this talk. Important here, this operation needs to be in constant time. Otherwise, the graph database cannot compete at all
because this operation is the one that it will always do the whole time. Now, we need to filter non-matching edges. And we can only do this by taking one edge at a time and checking if it matches the filter or not. So, linear in the amount of edges that we find.
Then again, we need to find the connected vertices. Again, that depends on the index we can use. Best case, we have an in-memory hash index, constant time. But again, once for each edge. And of course, we have to apply filters for those vertices as well.
All this can be done in a single pass, so 3 times n cost, linear complexity. Linear sounds evil? Who thinks so? Okay, linear actually sounds evil, but the difference is that it is not linear
with the total amount of edges that you store. So, you can store billions of edges and still your simple traversals can be fast because the traversal only depends on those edges that are actually connected to each of your vertices. So, if you make sure you have only 20 edges connected to each vertex,
you can store billions of edges and still you can find all your results pretty fast. So, traversals only scale with the result size or the amount of documents they actually have to check. They are not affected by the total amount of data.
But, and this is an important but, the cost of a traversal increases significantly with the search depth. Because when I find n many edges in depth 1, it means I have to do the 3 times n steps for each of those n edges again in depth 2. Thereby, the depth is actually the exponent of the complexity. And there is a scientific rule that for all naturally growing graphs, so social networks, everything where user interaction somehow connects things, you typically find the so-called seven degrees of separation. The rule means: pick any two random vertices; the shortest connection between those two is at most 7. I think on Facebook it is about 5 already, because on Facebook you have some friends which are not really friends, so you just have more friends than you actually have. Thereby the number of edges increases and the distance between two vertices gets smaller. If you don't believe me, just try it out on Facebook: pick any random person you don't know yet and try to figure out how many people are in between. This has an implication: whenever you have to formulate a query with a traversal depth at or above 7, it is actually cheaper to just dump out the entire database, because we will take a look at all of it anyway.
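Summing up the costs just listed in a compact (and simplified) notation of my own: with a hash index the start vertex costs $O(1)$, and every depth level costs roughly three linear passes over the $n_i$ edges found at that level, where $n_i$ grows like $\bar{k}^{\,i}$ for an average of $\bar{k}$ edges per vertex:

$$ \text{cost} \approx O(1) + \sum_{i=1}^{d} 3\,n_i, \qquad n_i \approx \bar{k}^{\,i} \quad\Rightarrow\quad \text{cost} = O\!\left(\bar{k}^{\,d}\right) $$

So the search depth $d$ really is the exponent, which is why a depth of seven or more effectively touches the whole graph.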
So, a lot of theory. Now let's get our hands on some technology. I would like to introduce you to the open source project that I am working on. It is called ArangoDB and it is a multi-model database. A multi-model database allows you to store key-value pairs, documents and graphs within the same technology. So, just one core; it is not relying on different databases underneath, it is just one thing. And we have a unified query language that allows you to query documents, graphs and key-value pairs, and even do joins across document collections. All of the above can be combined in a single statement. So, I can start with a document query to find the starting points for a graph traversal, and the result of the traversal can be joined with key-value lookups if I like to. And ArangoDB has ACID support including multi-collection transactions, which is not typical for the so-called NoSQL world. Let's talk a bit about the query language.
And the query language is inspired by a lot of other query languages that do not have this relational select star thing but more the programmer's point of view. So, we are doing iterators. We start with for loops and the query that we just see is the simplest query that we can imagine.
For every user object that we find in a users collection (collection is the term in ArangoDB for what a table is in the relational world: logically grouped elements), just return the element that we find. So, it's a simple select star from users.
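A sketch of this simplest query in AQL (the collection name is illustrative):

    FOR user IN users
      RETURN user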
Of course, we can apply filters: only return those users where the name is Alice. And of course, we will now find one object which is called Alice. The optimizer of AQL is clever enough to figure out: oh, there is this super awesome index stored on name, it's probably a good idea to use it in this case. So, the query will be fast when indexed. Is it smart enough to figure out breadth-first or depth-first on its own? No, because that changes the result, so you choose that via some options.
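The filtered variant, sketched the same way; it can use an index on name if one exists:

    FOR user IN users
      FILTER user.name == "Alice"
      RETURN user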
But talking about depth-first and breadth-first, this is how you execute a graph traversal. So, for product in, and then we give the direction we want to follow: outbound, inbound, or any direction. Starting at the user, following has-bought edges. And then just return the product.
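Roughly, assuming an edge collection has_bought and a start document 'users/alice' (both invented names for the sketch):

    FOR product IN 1..1 OUTBOUND 'users/alice' has_bought
      // OPTIONS {bfs: true} after the edge collection would switch to breadth-first order
      RETURN product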
So, in this case we would figure out that Alice at some point has bought a TV. And of course, the entire thing can be combined with filters, and it even has some more return arguments. If I ask for up to three return arguments, the first one is the vertex, the second one is the edge pointing to the vertex, and the third one is the entire path. So, in this case, I have three steps: starting here, one, two, three. The PlayStation would be the recommendation, this edge would be the action, and the entire thing would be the path. And whichever you need in your result, you can just return to the user. On the path we can apply some filters. So, for example, the second vertex on the path should be in the same age range as my user. So, we apply some filtering on this object.
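With all three loop variables and a filter on the path, a three-step version might look like this sketch (the edge collection names and the age attribute are assumptions, not from the talk):

    FOR v, e, p IN 3..3 OUTBOUND 'users/alice' is_friend_of, has_bought
      // p.vertices[1] is the second vertex on the path; non-matching branches can be dropped early
      FILTER p.vertices[1].age >= 21
      RETURN {vertex: v, edge: e, path: p}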
And the optimizer figures out, as soon as the algorithm stands at this point, it checks this condition, and can decide, oh, this one doesn't match. So, we do not have to compute all the stuff that's going on here, so we can abort early. And of course, we can do some additional filterings
on the other returned objects. So, for example, the recommendation. And combine it with limits, or with skip, or whatever, sort, whatever we need. And then can just return the thing that we need. But now let's talk a bit about the challenges
that we have if we need to scale the whole thing. And the most common challenge is so-called supernodes, because many graphs have celebrities: vertices with many inbound or outbound edges. For example, if you take Twitter and you look at the number of followers that, for example, Barack Obama has. I think he just has one or two more than I have. But he is a so-called celebrity, so whenever you do some traversal across Barack Obama, the entire traversal will take a lot of time, because it has to linearly scan all of his followers. If it goes across me, it just checks the couple of hundred that I have, so it's pretty fast. So, what you need to do is optimize for those cases, whenever you have these naturally growing supernodes. And in most cases, you do not need all the followers.
But you just need a subset of those, for example the newest ten. And in order to find those faster, you can use a so-called vertex-centric index. A vertex-centric index indexes the edges by their vertex together with arbitrary attributes on the edge, for example a timestamp. You make this index sorted and you get, in logarithmic time, the newest follower of a certain vertex; finding the next ten is just taking the next elements in the index. And that's it. With this trick we can figure out the initial set of edges that we need to look at in the linear step faster, and we have a reduced set of those. Thereby, we reduce this n and the complexity, and thereby it's faster.
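A sketch of the kind of query this serves, with invented collection, attribute and id names; the point is that a vertex-centric index on the edge collection, for example on ['_to', 'timestamp'], lets the engine pick exactly these edges instead of scanning everything attached to the supernode:

    // recent followers of one heavily-followed vertex
    FOR v, e IN 1..1 INBOUND 'users/celebrity' follows
      FILTER e.timestamp >= 1500000000   // hypothetical cut-off value
      SORT e.timestamp DESC
      LIMIT 10
      RETURN v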
And that actually works, as I will show you now in a live demo. So, please cross fingers. Yeah, the browser is already a bit broken. Okay, looks good. So, this is ArangoDB's web interface.
What we see here: I have two collections. I can increase the size, just a second. One of those is a document collection; documents are the vertices in our case, and edges are connecting those documents. It is not a special dataset, nothing fancy. However, on the edges I don't have any additional indexes defined, only the edge index and the primary index. And I don't have any extra indexes on the documents either,
but they don't matter for this query. So, I'll just do a simple query, starting at one of my vertices, following the edges, and just return anything that I find. But, important, I apply a filter condition on the first edge that I look at.
So, just execute this, and we get this result in 17 milliseconds on my local machine. Ten elements only, in memory, of course. So, not so important, but now I just go back, go to the edges,
go to the indexes, and say: I would like to add a vertex-centric hash index, because I use equality, on _from, because I'm doing an outbound search, and on value, which is the attribute that I used in my filter condition.
Create it. There we go. Now go back to the query and execute the same thing again. Half a millisecond, down from 17, just because we could use the index and do not have to take a look at a lot of edges that we actually do not need in our query.
Doesn't help always, but it's a good start. And it only works if you have certain attributes that you filter on, on edge level. So, if you do so, make sure that you put some of the information into the edge as well.
However, this thing was running on my local laptop. That's not the topic of this talk. The topic of this talk is big data, challenge two. So, whenever we have a graph that is larger than what fits on my machine.
So, we store everything that we can, as long as we are allowed to. So, that means the data set easily grows beyond a single machine. So, even if we have a super powerful one. Our customers say, yeah, we do not talk to you if you don't scale up to 100 machines, because we cannot fit our data into a single one, we need 100 at least.
And of course, this includes graph data. Now let's talk a bit about scaling. So, it's pretty easy to distribute the graph on different machines, because we can just apply the general pattern, which is called sharding.
Sharding means we just take out our scissors and just cut the graph into pieces. And put those pieces on different machines. Pretty easy to do, but it gets hard to query.
Why? First of all, it's not possible to get a global overview of the graph on a single machine. Assumption is it doesn't fit. So, we cannot get it. And if we just cut the graph into different pieces, there may be edges between different servers.
So, not all data is stored locally anymore. And in a sharded environment, typically, or in most cases, network is the bottleneck. So, every network operation slows down the entire query. Because network is, of course, slower than local in-memory or even disk access.
So, we need to reduce the number of network hops in our environment. Again, vertex-centric indexes can help with supernodes, but they only help within each machine. So, not everything is solved by putting a vertex-centric index on it.
Now let's distribute the graph. So, dangers of sharding, only part of the graph on every machine. Neighbouring vertices may be on different machines. And if you remember how the traversal works, that means whenever we are on a vertex
and we figure out, oh, our neighbour is on the other machine, we need to teleport over to the other machine, do some traversal stuff there, maybe teleport back. In some graph databases, for example ArangoDB, even the edge could be stored on a third machine, making it even worse.
So, we need a mechanism for queries to be executed in a distributed way and the results need to be merged locally. And now we have the choice of several distribution techniques.
So, how can we distribute the graph across the service that we have? The simplest thing is what I like to call a random distribution. So, I take a look at every object that I have in my graph, throw a dice, and depending on the result, I put it on one of those machines. The technical term here is consistent hashing of the key.
Advantage, every server takes an equal portion of the data, so I can easily estimate how many servers I'm going to need for my data. It's easy to realise, it doesn't require any knowledge about the data,
it just always works, which is a big plus and it's easy to set up. Disadvantages, the neighbours are most likely in different machines, probably edges are on other machines than their vertices, and a lot of network overhead is going to occur when we query
because the graph will end up like this on the different machines. No one can tell any more which were the connected components, and whenever we query, the query executor just has to jump from here to there, back and forth, and so on. That will be rather slow. We can do some optimisations there, but not too many.
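In its simplest written form (plain hash partitioning; proper consistent hashing is a bit more elaborate, but the idea is the same), the placement of a document $d$ depends only on its key:

$$ \text{shard}(d) = H(d.\_key) \bmod s $$

with a hash function $H$ and $s$ shards, so the graph structure plays no role in where things end up.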
However, it works and I'm just going to show you. Now I'm leaving my single machine and I'm going over to the cluster that is standing in our office in Cologne. I've imported a couple of datasets in here.
For now we will only focus on four of those collections. I'll have profiles random, relations random, profiles smart and relations smart. For now, profiles and relations random are the ones we go to.
The dataset that we imported is a social network called Pokec, which is available from Stanford University and contains the social data of some community, I can't recall which.
The dataset has 1.6 million documents, which have a serious amount of data stored to them, so quite large. The relations are 30.5 million edges just connecting those profiles.
There are no further attributes stored on the edges. As you have seen by the name, or guessed by the name, this dataset was imported with the random distribution. I don't know where the profiles are located at.
I don't know where the edges are located at. Neither does the query optimizer. However, I can execute or I am able to execute a query on those things. The query that I'm using, I'm doing a three-steps traversal, applying some filter on level one and on level two. I only can apply filters on the vertices, edges don't have attributes,
and then just return whatever I find. I'm starting at one of those vertices and following the random distribution of the graph. Just increase those results, execute, and it takes a while.
Roughly one and a half seconds for 1,400 elements. The traversal has looked at many more, because I applied some filter conditions and threw out a lot of stuff. However, it works, and in some cases this might be enough.
But in most cases, it isn't. We need to do something which is a bit more clever than this. Yeah, we had that. First, let's talk a bit about index-free adjacency. Index-free adjacency is used by most other graph databases.
For example, Neo4j, which you may know, which is the most well-known graph database nowadays. But Neo4j doesn't scale: they have the hard limit that the entire graph has to fit on the disk of every machine in their cluster.
If that doesn't fit, Neo4j breaks down, won't help at all. That's their condition. The reason why they have this is, in my opinion, the index-free adjacency. Because it means that every vertex maintains two lists of edges, in and out, and they do not use an index to find those edges,
because they are physically stored next to the vertices. How could we shard this? Of course, parts of those are easy, because both sides of the edge are on the same machine. But what about the last edge? In ArangoDB, we use a hash-based edge index, so a constant-time lookup, which in complexity-theory terms is just as fast. But this gives us the advantage that the vertex is actually independent of its edges,
and it can be stored on a different machine. In our case, we can just pick any of those machines and store the edge too. The other vertex then has to pay by a network hop. Now let's talk a bit about a more clever distribution of the data in a cluster.
What we have seen is that many graphs have a natural distribution that actually divides the graph into sub-parts, which are smaller, and where most of the edges are within the same part,
and rare edges go across those parts. For example, if you look at social networks, typically you have more friends from your local country or region than you have friends abroad. You have a few abroad, but most of them are local. For blogs, for example, posts with the same tags are typically related. Products with the same category are typically related. The feature of these groups is: most edges within the same group, rare edges between those groups. If we just use this knowledge about our domain and actually shard by those groups, we can enforce that large, connected parts of the graph end up on one physical machine, where we can still make some sense of this part, and some other part is somewhere remote.
The enterprise version of ArangoDB uses this domain knowledge for shortcuts and for optimizations. Let's do this. Close this one. Just go back to the same data. Oh, I think I haven't talked about the setup yet, right?
My setup has three physical servers: three database servers, which are responsible for storing the data, and two coordinators. Coordinators are the user API: you send queries to them and they figure out which database server is responsible for the data, pass the query on, and do some optimizations. Each of those coordinators is on the same physical machine as one of the database servers. And of course I have used three shards for my collections.
I think I somehow lost connection now. Maybe it's the VPN, I had some trouble before. Just give it a second.
There we go. Back again. Same thing. We still have the same collections. Relations random, profiles random, profiles smart.
Now we are using the smart profiles. It's the entire same dataset, but with a different distribution. We have taken the region that is stored in here to group those people together on the same machine.
If I now execute the same query, but starting in the smart distributed dataset. The only things that I change. Execute the query and there we go, 200 milliseconds. Just because we can do some shortcuts and most of the graph
is stored on a single machine and we have fewer network hops. Next slide. How does it work and why does it work so well? First of all, this is the cluster setup that I have just talked about. We have the coordinators, we have the database servers, and the graph is stored somewhere on the database servers. Now we are querying for another long path. We can see here, it's a good start here. Then we need one network hop, another one back, because this one doesn't know that there's only one over there. Then we go back, back again. Already one, two, three network hops. Four, five, six to find this path. If we now do the distribution more cleverly, by just grouping those groups onto the same server, we still have one network hop, but only one. Then a large part of the graph is actually on the other database server, speeding up the query process of course.
Good. What I've just talked about was so-called point search use cases. I pick one specific point and I'm searching around a small area for this one specific point. Some other graph algorithms require to touch larger portions of the graph.
For example, the page rank algorithm that Google uses. That actually needs to contact all the vertices that are stored in the database because it needs to compute one score for all of those. Community detection.
Finding those groups that I did just use for the smart graph distribution. Or friend score, the thing that Facebook does when it shows you, oh, you probably are friends with those guys as well. Of course, traversal is not suited for those tasks because it would mean you first do a full collection scan
to find all the starting vertices and then one by one do the traversal process for those. Way too slow. We need an alternative and that is some batch processing algorithm. We need more parallelism. For traversals, we just do one local search, forget about it, do the next one.
For all the other vertices, we don't do anything. We need to parallelize this more. Of course, we want to use the computational power of all the machines that we have at the same time. Still, we need to limit the network overhead.
However, we need to accept that those algorithms are more or less offline because they typically need to do a lot of processing work on a large amount of data. Probably not in a millisecond result time.
The framework that we use there or we offer you there is called Pregel and it was initially developed by Google, initially for PageRank. The idea of Pregel is that one worker has to be defined and that runs on only one vertex at a time.
This worker can only see this one vertex and the outgoing edges and nothing else. Then it can send messages to other programs or to other workers. This has the advantage that all the workers are independent from each other.
In theory, I could start all of them in parallel. I don't have any overlapping data because I only see my vertex and its outbound edges, nothing else. I don't need any locks because I cannot write to anything that's not in my outgoing edges or in my own data.
Highly parallelizable. What is the idea? We can spawn many workers on all those servers. Ideally we spawn as many workers as we have CPU cores.
One server is the so-called master and tells the other workers, now do the execution please. The master can trigger several execution steps so it waits until all the workers are done and then can tell them, oh we are not quite there yet, please do another round of execution.
As soon as all the workers have reported they are done, he can start the next step or he can decide we are now there, let's terminate the algorithm and offer the result. Picture for the workflow.
The master has so-called super steps and he is contacting the workers. The workers are working independently and can send messages to one another. However, there is a global barrier. So in super step one I can only read messages from the step before, so no messages.
In step two I can read the messages from step one. In step three the messages from step two and so on. We have to wait until all workers are finished at this global barrier. Let's make an example worker.
First of all we take all the incoming messages and sum up the result. Initially this is zero because we don't have any incoming messages. Then we take a look at our vertex current score. Initially one divided by the number of vertices. We add the sum that we computed over here to the current score
and store the new thing as our new score. And then we send the new score divided by the number of edges to all our connected vertices. Then we say we are done with this step.
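Written out as formulas, the per-super-step update just described for a vertex $v$ with score $s_v$ looks like this (production PageRank additionally uses a damping factor, which is left out here):

$$ s_v \leftarrow s_v + \sum_{u \to v} m_u, \qquad m_v = \frac{s_v}{\deg_{\text{out}}(v)}, \qquad s_v^{(0)} = \frac{1}{|V|} $$

where the $m_u$ are the messages received in the previous super-step and $\deg_{\text{out}}(v)$ is the number of outgoing edges of $v$.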
The master can say: yeah, now please try again. From now on we get messages from other servers, compute the sum, and send out the sum divided by the number of edges to all the other connected vertices. This is the PageRank implementation. Typically we run it for about 30 rounds and we get a quite stable PageRank distribution across the data. What we then find is that vertices that have lots and lots of incoming edges get a very high score, and vertices that only have lots and lots of outbound edges get a pretty low score.
This is no sneak peek anymore: ArangoDB 3.2, released a couple of weeks ago, shipped with the first version of Pregel. We have support for batch processing in the graph area and we have pre-implemented algorithms for PageRank, for community detection, for so-called single-source shortest path (I can say: this is the start vertex, please give me the shortest paths to all the other vertices that you have), and for some centrality measurements.
That's it. I still have 10 minutes to answer some questions. Thank you very much. If you like the project please follow us on Twitter and give us some stars on GitHub because we're an open source project. We live from GitHub stars. Thank you very much. Now I'm open for questions.
Yes, please. You mentioned that it's all schema-less, right? Yes. I still have an implicit schema. Can I have some kind of schema migration? The question is ArangoDB is schema-less,
however there is implicit schema. The question is now if we have some kind of schema migration from A to B. In the ArangoDB core there is no such thing because that is actually moved over to the application logic so that you can do several versions of the data.
But we have an extension which is called Foxx, where you can define a REST API on top of ArangoDB. It has schema validation, and this can actually do the schema migration on the fly: whenever you make a request through the Foxx API, you do the schema validation, send out the new schema to the user,
and store an updated version back to the database. But there is no built-in background process which says oh, please upgrade my schema now. You're welcome. Yes, please. Would that mean that you just run all your data
and then you run a new situation? Or what? The question is in the case that you are already running your application
you imported billions of data and then you end up with a bad distribution and you don't know how to continue. First of all, of course you could call us and we could help you, sure. But to help yourself you could for example run a community detection algorithm which probably takes a while for such a large amount of data
but then it ends up with a pretty good distribution. You would just re-shard your data into the better distribution, and then your queries should be faster again, because then most of the edges are within the same group. So that would be using the Pregel framework that we released with the newest version. Yes, please. So there is no live re-sharding for changing the shard attribute, but you could just migrate over to a new collection, or dump and restore the data with the new sharding. Of course that could be done in a running system, but not as a background process using the same dataset.
That is something we would like to add in the future. Yes, please. Do queries still work?
So the question is how it handles concurrency, and if you delete some edges, whether the queries still work. ArangoDB has some transaction mechanisms in the cluster, so some guarantees, not fully ACID in the cluster, which actually prevent you from running into something which someone else is just deleting at the moment. So what happens if you delete an edge and you come to this vertex with a different query: it will just not follow this edge, because it is gone, and then it depends on the order of execution. On single servers, in this case where the edge is located on a single server, we have this ACID guarantee for the single instance, but not for the entire cluster view. So it is clear whether you can see the edge or not, but it may be a bit non-ACID across all servers. Yes, please.
Okay, so the question is about our roadmap to the future. So what we are working on is data center to data center replication and that includes an even better transaction handling in the cluster
because we need that. Then we will add... I don't know if I'm allowed to tell this. We will add an additional data model, so to say which allows for rather good full text searches and even more.
So that is something which is not yet fully defined but that will be added in the next versions. And of course we are doing some more optimizations in the cluster so our focus point right now is cluster scaling. Yes, please.
So we have compared with other graph databases of course. I only have an older version of the benchmark
So it is, to be honest, it is from 2015 I guess. It should be stated somewhere. There, 2015, end of 2015. Where we actually compared against all the other specialized solutions.
So as we did this benchmark we said we are a multi-model so probably we are not as fast in the certain aspects as the specialized solution and we just wanted to verify that. So we executed a lot of things that are pretty common operations in those engines.
And we actually figured out that we are not that bad as we expected. So we thought maybe we get to 120% everywhere. So 100% is the fastest, 120% would be us. We are a bit better than that in the case where we executed those tests. So for example if we are talking about the graph features
we have done shortest path, neighbors and neighbors including the data. And the first neighbors is not including the data so this is always I think neighbors two in this case. So neighbors of neighbors distinct. Shortest path computation, we did pick some random points
and just wanted to find the shortest path. What is this? So we are seven times faster than Neo4j and 22 times faster than OrientDB. For neighbors actually we implemented a client side neighbor search for MongoDB
which is two times slower than our implementation but it is still two times faster than Neo4j's built in single step implementation of neighbors. And if we include the data this all shrinks together because then parsing the data, shifting the data across the network
using the data set that I have just shown you takes up a significant amount of time. So there it is only one and a half times faster, because we have some performance gain on the simple algorithm, but shifting a lot of data across the network then is the bottleneck, more or less. But yeah, just leave it like that. Yes please. A question about this benchmark: you mentioned previously that Neo4j stores its neighbors immediately on the vertex itself. Yes. So do you have an explanation for why the neighbors search is so much slower in Neo4j?
So I think I have to make it clear again. So the neighbors in this case is neighbors of neighbors. So two steps. So which is named on the left hand side but not over here. And what we think or what I think is they store the vertex
and then the list of edges, but not the vertex on the other side. So only the edge, which knows: oh, we have to look over there. And maybe this lookup over there is slower than it is in our case. So finding the other vertices may be slower than it is for us. But I don't have that much insight, I just know the data format they have.
Okay, I just think I just, do I have time for one question? Okay, thanks. Another one, last one.
So the difference in, yes, yeah, okay. So the difference between neighbors and neighbors with profiles, between ArangoDB and the tabular Postgres. The thing is that neighbors with profiles is not the same amount of data; it is fewer neighbors, because it took a long time already. And then I don't know why Postgres is actually faster, sorry. Maybe the client implementation that we used is just better at serializing and deserializing the data. This is executed in Node.js with the official driver of the vendor.
Now all the databases were accessed by Node.js with their vendor offered driver. So maybe their parsing algorithm for the data was just better than ours. Which is just JSON parsing in Node.js.
Good, thank you very much. Whoever likes a t-shirt, I have a couple of those. So please come up front and you can pick them up. Thanks.