Handling Billions Of Edges in a Graph Database
Formal Metadata
Title 
Handling Billions Of Edges in a Graph Database

License 
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Release Date 
2017

Language 
English

Content Metadata
Abstract 
The complexity and the amount of data keep rising. Modern graph databases are designed to handle the complexity, but not yet the amount of data.

Keywords 
Databases 
00:07
This slide you missed was just the one about me, so we have already covered me. Let's talk about graph databases. Who here has already worked with a graph database? OK, a couple of you know what they are; for the rest of the group I will give a short introduction.

What are graph databases? Graph databases store so-called schema-free objects, which they call vertices. You can imagine a vertex as one row in a relational table, with the difference that you do not declare which attributes each row has: every vertex can carry different attributes if you like. They also store relations between those vertices, called edges, which you can compare to many-to-many relations in a relational database. Edges have a direction, so one of the vertices is the source and the other one is the target.

One example: on the left-hand side we have a couple of vertices. This one has the attribute name with the value Alice and age 32; we have Bob over here and a hobby over there, which are also vertices. Then we have the relations between them, for example "Alice has a hobby". We could store all of this in a relational world, no problem at all; the difference lies in the mechanisms we have to query the database.

First of all, edges can be traversed in both directions. I can say: start at Alice and only follow edges that point out of this vertex. Or I can start at the hobby and only follow edges that point into it. I can even say: I don't care in which direction the edges point, just follow them, which a relational database cannot easily express.

You can also query a range of path lengths: pick a starting point and say "please go one to five edges", and it is one line in the graph database. In a classical relational database you would need one statement with a two-step join, a second statement with three join steps, then one with four, one with five, and finally a union over all the results. We can go even further: "no matter how many edges it takes, go until you find the first element that matches the conditions", whether that needs one edge or 500. I can even say: pick two of those vertices and find the shortest connection between them. In many cases the shortest path is not deterministic, because there may be a lot of paths of the same length, and any of them can be the result.

What are typical queries that graph databases perform very well at? "Give me all the friends of Alice": start at Alice, go one step, and collect all connected vertices. "Give me all the friends of friends of Alice": start at Alice, go two steps, but everything reachable within one step should not be part of the result. "What is the link between Alice and Eve": pick two vertices and find the shortest path; as I said, this is only one possible result, and other paths of the same length, two edges in both cases, would be equally valid. Or: we are standing at a train station, we buy a ticket, and the ticket says you can go up to six stations from here and switch as many lines as you like; where can I go? In a graph this is easy to answer by following all routes of up to six edges.

The most commonly used query on a graph database, though, is so-called pattern matching. Pattern matching means you describe a part of the graph, a subgraph, with some unknown components, and you match this pattern onto the entire graph to find all the subgraphs that fit those unknown components such that the pattern matches. For example: "give me all users that share two hobbies with Alice". The pattern is: Alice has a relation to one hobby and a relation to a different hobby, those two not being equal, and we need to find a friend that has a relation to this hobby and to that hobby as well; that friend is the result. Of course you can make it arbitrarily complicated: "give me all the products that at least one of my friends has bought together with a product that I already own". Alice has bought a product; we need to find someone with a direct friendship to Alice who has bought the same product and has also bought something else that Alice has no relation to yet, because that product is most likely interesting for Alice and we would like to recommend it to her.

And what are bad queries for a graph database? Queries it could probably answer, but is not optimized for. The short answer: everything that is based only on the attributes of the objects and does not care about the relations between them. Graph databases are good when you search along relations; they are bad if you only need to look at the objects. A bad query would be "give me all users with an age attribute between 21 and 35", because that needs a full table scan, and since graph databases are organized differently from relational databases, this is even more costly than a full scan in a relational one. "Give me the age distribution of all users": same reason. "Group all users by name" is even worse, because that is a full table scan plus grouping afterwards.
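A toy model of the vertex and edge storage and the direction-aware lookups described above can be sketched in a few lines of Python. This is purely illustrative (the class and keys are invented for this sketch), not how any particular database implements it:

```python
from collections import defaultdict

class TinyGraph:
    """A toy property graph: schema-free vertices plus directed edges."""

    def __init__(self):
        self.vertices = {}                  # key -> attribute dict
        self.out_edges = defaultdict(list)  # source key -> [target keys]
        self.in_edges = defaultdict(list)   # target key -> [source keys]

    def add_vertex(self, key, **attrs):
        self.vertices[key] = attrs

    def add_edge(self, source, target):
        self.out_edges[source].append(target)
        self.in_edges[target].append(source)

    def neighbors(self, key, direction="ANY"):
        """Follow edges OUTBOUND, INBOUND, or in ANY direction."""
        result = []
        if direction in ("OUTBOUND", "ANY"):
            result += self.out_edges[key]
        if direction in ("INBOUND", "ANY"):
            result += self.in_edges[key]
        return result

g = TinyGraph()
g.add_vertex("alice", name="Alice", age=32)  # vertices may carry
g.add_vertex("bob", name="Bob")              # different attributes
g.add_vertex("tennis", hobby="Tennis")
g.add_edge("alice", "tennis")                # Alice --has_hobby--> Tennis
g.add_edge("bob", "tennis")

print(g.neighbors("alice", "OUTBOUND"))  # ['tennis']
print(g.neighbors("tennis", "INBOUND"))  # ['alice', 'bob']
```

The same data would fit in a relational schema; the point is that the direction-aware neighbor lookup is a first-class operation here rather than a join.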
07:42
The reason why grouping is so expensive while graph queries are so cheap is the core query mechanism of a graph database: the traversal. Let me explain how it works with a simple example. We iterate down two edges, two steps, starting at a vertex S, and apply some filters on each of the components we find.

What does the database do when you issue such a query? First of all it picks the start vertex; a traversal always needs one starting point. Then it looks at all the edges connected to this vertex and thereby finds all the neighbors those edges point to. Then we apply the filters on what we just found and figure out that, say, C does not match while A and B are still OK. Now we have candidates at depth 1, but we said we want to go two edges. So the graph database picks one of the candidates, say A, and does the same thing: it looks at all edges connected to A and their vertices and applies the filters again. Now it figures out that S to A to D is a path of length 2 on which everything matches the filters, so this path is returned; there may be some projection afterwards to return just one element or whatever the user requested, but this is what the graph database has to compute to find the first result.

Now that we have the first result: do we need to look at further edges of D? No, because we do not want longer paths, so we go back to A and check whether there is an edge we have not processed yet. In this case there is none left, because the other candidate was filtered out, so the database goes back to the start vertex and checks whether any edge there is still unprocessed. It finds the one to B, applies the same scheme, checks the edges, applies the filter conditions, and if they match we find additional results.

What I just described is a so-called depth-first traversal: starting at a point, I first go as deep into the graph as necessary to find results quickly, and then I backtrack. The alternative is the so-called breadth-first traversal: starting at the start vertex, I first compute all vertices reachable at depth 1, and only after finishing that do I compute all vertices at depth 2, no matter whether they are reached via A or B or C, then depth 3, and so on. That changes the ordering of the results and allows for special conditions. If you remember the friends-of-friends query: I can tell the algorithm "whenever you find something in the first step, remove it from what you find in the second step", so the graph database can easily return the friends of friends while ignoring the direct friends.

Let's talk a bit about complexity: how expensive is a traversal? First we have to find the start vertex. We pay this cost only once, and how fast it is depends on the index we can use: best case an in-memory hash index with constant-time lookup; it could also be a sorted index with logarithmic lookup; and if we have no index at all, a full table scan.

The more important part: for every depth we have to do four steps. First, find all connected edges: we have a vertex in hand, so which edges point out of or into it? This is typically answered by a so-called edge index or by index-free adjacency (I will come to the differences later). The important thing is that this operation needs constant time, otherwise the graph database cannot compete at all, because this is the operation it executes all the time. Second, we filter out the non-matching edges, which we can only do by taking one edge at a time and checking whether it matches, so constant time per edge and linear in the number of edges found. Third, we find the connected vertices; again this depends on the index, best case an in-memory hash index in constant time, but again once per edge. Fourth, we apply the filters on those n vertices as well. All of this can be done in roughly three times n, so linear complexity.

O(n) may sound evil, but the point is: it is not linear in the total amount of edges you store. You can store billions of edges and a simple traversal can still be fast, because the traversal only depends on the edges actually connected to the vertices it visits. If you make sure that only around 20 edges are connected to each vertex, you can store billions of edges and still find all your results pretty fast. Traversals scale only with the result size, with the number of documents they actually have to check; they are not affected by the total amount of data.

But, and this is the important part, the cost increases significantly with the search depth: when I find n matching edges at depth 1, I have to repeat the whole procedure for all those n edges at depth 2, and so on, so the depth is effectively the exponent in the complexity. And there is a rule of thumb that all naturally growing graphs, social networks and everything else where user interaction connects things, exhibit the so-called seven degrees of separation: pick any two random vertices, and the shortest connection between them is at most 7.
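The depth-first versus breadth-first behaviour just described can be sketched as one function where only the frontier discipline differs. This is a simplified model of the traversal (returning only paths of exactly the requested depth, with a single vertex filter standing in for the per-depth filters):

```python
def traverse(out_edges, start, max_depth, keep, order="DFS"):
    """Return all paths of exactly max_depth edges whose vertices pass `keep`.

    out_edges: dict mapping vertex -> list of neighbour vertices.
    order: "DFS" dives deep first, "BFS" finishes each depth first;
    both find the same paths, only the exploration order differs.
    """
    results = []
    frontier = [[start]]  # paths under construction
    while frontier:
        # DFS treats the frontier as a stack, BFS as a queue.
        path = frontier.pop() if order == "DFS" else frontier.pop(0)
        if len(path) - 1 == max_depth:
            results.append(path)
            continue
        for nxt in out_edges.get(path[-1], []):
            if keep(nxt):  # filter candidates at every depth
                frontier.append(path + [nxt])
    return results

edges = {"S": ["A", "B", "C"], "A": ["D"], "B": ["E"]}
paths = traverse(edges, "S", 2, keep=lambda v: v != "C")
# both ['S','A','D'] and ['S','B','E'] survive; anything via 'C' is pruned
```

The friends-of-friends variant would simply make `keep` reject any vertex already seen at depth 1.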
15:17
I think on Facebook it is about 5 already, because on Facebook you have some "friends" who are not really friends; you simply have more friends than you can maintain, so the number of edges increases and the distance between two vertices shrinks. If you don't believe me, try it out on Facebook: take any random person you don't know yet and figure out how many people are in between. This has an implication: whenever you formulate a traversal query with a depth of around 7 or more, it is actually cheaper to just dump the entire database, because you will look at all of it anyway.

So much for the theory; let's get a bit more hands-on with some technology. I would like to introduce the open-source project I am working on, called ArangoDB. It is a multi-model database: multi-model means it can store key-value pairs, documents, and graphs within the same core, just one technology, not several different databases glued together underneath. It has a unified query language that lets you query documents, graphs, and key-value pairs, even do joins across document collections, and all of the above can be combined in a single statement. So I can start with a document query to find the starting point for a graph traversal and join the result of the traversal with key-value lookups if I like. On top of that it supports multi-collection transactions, which is not typical for the so-called NoSQL world.

Let's talk a bit about the query language. It is inspired by a lot of other query languages, but it does not follow the relational SELECT-star thinking; it takes more of a programmer's point of view, so we are working with iterators and FOR loops. The simplest query we can imagine: for every user object in the users collection (a collection is the ArangoDB term for what a table is logically, a group of elements), just return that element. This is a simple "select everything from users". Of course we can apply filters: only return those users where the name is Alice, and we will now find the one object that is called Alice. The optimizer is clever enough to figure out that there is an index on name and that it is probably a good idea to use it, so this query will be fast.

A graph traversal is executed in the same style, and with options you can choose between the depth-first and breadth-first execution I just described: "for product in outbound user has_bought" gives the direction to follow (outbound or inbound), starts at the user, follows the has_bought edges, and returns the product. In this case we would figure out what Alice has bought at some point. And of course the entire thing can again be combined with filters.

A traversal can emit up to three return values: the first is the vertex, the second the edge pointing to that vertex, and the third the entire path. In our example with three steps, the PlayStation at the end would be the recommendation, the edge would be the action, and the whole chain would be the path; whichever you need in your result, you just return it. On the path we can apply filters, for example: the second vertex on the path should be in the same age range as my user. We filter on this object, and the optimizer figures out that as soon as the algorithm stands at that point it can check the condition and decide "this one doesn't match", so everything behind it never has to be computed; we can abort early. And we can do additional filtering on the other returned objects, for example the recommendations, combine it with limits or sorting or whatever we need, and return the result.

Now let's talk a bit about the challenges we face when the data gets big. The most common challenge is so-called supernodes, because many graphs contain celebrities: vertices with very many inbound or outbound edges. Take Twitter and look at the number of followers that, for example, Barack Obama has; I think he has just one or two more than I have. He is such a celebrity, and whenever a traversal runs across his vertex the entire query will take a lot of time, because it has to linearly scan all of his followers. If it runs across me, it just checks the couple of hundred followers that I have, which is pretty fast.

So you need to optimize for these naturally growing supernodes. In most cases you do not need all followers, only a subset, for example the newest 10. To find those faster you can use a so-called vertex-centric index. A vertex-centric index indexes vertices together with arbitrary attributes of the edges, for example a timestamp. If you make this index sorted, you get the newest follower of a given vertex in logarithmic time, and finding the next 10 just means taking the next elements from the index. With this trick we can figure out the initial set of edges we need to look at faster than with a linear scan; we get a reduced set of edges, thereby reduce the n in the complexity, and the traversal becomes faster. To prove that this actually works, I will show it to you live, so please cross your
22:54
fingers. The browser is already open; this is ArangoDB's web interface. What we see here: I have two collections. One of them is a document collection, and the documents are our vertices in this case; the edges connect those documents. No special dataset, nothing fancy. On the edges I have no extra indexes defined, only the edge index and the primary index, and I have no indexes on the documents either, but those do not matter for this query. I run a simple traversal starting at one of my vertices, following the edges, and returning everything I find, but, importantly, I apply a filter condition on the first edge. Let's execute it: we get the result, 10 elements, in about 17 and a half milliseconds on my local machine; in memory, of course, so not that impressive yet. Now I go back to the edges collection, to the indexes, and add a vertex-centric hash index on _from, because I am doing an outbound search, and on value, which is the attribute I used in the filter condition. Create it, go back to the query, and execute the same thing again: half a millisecond instead of 17 and a half, just because we could use the index and no longer have to look at a lot of edges that we do not need. This does not always work, but it is a good start, and it only works if you have a filter condition on an attribute of the edge, so make sure you put some of the information into the edges as well.
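The effect shown in the demo comes from a sorted composite index: instead of scanning every edge of a vertex, the lookup jumps straight to the matching (vertex, attribute) range. A hypothetical sketch of that idea using Python's `bisect` (the edge tuples and values are invented for illustration):

```python
import bisect

# Edges as (from_vertex, value, to_vertex). The "vertex-centric index" is
# simply this list kept sorted on the composite key (from_vertex, value).
edges = sorted([
    ("alice", 1, "v1"), ("alice", 2, "v2"), ("alice", 2, "v3"),
    ("alice", 9, "v4"), ("bob", 2, "v5"),
])

def outbound_with_value(index, vertex, value):
    """Logarithmic jump to the first matching edge, then walk the range."""
    lo = bisect.bisect_left(index, (vertex, value, ""))
    hi = bisect.bisect_right(index, (vertex, value, "\uffff"))
    return [to for (_, _, to) in index[lo:hi]]

print(outbound_with_value(edges, "alice", 2))  # ['v2', 'v3']
```

With a supernode carrying millions of edges, the linear scan disappears: only the edges inside the matching range are ever touched.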
25:24
However, this was running on my local machine, and that is not the topic of this talk; the topic is big data. The challenge: what if the graph is larger than what fits on my machine? Datasets easily grow beyond a single machine; some of our customers say "do not even talk to us if you don't scale to 100 machines, because we cannot fit our data into a single one; we need at least 100". And of course this includes graph data.

So let's talk a bit about scaling, about distributing the graph across different machines. We can apply the general pattern called sharding: we cut the graph into pieces and put those pieces on different machines. Pretty easy to do, but now it gets harder. First of all, it is no longer possible to get a global view of the graph on one machine; the assumption is that it does not fit, so we cannot have one. And if we just cut the graph into pieces, there will be edges between different servers, so not all data is stored locally anymore. In a sharded environment the network is typically the bottleneck: every network operation slows down the entire query, because the network is of course slower than local memory. So we need to reduce the number of network hops. Vertex-centric indexes can still help with supernodes, but only within each machine; not everything is solved by putting a vertex-centric index on it.

So we shard the graph, and only part of it is stored on each machine. Neighboring vertices may live on different machines, and if you remember how the traversal works, that means whenever we stand on a vertex whose neighbors are on another machine, we have to hop over to that machine, do some traversal steps there, and maybe hop back. In some graph databases, ArangoDB among them, even the edge could be stored on a third server, making it worse still. So we need a mechanism for queries to be executed in a distributed way, with the partial results merged afterwards.

We now have the choice of several distribution techniques: how do we distribute the graph across the servers we have? The simplest is what I like to call random distribution: take every object in the graph, throw a die, and depending on the result put it on one of the machines; the technical term is consistent hashing of the key. The advantages: every server takes an equal portion of the data, so I can easily estimate how many servers I am going to need; it is easy to realize; it requires no knowledge about the data, it just always works, which is a big plus; and it is easy to set up. The disadvantages: neighbors most likely end up on different servers, with the edges probably on yet other machines than their vertices, so there is a lot of network overhead when executing a query. The graph ends up scattered across the machines such that nobody can tell which parts are closely connected, and a query has to jump from server to server and back again; rather slow. Some optimizations are possible there, but not too many. Still, it works, and I am going to show you. I am now leaving my local machine and going over to a cluster standing in our office in Cologne, into which I have imported a couple of datasets.
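The trade-off of random distribution is easy to demonstrate: hashing document keys spreads the data evenly, but most edges end up crossing servers. A small simulation under invented numbers (a 1000-edge chain graph on 3 servers), not a benchmark of any real system:

```python
import hashlib

def server_for(key, num_servers):
    """Consistent-hashing-style placement: hash the key, pick a server."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_servers

NUM_SERVERS = 3
# A chain graph: vertex i is connected to vertex i+1 -- maximally local
# structure, so any remote edges are purely the fault of the placement.
edges = [(f"v{i}", f"v{i+1}") for i in range(1000)]

remote = sum(
    server_for(a, NUM_SERVERS) != server_for(b, NUM_SERVERS)
    for a, b in edges
)
# With random placement on s servers, roughly (s-1)/s of all edges are
# remote -- about two thirds here -- and each costs a network hop at
# query time.
print(f"{remote}/{len(edges)} edges cross servers")
```

The placement is perfectly balanced and requires zero knowledge about the data, which is exactly the advantage described above; the remote-edge count is the price.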
30:25
For now we will only focus on four of those collections: profiles_random with relations_random, and profiles_smart with relations_smart. The dataset I imported is a social network called Pokec, which is available from Stanford University and contains the social data of, I believe, a Slovak online community. It has 1.6 million profile documents, each carrying a serious amount of data, so it is quite large, and the relations are 30 and a half million edges just connecting those profiles; no attributes are stored on the edges. As you can guess from the name, the "random" collections were imported with random distribution, so I do not know where the profiles are located, I do not know where the edges are located, and neither does the query optimizer. Still, I am able to execute any query on this data. I run a three-step traversal, applying one filter on the first depth, another on the second, and another on the third; we apply them on the vertices, as the edges have no attributes, and then return whatever I find, starting at one of the vertices and following the randomly distributed graph. Execute, and it takes a while: roughly one and a half seconds for 1,400 elements. It has actually looked at far more than that, because the filter conditions threw out a lot of paths along the way. It works, and in some cases this might be enough, but in most cases it is not, so we need to do something which is a bit
32:47
more clever than this. But first, let's talk a bit about index-free adjacency. Index-free adjacency is used by most other graph databases, for example Neo4j, which you may know; it is the most well-known graph database nowadays. But Neo4j does not scale in this sense: it has the hard limit that the entire dataset has to fit on the disk of every machine in the cluster, and if it does not fit, Neo4j breaks down and will not help you at all. The reason for this limit is, in my opinion, index-free adjacency: it means that every vertex maintains two lists of edges, in and out, and no index is used to find those edges because they are physically stored next to their vertices. How would you shard that? Parts of it are easy, when both endpoints of an edge are on the same machine, but what about the others? ArangoDB instead uses a hash-based edge index, so constant-time lookups, which in terms of complexity theory is equal to following a physical pointer. And it gives us the advantage that a vertex is independent of its edges: an edge can be stored on a different machine than either of its vertices, so we can place the pieces freely.

Now let's talk about a more clever distribution of the data in the cluster. What we have seen is that many graphs have a natural structure that divides them into smaller sub-parts, where most edges stay within the same part and only a few edges go across. Look at social networks: you typically have more friends in your local country or region than abroad; you have a few abroad, but most are located close to you. Other examples: blogs about the same topic, or products in the same category, are typically interrelated. The characteristic of such groups is: most edges within the same group, few edges between groups. If we use this knowledge about our domain and shard by those groups, we can enforce that large connected parts of the graph end up on one physical machine, so we can still make sense of each part locally while other parts are remote. The Enterprise version of ArangoDB uses this domain knowledge for shortcuts and optimizations.

So let's do this. I think I have not described the setup yet: I have three physical servers running three database servers, which are responsible for storing the data, and two coordinators. Coordinators are the user-facing API you send your queries to; they figure out which database server is responsible for which part of the data, create the query plan, and do some optimization. Each coordinator runs on the same physical machine as one of the database servers, and I have used three shards for my collections. (I think I somehow lost the connection, maybe the VPN; I had some trouble before; just give it a second... there we go.) So again: we still have the same collections, relations_random and profiles_random, and now profiles_smart and relations_smart: the exact same dataset, but with a different distribution. We have taken the region attribute stored in each profile to group those people together on the same machine. If I now execute the same query, but starting in the smartly distributed dataset, which is the only thing I change, we are down to a couple of hundred milliseconds, just because we can take shortcuts: most of the relevant graph is stored on the same machine and we need far fewer network hops. Good. Next slide: how does it work, and why does it work? Let me explain a
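A sketch of why sharding by a domain attribute such as region cuts network traffic, under the assumption stated above that communities are mostly connected internally (all names, regions, and sizes here are invented for the illustration):

```python
import hashlib

def shard_of(text, n):
    """Place by hashing some text: the document key for random
    distribution, or a domain attribute for smart distribution."""
    return int(hashlib.md5(text.encode()).hexdigest(), 16) % n

N = 3
regions = ["bratislava", "kosice", "presov"]
# 300 profiles, 100 per region; friendships stay inside the region.
profiles = {f"p{i}": regions[i // 100] for i in range(300)}
edges = [(f"p{i}", f"p{i+1}") for i in range(300) if i % 100 != 99]

def remote_edges(placement):
    """Count edges whose endpoints sit on different servers."""
    return sum(placement[a] != placement[b] for a, b in edges)

by_key = {p: shard_of(p, N) for p in profiles}            # random
by_region = {p: shard_of(profiles[p], N) for p in profiles}  # smart

print(remote_edges(by_key), remote_edges(by_region))
```

Because every edge here stays inside one region, region-based placement makes every traversal step local, while key-based placement leaves roughly two thirds of the edges remote. Real graphs are not this clean, but the fewer cross-group edges there are, the closer you get to this ideal.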
38:38
First of all, here is the cluster setup I just talked about: we have the coordinators, we have the database servers, and the graph is stored across the database servers. Now we traverse a long path. We start here, then we need one network hop over to another server and another one back, because this server doesn't know that part of the graph and the other one knows only its own part — so that's already one, two, three network hops, then four, five, six, just to find this one path. If we make the distribution more clever by grouping the communities together, we still have one network hop — but only one — and the rest of the path is found locally on the database servers, which of course speeds up the query.

What I just talked about were so-called search use cases: I pick one specific starting point and search in a small area around that point. But some other graph algorithms need to touch much larger portions of the graph. The PageRank algorithm, for example, needs to visit all vertices stored in the database, because it has to compute a score for every one of them. The same goes for community detection — finding exactly those groups I just used for the smart graph distribution; that is, for instance, what Facebook uses when it suggests people you are probably friends with as well. A traversal is not suited for those tasks, because it would mean first doing a full collection scan to find all starting vertices and then traversing from them one by one — way too slow. We need something iterative, some batch-processing model, and we need more parallelism: a traversal does one local search, finishes it, moves on to the next start vertex, and does nothing for all the other vertices. We need to parallelize this — we want to use the computational power of all our machines at the same time, and still we need to limit the network overhead.
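The iterative, message-passing batch model the talk introduces next (Pregel) can be sketched as a minimal superstep loop. This is a toy sketch with made-up names, not ArangoDB's implementation: each vertex program sees only its own value, its outgoing edges, and the messages from the previous superstep, and swapping the message buffers implements the global barrier between steps.

```python
# Toy sketch of a Pregel-style superstep loop (illustrative, not ArangoDB's
# implementation). A vertex program sees only its own value, its outgoing
# edges, and messages from the PREVIOUS superstep.
def run_pregel(values, out_edges, program, max_steps):
    inbox = {v: [] for v in values}           # messages from the previous step
    for _ in range(max_steps):
        outbox = {v: [] for v in values}      # messages for the next step
        for v in values:                      # independent: could run in parallel
            program(v, values, out_edges[v], inbox[v], outbox)
        inbox = outbox                        # barrier: swap message buffers
    return values

# toy vertex program: every vertex adopts the largest value it has heard of
def max_program(v, values, edges, messages, outbox):
    values[v] = max([values[v]] + messages)
    for dst in edges:
        outbox[dst].append(values[v])

values = {"a": 1, "b": 5, "c": 2}
edges = {"a": ["b"], "b": ["c"], "c": ["a"]}
run_pregel(values, edges, max_program, max_steps=3)
print(values)  # {'a': 5, 'b': 5, 'c': 5}
```

Because each program touches only its own vertex and sends messages instead of reading remote state, no locks are needed and all vertex programs of one superstep could run in parallel across machines.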
However, we have to accept that these algorithms are more expensive, because they typically do a lot of processing work on large amounts of data — so expect no millisecond results. The framework we offer for this is called Pregel, and it was initially developed at Google, originally for PageRank. The idea of Pregel is that you define a worker program that runs on only one vertex at a time. This program can see only its one vertex and the outgoing edges, nothing else, and it can send messages to other vertices. That has the advantage that all the vertex programs are independent of each other: in theory I could start all of them in parallel, there is no overlapping data because each program sees only its own vertex and its outbound edges, and I don't need any locks, because a program cannot write to anything outside its own vertex and outgoing edges. So it is highly parallelizable.

The idea, then: we spawn many workers on all our servers — ideally as many workers as we have cores. One server is the so-called master and tells the other workers: please execute a step now. The algorithm runs in several execution steps: the master waits until all workers are done, and then either tells them 'we're not quite there yet, please do another round of execution', or decides to terminate the algorithm and report the result. Here is a picture of the workflow: the master organizes so-called supersteps and coordinates the workers. The workers work independently and can send messages to one another, but there is a global barrier: in superstep 1 I can only read messages from the step before, so no messages at all; in step 2 I can read the messages from step 1, in step 3 the messages from step 2, and so on. And we always have to wait until all workers have finished the current superstep — that is the global barrier. Let's make an example: PageRank. First of
all, we take all the incoming messages and sum them up — initially that sum is 0, because there are no incoming messages yet. Then we take our current score, which initially is 1 divided by the number of vertices, add the sum we just computed, and store the result as our new score. Then we send the new score, divided by the number of our outgoing edges, to all our connected vertices. Finally we tell the master we are done with this superstep, and the master decides: yes, we are finished, or no, please try again. From then on we receive messages from other vertices, compute the sum, and again send the sum divided by the number of edges to all connected vertices. That is the PageRank implementation.
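The vertex program just described can be written down as a small self-contained sketch. All names are illustrative, and note one assumption: the talk's verbal description omits the damping factor, which the standard PageRank update includes, so this sketch uses the usual damped form.

```python
# Sketch of a PageRank vertex program with an explicit superstep loop
# (illustrative; the damping factor is an addition over the talk's
# simplified verbal description).
def pagerank(out_edges, rounds=30, damping=0.85):
    n = len(out_edges)
    score = {v: 1.0 / n for v in out_edges}   # initial score: 1 / #vertices
    inbox = None                              # superstep 0: no messages yet
    for _ in range(rounds):
        outbox = {v: [] for v in out_edges}
        for v in out_edges:
            if inbox is not None:             # supersteps >= 1
                # sum the incoming messages and fold them into the new score
                score[v] = (1 - damping) / n + damping * sum(inbox[v])
            # send new score / out-degree to all connected vertices
            for dst in out_edges[v]:
                outbox[dst].append(score[v] / len(out_edges[v]))
        inbox = outbox                        # barrier between supersteps
    return score

edges = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
ranks = pagerank(edges)
# "b" has the most incoming links, so it ends up with the highest score
assert max(ranks, key=ranks.get) == "b"
```

As the talk notes next, around 30 rounds are typically enough for the scores to stabilize.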
45:19
Typically we let it run for about 30 rounds and get a quite stable PageRank distribution across the data. What we then find is that vertices with lots and lots of incoming edges get a very high score, while vertices that only have lots of outgoing edges get a pretty low score. And it's no sneak peek anymore: ArangoDB 3.2, released a couple of weeks ago, shipped the first version of this. So we have support for batch processing on graphs, with implemented algorithms for PageRank, for community detection, and for so-called single-source shortest path — I name one start vertex and get the shortest paths to all other vertices — plus some centrality measures. So that leaves the last ten minutes to answer some questions. Thank you very much — if you like the project, please follow us on Twitter and give us some stars on GitHub; we are an open-source project, we live off those stars. Now I'm happy to take questions.

[Audience question about schema migration.] Yes — so the question is: ArangoDB is schemaless, but there is an implicit schema, so what about some kind of schema migration from A to B? In the ArangoDB core there is no such thing, because that concern is moved into the application logic — you can keep several versions of the data side by side. But we have an extension called Foxx, with which you can define a REST API on top of ArangoDB. It has schema validation, and it can actually do schema migration on the fly: whenever a request comes in through the Foxx API, the schema validation can send out the newest schema version to the user and store an upgraded version back into the database. But there is no built-in background process that says 'please upgrade all documents now'.

[Audience question.] Yes — so the question is: you are
already running your application, you have imported a lot of data, and then you come up with a better sharding and don't know how to continue. First of all, of course, you could call us and we would help you. But to help yourself, you could for example run a community detection algorithm — which will probably take a while for such a large amount of data — and it ends up with a pretty good distribution. Then you would just reshard the data into that better distribution, and the queries should be faster again, because then most of the edges are within the same group. That would use the Pregel framework we just discussed. There is no live resharding — no changing of the shard key on the fly — but you could migrate over to a new collection, or dump and restore the data with the new sharding. That can be done against a running system, but not as a background process on the same dataset; that is something we would like to add in the future.

[Audience question.] So the question is how we handle concurrency — if someone deletes edges while a traversal runs, does it still work? Well, ArangoDB has transaction mechanisms in the cluster, so some guarantees, but not full ACID across the cluster yet. If your traversal runs into an edge that someone else is deleting at that very moment, it depends on the order of execution: either the traversal still sees the edge, or the edge is already gone and we simply do not follow it. On a single server — and in this case the edges of a vertex are located on a single server — we have ACID guarantees for the single instance, but not for the entire cluster. So per server it is well defined whether you see the edge or not, but that answer may differ across servers.

[Audience question.] OK, the question is about our roadmap for the future. What we are working on is datacenter-to-datacenter replication, and that
includes even better transaction handling in the cluster, because we need that for it. Then — I don't know if I'm allowed to tell this yet — we will add an additional data model, so to say, which allows for full-text search and even more; that is not yet fully defined, but it will be added in the next versions. And of course we are doing more optimization work, because our focus right now is cluster scaling.

[Audience question.] Yes — we have compared against other graph databases, but to be honest I only have an
52:03
older version of the benchmark — it is from 2015, I guess; that should be stated somewhere. There we compared ourselves against the specialized solutions. When we did this benchmark we said: ArangoDB is multi-model, so probably we are not as fast in certain aspects as a specialized solution, and we just wanted to verify that. So we executed a lot of operations that are pretty common in those engines, and we figured out that we are not as bad as we expected. We thought maybe we would end up at 120 per cent everywhere — 100 per cent being the fastest engine — but we are actually a bit better than that in the cases we tested. For the graph features we ran shortest path, neighbors, and neighbors including the profile data; plain neighbors does not include the data, and 'neighbors' here always means neighbors of neighbors, made distinct. For shortest path we picked some random pairs of vertices and computed the shortest path between them: there we were about 7 times faster than Neo4j and about 22 times faster than OrientDB. For neighbors we actually implemented a client-side neighbor search for MongoDB, which was 2 times slower than our implementation — but still 2 times faster than Neo4j's built-in single-step implementation of neighbors. If we include the profile data, everything moves closer together, because then parsing the data and shipping it across the network — using the dataset I have just shown you — takes a significant share of the time. There we are only about 1.5 times faster: the performance gain from the simpler algorithm is eaten up by shipping a lot of data across the network, which becomes the bottleneck.

[Audience question about how the neighbors results can be explained.] Yes — so, to state it again:
neighbors in this case means neighbors of neighbors — two steps — which is labeled accordingly on the left-hand side of the slide. What I think is happening: Neo4j stores, together with the vertex, the list of its edges, but not the vertex on the other side of each edge. So they only have the edge and then have to look up the other vertex, and maybe that lookup is slower than in our case — finding the opposite vertices may simply be slower for them. But I don't have that much insight into their internals; I just know the numbers from the benchmark. OK, one more question.

[Audience question.] So the question is about the difference between neighbors and neighbors-with-profiles, comparing ArangoDB and the tabular PostgreSQL. The thing is, neighbors-with-profiles was not run on the same amount of data — it used less data than plain neighbors, because the run already took a long time. And I don't know exactly why PostgreSQL is actually faster there; maybe the client implementation it uses is just better at serializing and deserializing the data. The benchmark was executed in Node.js, and all the databases were accessed through the official Node.js driver of the respective vendor — so maybe their parsing of the result data was simply faster than ours, which is plain JSON parsing. OK — good, thank you very much. If you would like one of the t-shirts, I have a couple of them here, so please come up front and grab one.