Raphtory: Streaming analysis of distributed temporal graphs
Formal Metadata
Title: Raphtory: Streaming analysis of distributed temporal graphs
Title of Series: FOSDEM 2020
Number of Parts: 490
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46976 (DOI)
FOSDEM 2020, 80 / 490
Transcript: English (auto-generated)
00:05
Oh, so hello, my name's Ben. I'm over at Queen Mary University of London, and I'm gonna be talking about Raphtory, a system that my two advisors and I have been building, which is looking at distributed temporal graphs and how we're building and maintaining them from a set of event streams.
00:21
So I just wanted to give you a little bit of an idea of why we got to this point. So kind of some of the original distributed graph processing systems had this idea that you've got a big chunk of data on disk, you've got some chosen algorithms, so you load it in, turn it into your graph, you churn through in a couple of iterations and out pops your result. And then if you, say, want to see how things have changed throughout time, you might have snapshots, you know,
00:41
once a day for the last six months. And again, you load all these in, build them into graphs, you get a set of outputs, and then you can do some deltas between them to see how things have changed. And that's quite coarse, and you know, if you only have snapshots once every day, then you kind of lose what happens in between and so on. This has kind of improved with these stream-based graph processing systems
01:01
where you have some event source out in the wild. So some of the examples we've been looking at are like cryptocurrencies, mapping data, so people moving around cities, and obviously social networks. And then changes in these event sources can then affect your in-memory graph. So in the case of a social network, you might have a user joins the network, someone follows their friend, and so on. So these can all be inserted in.
01:22
And then users of your system can then query this, they request processing and get their results back. So this is great if you want to do some analysis on the most recent version of the graph, or alternatively, if you've got some metric that you're interested in monitoring and then seeing how this changes over time. So what we were thinking was, well, if you've got all these changes coming in, and you have all these problems with trying to keep your graph in sync
01:41
and up to date, why don't we just try to keep all of the changes and build a full temporal graph? So this could, in some ways, simplify the way that we actually synchronize, but then also allow us to do things like comparing the newest state to all of the previous versions of the state, and then actually do proper temporal queries. So something like, if you're doing the shortest path, it might be, I only want to go out on edges that are younger than the one I come in on,
02:01
or alternatively, you might have, say, for planes flying around, edges that only exist for a certain period of time, and you need to get there and get on that edge before it disappears. So with those ideas in mind, we came up with Raphtory. Our initial work was on formalizing this temporal graph model and the update semantics, so how we add and remove vertices and edges,
02:22
as well as updating a set of properties that they have, so a key-value set of properties associated with them; how we actually distribute and manage this graph in memory, so we have a set of partitions which hold a set of vertices and edges each, and then how we stream all of these updates into these partitions and keep them in sync; and then we also provide this sort of Pregel-like temporal graph analysis model
02:41
in which the user can request to do some analysis on the live graph, or at any point back in time, down to the resolution of the actual timestamps on the data, so you could say, what did it look like last Thursday at 3:02 in the afternoon, and then actually look through ranges, so sort of hop throughout the whole history of the graph and compute these different metrics and see how they change. So I'll just go over a quick run through the architecture.
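As a rough illustration of that query model, here is a minimal Python sketch (Raphtory itself is JVM-based, built on Akka, so this is purely illustrative), treating the ingested history as a flat list of (timestamp, update) events:

```python
# Minimal sketch of the query model, assuming the ingested history is a flat,
# chronologically sorted list of (timestamp, update) events with integer times.

def view(events, t):
    """Graph state as of time t: keep everything up to and including t."""
    return [e for e in events if e[0] <= t]

def window(events, t, w):
    """Only the activity that happened in the span (t - w, t]."""
    return [e for e in events if t - w < e[0] <= t]

def hop(events, start, end, step):
    """Range query: one view per step, hopping across [start, end]."""
    return [view(events, t) for t in range(start, end + 1, step)]

events = [(4, "add v1"), (8, "add v2"), (14, "add edge v1->v2")]
snapshot = view(events, 10)        # the graph as it looked at time 10
recent = window(events, 14, 5)     # only activity in (9, 14]
sweep = hop(events, 4, 10, 2)      # views at times 4, 6, 8, 10
```

The real system evaluates these filters lazily against per-entity histories rather than scanning a global event list, but the semantics are the same.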
03:02
So over here we have this data spout, so this is how the user decides how to connect to the outside world, so this is something like: I wanna read this file, connect to this database, listen to this Kafka stream, or something along these lines. This goes into a set of graph routers; effectively, what a router does is take a user-defined function describing what that raw input translates into
03:20
in terms of graph updates, so what is a vertex, what is an edge, whether something that comes in is actually an update to a property, and so on, and then it forwards it off to the correct partition manager, the partition of the graph which deals with the vertex or edge that is affected. And then as this is constantly running and maintaining the graph, users can submit analysis requests which talk to the partitions, so I'll go onto that in a second.
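A sketch of that spout-to-router step, with a made-up record format; the CSV layout, the field names, and the partition count are all assumptions for illustration, not Raphtory's actual API:

```python
import zlib

N_PARTITIONS = 4  # assumed cluster size for illustration

def parse_follow(raw):
    """Hypothetical user-defined router function: turn one raw record from a
    spout (here an imagined 'timestamp,src,dst' CSV line from a social-network
    feed) into a graph update."""
    ts, src, dst = raw.strip().split(",")
    return {"type": "edge_add", "time": int(ts), "src": src, "dst": dst}

def partition_for(update, n=N_PARTITIONS):
    """Route an edge update to the partition owning its source vertex.
    crc32 is used instead of built-in hash() so routing is stable across runs."""
    key = update.get("src") or update["id"]
    return zlib.crc32(key.encode()) % n

update = parse_follow("1517270400,alice,bob")
target = partition_for(update)   # the partition manager this update goes to
```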
03:42
So if we dive into one of these partitions, they'll have a set of vertices and edges as I said, and all of these will then have some history appended to them, so in this case, this vertex was created at time eight, then had some update appended to it at time 14, and this edge was created at time 14, at the same time as that vertex update, and was then deleted at some point later on.
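A minimal sketch of one such per-entity history, assuming it is a plain list kept sorted by timestamp, so that updates arriving out of order simply slot into their correct chronological position:

```python
import bisect

def insert(history, t, event):
    """Slot an update into its chronological position, whatever order it
    actually arrived in over the network."""
    bisect.insort(history, (t, event))

def exists_at(history, t):
    """The latest event at or before t decides whether the entity exists."""
    state = None
    for ts, ev in history:
        if ts > t:
            break
        state = ev
    return state == "add"

# The out-of-order case from the talk: the deletion arrives before the creation.
h = []
insert(h, 20, "remove")   # delete seen first
insert(h, 5, "add")       # late-arriving create slots in ahead of it
```

After both arrive, the history reads `[(5, "add"), (20, "remove")]`, so the entity is alive at time 10 and gone at time 25, exactly as if the updates had arrived in order.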
04:00
As we're split across several partitions, we use an edge partitioning algorithm, so this edge down here, because vertex one and vertex two are on different machines, is actually split across the two machines, and you can see the two halves are kept in sync. Right, so one thing that's really interesting about this type of history is that now all of our updates kind of become additive, so even if we have a delete happen first
04:21
and an add happen after, as long as we keep this chronological list, we can just slot them into the correct position, so you always end up with the right graph. So this is kind of nice, because if you have this problem of updates coming in the wrong order, in a lot of other systems you either have to drop them, ignore them, or you get an incorrect state in one of your partitions. So as an example of this, we may have, say, this edge add that comes in at time 14,
04:42
which partition manager one deals with because vertex one is the source node. So we insert that into the machine, the edge gets created, and vertex one gets updated. We then synchronize across to the other node, saying, hey, I've got an edge that I share with you, so that gets updated into vertex two, and the edge gets created there as well. So that's all fantastic,
05:01
everything's happened in the correct order, everything's brilliant. What happens if, say, an edge gets added before we get a vertex? Well, in this case, we can create both objects, the vertex actually just becomes a placeholder, and then, again, we synchronize and do everything exactly the same, and then if the vertex add comes in at a later point, we just slot that into the history behind,
05:20
so then if this comes in with all the properties, all this sort of interesting metadata about that vertex, it can be inserted at that point. And then obviously, things can go completely haywire, so for some reason, some packets have been lost, or the network's gone all over the place, and in this case, this vertex has actually been deleted before it's even been created. So in a lot of systems, you might find that this is just, okay, well, this is nonsense, let's just drop it and ignore it,
05:40
and that's obviously not what we wanna do here. So again, we have a placeholder object which holds its deletion. When the edge add in this instance comes into the other machine, it does what it needs to in that machine, and then it synchronizes across, at which point the vertex now gets its creation at the correct point, and we can insert this edge, and then because this vertex was deleted, all of its incoming and outgoing edges should be deleted,
06:01
so we don't have anything hanging, and that can then synchronize back. So even though this went sort of completely wrong, you still end up with the same state and the same temporal graph moving forward. We've put in some watermarking, so you know when it gets to the point where this is safe to do, or if you want, you can go with the approximate approach of: just give me what's in memory now. So on that point, I know this is a bit of a whistle-stop,
06:22
but we'll pop onto the analysis. So the general idea is that the routers are constantly ingesting new information from whatever source you specified, assuming it's unbounded, and then the partition managers constantly keep in sync with each other and wait for requests from an analysis manager. So the user says, hey, I wanna run this analysis. Can I submit it? So this goes off.
06:41
All the partition managers will then go through their set of vertices, run this sort of vertex-centric algorithm, and then return to the analysis manager. Analysis manager can say, okay, well, all my vertices have either decided to vote to halt or another iteration is required, and this will go back and forth until it's happy that it's finished and the result can be returned to the user. So what can the user actually request?
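The back-and-forth just described (partitions run a vertex-centric step, vertices vote to halt, the analysis manager iterates until everyone has halted) might be sketched like this, with `countdown` standing in as a toy stand-in for a real user-defined algorithm:

```python
def run_analysis(partitions, vertex_step):
    """BSP-style loop: every partition applies a vertex-centric step, each
    vertex votes to halt, and the analysis manager keeps iterating until all
    vertices have halted. `vertex_step(state) -> (new_state, halted)` is a
    hypothetical stand-in for a user-defined vertex program."""
    while True:
        all_halted = True
        for part in partitions:
            for v in part:
                part[v], halted = vertex_step(part[v])
                all_halted = all_halted and halted
        if all_halted:
            return partitions

# Toy vertex program: count each vertex's value down to zero, then halt.
def countdown(state):
    new_state = max(state - 1, 0)
    return new_state, new_state == 0

# Two partitions holding vertex ids 1, 2 and 3 with some initial state each.
result = run_analysis([{1: 3, 2: 1}, {3: 0}], countdown)
```

In the real system each superstep also exchanges messages between vertices across partitions; the sketch keeps only the iterate-until-all-halt control flow.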
07:02
Well, the first thing is that if we have this temporal graph in memory, we can say, okay, well, give me what the live graph looks like. So this is the most recent version of the graph, either watermarked, as I said, so this is kind of the safe live graph, or alternatively, you could ask for the bleeding edge, absolute most recent version
07:20
in which you have some margin of approximation, so it depends on what sort of use case you have there. Alternatively, you might say, okay, well, give me what it looked like last week, last month, a year ago, something like this, and these tend to be stored in memory, so we'll build that view, and I'll go over that in a second. Or alternatively, if Raphtory's been running for a very long time and you start to have to push the older stuff out of memory, then we can start loading some of those things back
07:41
if you wanna go back that far. We're also looking for ways of sort of offloading very old queries into a different set of partition managers, so that query sort of doesn't interrupt what's going on in the most recent version of the graph, but that obviously is future work. Cool, so if we then, say, we have this full history of everything that's been ingested
08:00
from time zero up until time N, the newest update, we might say, okay, well, I wanna see what the graph looked like at T10. And a good way of thinking about what a view is: it's kind of like a right-hand filter, so you say, okay, well, this is everything that's happened, I'm not interested in anything that happened after this point, so let's just get rid of that for the moment, so that you kind of get to see now
08:21
what the graph looked like exactly at that point of time, and then that can be used for your analysis. But one of the things we found was that if you're looking at very large data sets that have existed for years and years, then there's an awful lot of patterns that happen in the short term that kind of get hidden by this huge aggregate amount of data. So we added in something
08:40
we like to call graph windowing, which kind of is like the left-hand filter, and in this case, you're saying, okay, well, I'm only interested in things that have happened in this span of time, so it could be from this timestamp for the last day, the last week, the last month, and so on, and so then you can actually view these short-term patterns as well as long-term ones, and so for that, we actually offer windowing batches, so you can say, okay, well, start at this point,
09:01
and then just decrease the size of the window continuously until you've done all the ones you're interested in. As well as these individual views, you might actually say, okay, well, I'm interested in this sort of range of time, so over the last year or something, and I wanna hop through at some set interval, maybe an hour or a day, and again, you can do this,
09:22
so we can say build a view at the oldest point, time four, and then we can hop forward to time six, and a view is generated, then time eight and time ten. So again, if you're doing these ranges, you can have all this windowing and window batching on top as well. So that's obviously all very theoretical, and a concrete use case would be very nice,
09:42
as I imagine a lot of you are thinking. So one of the things we had looked at was a network called gab.ai. Has anyone heard of Gab? Good, I wouldn't think so. So Gab is a Twitter clone, kind of this right-wing forum, but they had an open REST API, so I downloaded all of their posts,
10:01
so they're like the big free speech thing, so I don't know, I assume that's why it's open, but so I scraped them all between the end of 2016 up until mid-2018, and we then had a look at, if we set a query running for that whole range of time and hopped forward an hour at a time, what do we see any changes in something like, just something simple like the largest connected component, so for this, we then set several different window sizes,
10:21
so we said, okay, have a look at a very small window, like an hour, a day, a week, a month, a year, and then the full aggregate graph, and the interesting thing here is that even though you're running kind of the same algorithm, you actually notice very different patterns, so the aggregate kind of shows all the connected component continuously grows, whereas actually if you look at something like the month, you have these peaks of interest, so this is actually Donald Trump's election,
10:41
this is the Charlottesville riots, so there's kind of like these peaks of activity when people join the network and start using it, and then that sort of drops down again, and then if we zoom in a little bit further down to the hour scale, you can see that everything above like a window size of a day, the largest connected component is always like 100% of the graph, so everyone's always connected, it doesn't really change very much,
11:01
however, for an hour, you get this lovely diurnal pattern as people kind of go to sleep and wake back up, so the largest connected component, so it's like 80% of the graph, so almost everyone is connected, but then as people start to go to sleep, this all breaks down into very small communities that are talking to each other in the wee hours, and then that brings back up again as people start coming back online, so yeah, so even doing the same query,
11:22
running on these sort of different lenses or views of the graph gives you very different results, so we're kind of starting to explore this a little bit now, and we're obviously interested in anyone that wants to talk about this sort of stuff. So on that point, if you are interested in using Raphtory, it is available on GitHub, it's all Dockerized,
11:40
and has some actually pretty dreadful scripts to run it, but I'm working on improving those, so yeah, so you can run it in there, there's examples of the actual GAB graph that I just went over, we've got loads of spouts for ingesting different data, so GAB, Twitter, Bitcoin, Ethereum, and loads of other random ones, we have actually ingested the whole Bitcoin and Ethereum graphs over a big cluster of machines,
12:01
and are working with a couple of different companies to do some know-your-customer entity resolution stuff. And so yeah, we also have multiple analysis functions, so things like connected components and PageRank, we're looking at information diffusion, so this is like spreading taint across cryptocurrencies, and then simple things like degree ranking and so on.
12:21
So for the future of Raphtory, we've just been funded by the Alan Turing Institute in London, for any of you guys that know it, to turn this from kind of the initial researchy project into an actual product that researchers can use, so we're partnering with Leeds University to look at some very large transport data sets, mapping people moving around cities,
12:41
and then see how that changes over time, if the council does something like they put in a pelican crossing, how long do they have to monitor to see sort of different changes in foot traffic, and also we're now spinning this out of Queen Mary into a company called Corograph, so if you do see this name pop up, then it's probably me, or not someone trying to steal my name, but yeah, if you are interested, please drop me a line,
13:02
or leave anything on the, I'm always on there, so thank you very much for listening. Yeah. Can you achieve performance improvements
13:21
by like taking snapshots? So the question is, can we achieve performance improvements by taking snapshots? So do you mean for the actual processing side, or for the ingestion side of things? For the processing side. So on the processing side, we're looking at this a little bit,
13:41
so all the in-memory stuff, when you build a view, all of the previous versions are already there, so you just go to the vertex and pull what it would look like at that point in time, so you'd filter initially, and then you can use that vertex as it exists in memory already. For the stuff that's pulled back in from disk,
14:00
we're having a look at different snapshotting slash replay mechanics to make sure that they work properly, so there'll be kind of this idea of, okay, every X minutes, you take a snapshot, and then you have the replay of messages using Akka, sort of the message replay, to say, okay, let's get back to exactly the point we're interested in, and then what's the sort of heuristics around that, so that's kind of the next step
14:20
that we're looking at at the moment, but yeah, for everything that's in memory, it works pretty much as is. If I've correctly understood you, you were thinking that I have one starting network, and a constant stream of messages coming in from a single point of truth. Now, if you have a simulation environment
14:42
where multiple people want to make different modifications of the network, and look at variants of your network, is that something that you can handle? So, we're having a look at perhaps some sort of vector clock implementation, or something where you get different timestamps from all over the place coming in. For the moment, our assumption is that
15:01
if you're attaching to an outside source, the timestamps coming in are the timestamps we're using. So for a lot of the sources, say for example the social networks, that's done within their servers. For the cryptocurrencies, that's done when the block is published. So, for most of our use cases, we've kind of focused on that. It'd be really interesting to see how we would do it, but I think for the moment,
15:20
it's not gonna be so much of a. My idea would be maybe you model some production, somebody says, okay, what happens if I place a production order here, and another person sitting in another office says, what happens if the. Oh, I see what you mean. And of course, this gives you variants of your production network. And if you don't want to keep full copies.
15:40
Ah, yeah, no. You want to keep these differences somehow. Ah, so yeah, so perhaps some sort of rollback feature would be if a sort of update that come in that shouldn't have come in was. So yeah, so we could, I don't know how we'd do it at the moment, but it's definitely something to consider actually. Yeah, or I'll put in my notes of things
16:02
to add in at some point. Yeah, no, thank you very much. Any more questions? Okay. What's the biggest challenge of implementing graph algorithms on these dynamic graphs? Or are there any things that you see from, like, because you have this kind of constant flow of data into the graph,
16:22
so is there anything that you did differently from implementing graph algorithms than you did here on a static graph? No, so at the moment, the kind of view is that when you build, if you're building a view which is kind of safe, so either it's been watermarked or is a previous point in time, then it's kind of just a static graph and that's fine. When it comes to, so if you have got like, you're doing analysis,
16:41
so obviously it runs in parallel, but if you're doing analysis on the most recent version, you do have some degree of approximation. We're having a look at like, if we can kind of work out what that degree is, or something around that, but we've only really just started working on the analysis the last sort of six to nine months. So it's definitely the next sort of frontier of it for sure.
17:01
Could you use this for pathfinding in dynamic environments, things like for maybe like self-driving cars, but also maybe like in games that have dynamic environments? Or is it too slow for that? So I guess the question is on,
17:21
can you use it for pathfinding in dynamic environments? I think it depends on the sort of speed at which you're interested in. And of course the size of the graph. So I think you probably could, if you're going for like a, if you're interested in maybe around a couple of hundred milliseconds, I don't know if you're interested in sort of
17:41
proper real time and it needs to be sort of microsecond or sub-millisecond, it probably wouldn't return fast enough. But then it's been more optimized for general queries throughout time, like the ones we're showing, so you kind of chunk it in, you leave it running, and it kind of goes forever. Again, if we had the data to play around with, I'd love to give it a go.
18:05
Cool, thank you very much.