Raphtory: Streaming analysis of distributed temporal graphs
Formal Metadata
Title: Raphtory: Streaming analysis of distributed temporal graphs
Title of Series: FOSDEM 2020
Number of Parts: 490
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46976 (DOI)
FOSDEM 2020, 80 / 490
Transcript: English (auto-generated)
00:05
Oh, so hello, my name's Ben. I'm over at Queen Mary University of London, and I'm gonna be talking about Raphtory, a system that my two advisors and I have been building, which is looking at distributed temporal graphs and how we're building and maintaining them from a set of event streams.
00:21
So I just wanted to give you a little bit of an idea of why we got to this point. So kind of some of the original distributed graph processing systems had this idea that you've got a big chunk of data on disk, you've got some chosen algorithms, so you load it in, turn it into your graph, you churn through in a couple of iterations and out pops your result. And then if you, say, want to see how things have changed throughout time, you might have snapshots, you know,
00:41
once a day for the last six months. And again, you load all these in, build them into graphs, you get a set of outputs, and then you can do some deltas between them to see how things have changed. And that's quite coarse, and you know, if you only have snapshots once every day, then you kind of lose what happens in between and so on. This has kind of improved with these stream-based graph processing systems
01:01
where you have some event source out in the wild. So some of the examples we've been looking at are like cryptocurrencies, mapping data, so people moving around cities, and obviously social networks. And then changes in these event sources can then affect your in-memory graph. So in the case of a social network, you might have a user joins the network, someone follows their friend, and so on. So these can all be inserted in.
01:22
And then users of your system can then query this, they request processing and get their results back. So this is great if you want to do some analysis on the most recent version of the graph, or alternatively, if you've got some metric that you're interested in monitoring and then seeing how this changes over time. So what we were thinking was, well, if you've got all these changes coming in, and you have all these problems with trying to keep your graph in sync
01:41
and up to date, why don't we just try to keep all of the changes and build a full temporal graph? So this could, in some ways, simplify the way that we actually synchronize, but then also allow us to do things like comparing the newest state to all of the previous versions of the state, and then actually do proper temporal queries. So something like, if you're doing the shortest path, it might be, I only want to go out on edges that are younger than the one I come in on,
02:01
or alternatively, you might have, say, for planes flying around, edges that only exist for a certain period of time, and you need to get there and get on that edge before it disappears. So with those ideas in mind, we came up with Raphtory. Our initial work was on formalizing this temporal graph model and the update semantics, so how we add and remove vertices and edges,
02:22
as well as updating a set of properties that they have, so a key-value set of properties associated with them; how we actually distribute and manage this graph in memory, so we have a set of partitions which hold a set of vertices and edges each, and then how we stream all of these updates into these partitions and keep them in sync; and then we also provide this sort of Pregel-like temporal graph analysis model
02:41
in which the user can request to do some analysis on the live graph, or at any point back in time, down to the resolution of the actual timestamps on the data, so you could say, what did it look like last Thursday at 3:02 in the afternoon, and then actually look through ranges, so sort of hop throughout the whole history of the graph and compute these different metrics and see how they change. So I'll just go over a quick run through the architecture.
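As a rough illustration of that query model, here is a minimal Python sketch (Raphtory itself is JVM-based, built on Akka, so this is purely illustrative), treating the ingested history as a flat list of (timestamp, update) events:

```python
# Minimal sketch of the query model, assuming the ingested history is a flat,
# chronologically sorted list of (timestamp, update) events with integer times.

def view(events, t):
    """Graph state as of time t: keep everything up to and including t."""
    return [e for e in events if e[0] <= t]

def window(events, t, w):
    """Only the activity that happened in the span (t - w, t]."""
    return [e for e in events if t - w < e[0] <= t]

def hop(events, start, end, step):
    """Range query: one view per step, hopping across [start, end]."""
    return [view(events, t) for t in range(start, end + 1, step)]

events = [(4, "add v1"), (8, "add v2"), (14, "add edge v1->v2")]
snapshot = view(events, 10)        # the graph as it looked at time 10
recent = window(events, 14, 5)     # only activity in (9, 14]
sweep = hop(events, 4, 10, 2)      # views at times 4, 6, 8, 10
```

The real system evaluates these filters lazily against per-entity histories rather than scanning a global event list, but the semantics are the same.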
03:02
So over here we have this data spout, so this is how the user decides how to connect to the outside world, so this is something like: I wanna read this file, connect to this database, listen to this Kafka stream, or something along these lines. This goes into a set of graph routers; effectively, what a router does is take a user-defined function describing what that raw input translates into
03:20
in terms of graph updates, so what is a vertex, what is an edge, whether something that comes in is actually an update to a property, and so on, and then it forwards it off to the correct partition manager, the partition of the graph which deals with the vertex or edge that is affected. And then as this is constantly running and maintaining the graph, users can submit analysis requests which talk to the partitions, so I'll go onto that in a second.
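A sketch of that spout-to-router step, with a made-up record format; the CSV layout, the field names, and the partition count are all assumptions for illustration, not Raphtory's actual API:

```python
import zlib

N_PARTITIONS = 4  # assumed cluster size for illustration

def parse_follow(raw):
    """Hypothetical user-defined router function: turn one raw record from a
    spout (here an imagined 'timestamp,src,dst' CSV line from a social-network
    feed) into a graph update."""
    ts, src, dst = raw.strip().split(",")
    return {"type": "edge_add", "time": int(ts), "src": src, "dst": dst}

def partition_for(update, n=N_PARTITIONS):
    """Route an edge update to the partition owning its source vertex.
    crc32 is used instead of built-in hash() so routing is stable across runs."""
    key = update.get("src") or update["id"]
    return zlib.crc32(key.encode()) % n

update = parse_follow("1517270400,alice,bob")
target = partition_for(update)   # the partition manager this update goes to
```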
03:42
So if we dive into one of these partitions, they'll have a set of vertices and edges as I said, and all of these will then have some history appended to them, so in this case, this vertex was created at time eight, then had some update appended to it at time 14, and this edge was created at time 14, at the same time as that vertex update, and was then deleted at some point later on.
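A minimal sketch of one such per-entity history, assuming it is a plain list kept sorted by timestamp, so that updates arriving out of order simply slot into their correct chronological position:

```python
import bisect

def insert(history, t, event):
    """Slot an update into its chronological position, whatever order it
    actually arrived in over the network."""
    bisect.insort(history, (t, event))

def exists_at(history, t):
    """The latest event at or before t decides whether the entity exists."""
    state = None
    for ts, ev in history:
        if ts > t:
            break
        state = ev
    return state == "add"

# The out-of-order case from the talk: the deletion arrives before the creation.
h = []
insert(h, 20, "remove")   # delete seen first
insert(h, 5, "add")       # late-arriving create slots in ahead of it
```

After both arrive, the history reads `[(5, "add"), (20, "remove")]`, so the entity is alive at time 10 and gone at time 25, exactly as if the updates had arrived in order.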
04:00
As we're split across several partitions, we use an edge partitioning algorithm, so this edge down here, because vertex one and vertex two are on different machines, is actually split across the two machines, and you can see the two halves are kept in sync. Right, so one thing that's really interesting about this type of history is that now all of our updates kind of become additive, so even if we have a delete happen first
04:21
and an add happen after, as long as we keep this chronological list, we can just slot them into the correct position, so you always end up with the right graph. So this is kind of nice, because if you have this problem of updates coming in the wrong order, in a lot of other systems you either have to drop them, ignore them, or you get an incorrect state in one of your partitions. So as an example of this, we may have, say, this edge add that comes in at time 14,
04:42
which partition manager one deals with because vertex one is the source node. So we insert that into the machine, the edge gets created, and vertex one gets updated. We then synchronize across to the other node, saying, hey, I've got an edge that I share with you, so that gets updated into vertex two, and the edge gets created there as well. So that's all fantastic,
05:01
everything's happened in the correct order, everything's brilliant. What happens if, say, an edge gets added before we get a vertex? Well, in this case, we can create both objects, the vertex actually just becomes a placeholder, and then, again, we synchronize and do everything exactly the same, and then if the vertex add comes in at a later point, we just slot that into the history behind,
05:20
so then if this comes in with all the properties, all this sort of interesting metadata about that vertex, it can be inserted at that point. And then obviously, things can go completely haywire, so for some reason, some packets have been lost, or the network's gone all over the place, and in this case, this vertex has actually been deleted before it's even been created. So in a lot of systems, you might find that this is just, okay, well, this is nonsense, let's just drop it and ignore it,
05:40
and that's obviously not what we wanna do here. So again, we have a placeholder object which holds its deletion. When the edge add in this instance comes into the other machine, it does what it needs to in that machine, and then it synchronizes across, at which point the vertex now gets its creation at the correct point, and we can insert this edge, and then because this vertex was deleted, all of its incoming and outgoing edges should be deleted,
06:01
so we don't have anything hanging, and that can then synchronize back. So even though this went sort of completely wrong, you still end up with the same state and the same temporal graph moving forward. We've put in some watermarking, so you know when it gets to the point where this is safe to do, or if you want, you can go with the approximate approach of: just give me what's in memory now. So on that point, I know this is a bit of a whistle-stop,
06:22
but we'll pop onto the analysis. So the general idea is that the routers are constantly ingesting new information from whatever source you specified, assuming it's unbounded, and then the partition managers constantly keep in sync with each other and wait for requests from an analysis manager. So the user says, hey, I wanna run this analysis. Can I submit it? So this goes off.
06:41
All the partition managers will then go through their set of vertices, run this sort of vertex-centric algorithm, and then return to the analysis manager. Analysis manager can say, okay, well, all my vertices have either decided to vote to halt or another iteration is required, and this will go back and forth until it's happy that it's finished and the result can be returned to the user. So what can the user actually request?
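The back-and-forth just described (partitions run a vertex-centric step, vertices vote to halt, the analysis manager iterates until everyone has halted) might be sketched like this, with `countdown` standing in as a toy stand-in for a real user-defined algorithm:

```python
def run_analysis(partitions, vertex_step):
    """BSP-style loop: every partition applies a vertex-centric step, each
    vertex votes to halt, and the analysis manager keeps iterating until all
    vertices have halted. `vertex_step(state) -> (new_state, halted)` is a
    hypothetical stand-in for a user-defined vertex program."""
    while True:
        all_halted = True
        for part in partitions:
            for v in part:
                part[v], halted = vertex_step(part[v])
                all_halted = all_halted and halted
        if all_halted:
            return partitions

# Toy vertex program: count each vertex's value down to zero, then halt.
def countdown(state):
    new_state = max(state - 1, 0)
    return new_state, new_state == 0

# Two partitions holding vertex ids 1, 2 and 3 with some initial state each.
result = run_analysis([{1: 3, 2: 1}, {3: 0}], countdown)
```

In the real system each superstep also exchanges messages between vertices across partitions; the sketch keeps only the iterate-until-all-halt control flow.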
07:02
Well, the first thing is that if we have this temporal graph in memory, we can say, okay, well, give me what the live graph looks like. So this is the most recent version of the graph, either watermarked, as I said, so this is kind of the safe live graph, or alternatively, you could ask for the bleeding edge, absolute most recent version
07:20
in which you have some margin of approximation, so it depends on what sort of use case you have there. Alternatively, you might say, okay, well, give me what it looked like last week, last month, a year ago, something like this, and these tend to be stored in memory, so we'll build that view, and I'll go over that in a second. Or alternatively, if Raphtory's been running for a very long time and you start to have to push the older stuff out of memory, then we can start loading some of those things back
07:41
if you wanna go back that far. We're also looking for ways of sort of offloading very old queries into a different set of partition managers, so that query sort of doesn't interrupt what's going on in the most recent version of the graph, but that obviously is future work. Cool, so if we then, say, we have this full history of everything that's been ingested
08:00
from time zero up until time N, the newest update, we might say, okay, well, I wanna see what the graph looked like at T10. And a good way of thinking about what a view is: it's kind of like a right-hand filter, so you say, okay, well, this is everything that's happened, I'm not interested in anything that happened after this point, so let's just get rid of that for the moment, so that you kind of get to see now
08:21
what the graph looked like exactly at that point of time, and then that can be used for your analysis. But one of the things we found was that if you're looking at very large data sets that have existed for years and years, then there's an awful lot of patterns that happen in the short term that kind of get hidden by this huge aggregate amount of data. So we added in something
08:40
we like to call graph windowing, which kind of is like the left-hand filter, and in this case, you're saying, okay, well, I'm only interested in things that have happened in this span of time, so it could be from this timestamp for the last day, the last week, the last month, and so on, and so then you can actually view these short-term patterns as well as long-term ones, and so for that, we actually offer windowing batches, so you can say, okay, well, start at this point,
09:01
and then just decrease the size of the window continuously until you've done all the ones you're interested in. As well as these individual views, you might actually say, okay, well, I'm interested in this sort of range of time, so over the last year or something, and I wanna hop through at some set interval, maybe an hour or a day, and again, you can do this,
09:22
so we can say build a view at the oldest point, time four, and then we can hop forward to time six, and a view is generated, then time eight and time ten. So again, if you're doing these ranges, you can have all this windowing and window batching on top as well. So that's obviously all very theoretical, and a concrete use case would be very nice,
09:42
as I imagine a lot of you are thinking. So one of the things we had looked at was a network called gab.ai. Has anyone heard of Gab? Good, I wouldn't think so. So Gab is a Twitter clone, kind of this right-wing forum, but they had an open REST API, so I downloaded all of their posts,
10:01
so they're like the big free speech thing, so I don't know, I assume that's why it's open, but so I scraped them all between the end of 2016 up until mid-2018, and we then had a look at, if we set a query running for that whole range of time and hopped forward an hour at a time, what do we see any changes in something like, just something simple like the largest connected component, so for this, we then set several different window sizes,
10:21
so we said, okay, have a look at a very small window, like an hour, a day, a week, a month, a year, and then the full aggregate graph, and the interesting thing here is that even though you're running kind of the same algorithm, you actually notice very different patterns, so the aggregate kind of shows all the connected component continuously grows, whereas actually if you look at something like the month, you have these peaks of interest, so this is actually Donald Trump's election,
10:41
this is the Charlottesville riots, so there's kind of like these peaks of activity when people join the network and start using it, and then that sort of drops down again, and then if we zoom in a little bit further down to the hour scale, you can see that everything above like a window size of a day, the largest connected component is always like 100% of the graph, so everyone's always connected, it doesn't really change very much,
11:01
however, for an hour, you get this lovely diurnal pattern as people kind of go to sleep and wake back up, so the largest connected component, so it's like 80% of the graph, so almost everyone is connected, but then as people start to go to sleep, this all breaks down into very small communities that are talking to each other in the wee hours, and then that brings back up again as people start coming back online, so yeah, so even doing the same query,
11:22
running on these sort of different lenses or views of the graph gives you very different results, so we're kind of starting to explore this a little bit now, and we're obviously interested in anyone that wants to talk about this sort of stuff. So on that point, if you are interested in using Raphtory, it is available on GitHub, it's all Dockerized,
11:40
and has some actually pretty dreadful scripts to run it, but I'm working on improving those, so yeah, so you can run it in there, there's examples of the actual GAB graph that I just went over, we've got loads of spouts for ingesting different data, so GAB, Twitter, Bitcoin, Ethereum, and loads of other random ones, we have actually ingested the whole Bitcoin and Ethereum graphs over a big cluster of machines,
12:01
and are working with a couple of different companies to do some know-your-customer entity resolution stuff. And so yeah, we also have multiple analysis functions, so things like connected components and PageRank, we're looking at information diffusion, so this is like spreading taint across cryptocurrencies, and then simple things like degree ranking and so on.
12:21
So for the future of Raphtory, we've just been funded by the Alan Turing Institute in London, for any of you guys that know it, to turn this from kind of the initial researchy project into an actual product that researchers can use, so we're partnering with Leeds University to look at some very large transport data sets, mapping people moving around cities,
12:41
and then see how that changes over time, if the council does something like they put in a pelican crossing, how long do they have to monitor to see sort of different changes in foot traffic, and also we're now spinning this out of Queen Mary into a company called Corograph, so if you do see this name pop up, then it's probably me, or not someone trying to steal my name, but yeah, if you are interested, please drop me a line,
13:02
or leave anything on the, I'm always on there, so thank you very much for listening. Yeah. Can you achieve performance improvements
13:21
by like taking snapshots? So the question is, can we achieve performance improvements by taking snapshots? So do you mean for the actual processing side, or for the ingestion side of things? For the processing side. So on the processing side, we're looking at this a little bit,
13:41
so all the in-memory stuff, when you build a view, all of the previous versions are already there, so you just go to the vertex and pull what it would look like at that point in time, so you'd filter initially, and then you can use that vertex as it exists in memory already. For the stuff that's pulled back in from disk,
14:00
we're having a look at different snapshotting slash replay mechanics to make sure that they work properly, so there'll be kind of this idea of, okay, every X minutes, you take a snapshot, and then you have the replay of messages using Akka, sort of the message replay, to say, okay, let's get back to exactly the point we're interested in, and then what's the sort of heuristics around that, so that's kind of the next step
14:20
that we're looking at at the moment, but yeah, for everything that's in memory, it works pretty much as is. If I've correctly understood you, you were thinking that I have one starting network, and a constant stream of messages coming in from a single point of truth. Now, if you have a simulation environment
14:42
where multiple people want to make different modifications of the network, and look at variants of your network, is that something that you can handle? So, we're having a look at perhaps some sort of vector clock implementation, or something where you get different timestamps from all over the place coming in. For the moment, our assumption is that
15:01
if you're attaching to an outside source, the timestamps coming in are the timestamps we're using. So for a lot of the sources, say for example the social networks, that's done within their servers. For the cryptocurrencies, that's done when the block is published. So, for most of our use cases, we've kind of focused on that. It'd be really interesting to see how we would do it, but I think for the moment,
15:20
it's not gonna be so much of a. My idea would be maybe you model some production, somebody says, okay, what happens if I place a production order here, and another person sitting in another office says, what happens if the. Oh, I see what you mean. And of course, this gives you variants of your production network. And if you don't want to keep full copies.
15:40
Ah, yeah, no. You want to keep these differences somehow. Ah, so yeah, so perhaps some sort of rollback feature would be if a sort of update that come in that shouldn't have come in was. So yeah, so we could, I don't know how we'd do it at the moment, but it's definitely something to consider actually. Yeah, or I'll put in my notes of things
16:02
to add in at some point. Yeah, no, thank you very much. Any more questions? Okay. What's the biggest challenge of implementing graph algorithms on these dynamic graphs? Or are there any things that you see from, like, because you have this kind of constant flow of data into the graph,
16:22
so is there anything that you did differently from implementing graph algorithms than you did here on a static graph? No, so at the moment, the kind of view is that when you build, if you're building a view which is kind of safe, so either it's been watermarked or is a previous point in time, then it's kind of just a static graph and that's fine. When it comes to, so if you have got like, you're doing analysis,
16:41
so obviously it runs in parallel, but if you're doing analysis on the most recent version, you do have some degree of approximation. We're having a look at like, if we can kind of work out what that degree is, or something around that, but we've only really just started working on the analysis the last sort of six to nine months. So it's definitely the next sort of frontier of it for sure.
17:01
Could you use this for pathfinding in dynamic environments, things like for maybe like self-driving cars, but also maybe like in games that have dynamic environments? Or is it too slow for that? So I guess the question is on,
17:21
can you use it for pathfinding in dynamic environments? I think it depends on the sort of speed at which you're interested in. And of course the size of the graph. So I think you probably could, if you're going for like a, if you're interested in maybe around a couple of hundred milliseconds, I don't know if you're interested in sort of
17:41
proper real time and it needs to be sort of microsecond or sub-millisecond, it probably wouldn't return fast enough. But then it's been more optimized for general queries throughout time, like the ones we're showing, so you kind of chunk it in, you leave it running, and it kind of goes forever. Again, if we had the data to play around with, I'd love to give it a go.
18:05
Cool, thank you very much.