The LDBC benchmark suite
Formal metadata

Title: The LDBC benchmark suite
Number of parts: 542
License: CC Attribution 2.0 Belgium: You may use, change and reproduce the work or its content for any legal purpose, and distribute and make it publicly available in unchanged or changed form, provided that you credit the author/rights holder in the manner specified by them.
Identifiers: 10.5446/62013 (DOI)
Transcript: English (automatically generated)
00:05
Hello HPC room. My name is Gábor Szárnyas. I work at CWI Amsterdam as a researcher and today I'm here on behalf of the LDBC. LDBC stands for the Linked Data Benchmark Council. We are a non-profit organisation founded in 2012 and we design
00:21
graph benchmarks and govern their use. Additionally, we do research on graph schemas and modern graph query languages, and everything we do is available under the Apache v2 license. Organizationally, LDBC consists of more than 20 companies. These are companies interested in graph data management. We have financial service providers, database vendors, cloud
00:43
vendors, hardware vendors and consultancy companies as well as individual contributors like me. So we design benchmarks, the first one being the LDBC social network benchmark which targets database systems. Let's go through this benchmark by a series of examples. I will touch on
01:01
data sets, queries and updates that we use in this benchmark. As the name social network benchmark suggests, we have a social network that consists of person nodes who know each other, with a degree distribution that mimics Facebook's real social network. The content that these people create is messages. These form little tree-shaped subgraphs and are
01:23
connected via author edges to the people. On this graph, we can run queries like the following: given a person, enumerate their friends and their friends of friends, get the messages that these people created, and then filter them based on some condition on their dates. So a
01:40
potential substitution could be that on this graph we are interested in this query for Bob, with the date set to Saturday. If we evaluate this query, we start with Bob, traverse the knows edges to Ada and Carl, then continue to Finn, Eve and Dan. We move along the author edges and then finally we apply the filter condition, which will cut message 3 and
02:03
will leave us messages 1, 2 and 4. Obviously, a social network is not a static environment; there are always changes. For example, people become friends: Eve and Gia may add each other as friends, which results in a new knows edge. That's simple enough. Gia can decide to create a message. This
02:22
message will be a reply to message 3. So we add a new node and connect it to the existing graph via two edges. The heavy-hitting updates are the deletes. A person may decide to delete their account, and that will result in a cascade of deletes. For example, if we remove the node Eve, that will result in the removal of their direct edges and the
02:44
messages they created. And in our social network, this will even trigger the deletion of whole message trees and, of course, all the edges that point to those messages. So this is quite a hard operation for systems to execute: it stresses their garbage collectors and it disallows certain append-only data structures. So if we want to view these three
03:04
components together, the dataset, the queries and the updates, we need a benchmark driver that schedules the operations to be executed. It runs the updates and the queries concurrently and, of course, it collects the results. The
03:21
benchmark is designed together with our members, who include the database vendors, and we go to great lengths to allow as many candidate systems as possible. So graph databases, triple stores and relational databases can all compete on this benchmark. Speaking of relational databases, some of you may wonder: is SQL sufficient to express these queries? The answer is that in most cases it is. So the
03:43
query that we have just seen can be formulated in a reasonably simple SQL query. It is a bit unwieldy, but it is certainly doable and the performance will be okay. However, this being a graph benchmark, it lends itself quite naturally to other query languages. There are two new query
04:01
languages that are coming out, and both of them adopted a visual graph syntax inspired by Neo4j's Cypher language. The first one is called SQL/PGQ, where PGQ stands for Property Graph Queries. This will be released this summer and, as you can see, it's an extension to SQL. So you can use SELECT and FROM, but it adds the GRAPH_TABLE construct, and the
04:22
query can be formulated in a very concise and readable manner. There is also GQL, the Graph Query Language, a standalone language that is going to be released next year and shares the same pattern matching language as SQL/PGQ. So the social network benchmark has multiple workloads to cover the diverse challenges that are created
04:44
by graph workloads. The first one, the older one, is the social network benchmark interactive workload. This is transactional in nature and it has queries like the one I have shown before. So these queries typically start in one or two person nodes. They are not very heavy hitting. They only touch on a limited amount of data. They have concurrent reads and updates and
05:04
systems are competing on achieving high throughputs. So this benchmark has been around for a few years and we have seen actually very good results. In the last three years, we witnessed an exponential increase in throughput, starting from a little above 5,000 operations per second to almost 17,000 operations per second this year. Our newer benchmark is the
05:24
social network benchmark business intelligence workload. This is analytical in nature and it has queries that touch on large portions of the data. For example, the query on this slide enumerates all triangles of friendships in a given country, which can potentially reach billions of edges and is a very difficult computational problem.
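To make the triangle query concrete, here is a small, hypothetical Python sketch. It is not the official LDBC BI implementation and nowhere near the billion-edge scale mentioned above; all names and data are made up. It enumerates triangles of mutual friends among people located in a given country:

```python
# Hypothetical sketch of the BI triangle query (toy data, invented names).
from itertools import combinations

# Person -> country, and undirected "knows" edges.
country = {"Ada": "NL", "Bob": "NL", "Carl": "NL", "Dan": "US"}
knows = {("Ada", "Bob"), ("Bob", "Carl"), ("Ada", "Carl"), ("Carl", "Dan")}

# Symmetric adjacency sets make the edge-membership test O(1).
adj = {}
for a, b in knows:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def triangles_in_country(c):
    """Yield each triangle of mutual friends within country c exactly once."""
    people = sorted(p for p, pc in country.items() if pc == c)
    for x, y, z in combinations(people, 3):  # x < y < z avoids duplicates
        if y in adj[x] and z in adj[x] and z in adj[y]:
            yield (x, y, z)

print(list(triangles_in_country("NL")))  # [('Ada', 'Bob', 'Carl')]
```

A real system would of course use an optimized join or intersection strategy rather than testing all candidate triples, which is exactly the kind of work the BI workload is meant to stress.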
05:44
Systems here are allowed to do either a bulk or a concurrent update approach, but they should strive to get both a high throughput and low individual query runtimes. This benchmark being relatively new, we only have a single result, so it's a bit difficult to put it into context, but it allows me to highlight one
06:01
thing. Many of our benchmarks use different CPUs; we actually have quite a healthy diversity in the CPUs. We have results with the AMD EPYC Genoa, like this one achieved by TigerGraph. We have results using Intel Xeon Ice Lakes and the Yitian 710s, which use an ARM architecture. We have more and larger
06:21
scale results expected this year and we are also quite interested in some graph and machine learning accelerators that are going to be released soon. So our benchmark process is quite involved. For each workload, we release a specification, we have an academic paper that motivates the benchmark, we have data generators, pre-generated data sets, as well as a benchmark driver and at least two
06:42
reference implementations. We do this because we have an auditing process that requires vendors who implement this benchmark to go through a rigorous test, and if they do so, they can claim that they have an official benchmark result. So we trademarked the term LDBC such
07:01
that the vendors have to go through these hoops of auditing, while we still allow researchers and developers to do unofficial benchmarks, but they have to say that this is not an official LDBC benchmark result. One other benchmark I would like to touch upon briefly is the graph analytics benchmark. This casts a wide net: it targets graph databases, graph processing frameworks, embedded
07:23
graph libraries like NetworkX and so on. This uses untyped, unattributed graphs, so it only uses the person–knows–person graphs of the social network benchmark or other well-known graphs like Graph500. We have six algorithms, many of them textbook algorithms like BFS, which just traverses
07:41
the graph from a given source node, or PageRank, which selects the most important nodes in the network. We also have clustering coefficient, community detection, connected components and shortest paths. This benchmark is a bit simpler to implement, and a new version is going to come out in spring 2023, so talk to us if
08:01
you're interested. So wrapping up, you should consider becoming an LDBC member, because members can participate in the benchmark design and have a say in where we go. They can commission audits of their benchmarks and they can also gain early access to the ISO standard drafts of SQL/PGQ and GQL that I have shown. It's free for individuals and has a yearly fee for
08:22
companies. So to sum up, these are our three main benchmarks. We have other benchmarks and many future ideas. If you're interested, please reach out. Okay, again, we have time for one question. Any
08:43
questions for Gábor? This is a newbie question; I'm not into graphs. Apart from advertisement optimisation, mass
09:03
surveillance, and perhaps content distribution (I don't know if those are the major applications, it's just what my naive mind comes up with), what other applications are these benchmarks meant to optimise?
09:20
So the big one this year is supply chain optimisation: strengthening supply chains, ensuring that they are ethical, ensuring that they are not passing through conflict zones. It's something that is very important these days. You can also track CO2 emissions and other aspects of labour and
09:42
manufacturing. So that's certainly a big one, and that's something that we have seen. And there are, of course, all the classic graph problems like power grids, a lot of e-commerce problems, and financial fraud detection, which is going to be part of our financial benchmark this year.
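As a toy illustration of why supply-chain questions like the ones mentioned in the answer are graph problems, here is a hypothetical Python sketch. It models shipments as a directed graph and uses a breadth-first search to check whether goods can flow from a supplier to a factory without passing through any flagged node (for example, a region on a conflict-zone list). All names are invented for the example:

```python
# Hypothetical sketch: supply chains as graph reachability (invented data).
from collections import deque

# Toy directed "ships-to" graph.
ships_to = {
    "MineA": ["PortX", "PortY"],
    "PortX": ["Factory"],
    "PortY": ["Factory"],
    "Factory": [],
}

def reachable_avoiding(src, dst, blocked):
    """BFS from src to dst that never enters a blocked node."""
    if src in blocked:
        return False
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in ships_to.get(node, []):
            if nxt not in blocked and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(reachable_avoiding("MineA", "Factory", blocked={"PortX"}))           # True: via PortY
print(reachable_avoiding("MineA", "Factory", blocked={"PortX", "PortY"}))  # False: both routes flagged
```

Real supply-chain analyses run queries like this over graphs with many attributes per node and edge, which is the kind of workload the benchmarks above are designed to measure.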