Deep dive into an Elasticsearch plugin for query-time joins

Zitieren

Zugehöriges Material

Plain Schwarz

Campinas, Stéphane

Formale Metadaten

Titel

Deep dive into an Elasticsearch plugin for query-time joins

Serientitel

Berlin Buzzwords 2023

Anzahl der Teile

Autor

Campinas, Stéphane

Mitwirkende

N. N. (Moderation)

Lizenz

CC-Namensnennung 3.0 Unported:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen.

Identifikatoren

10.5446/66618 (DOI)

Herausgeber

Plain Schwarz

Erscheinungsjahr

2023

Sprache

Englisch

Inhaltliche Metadaten

Fachgebiet

Informatik

Genre

Konferenz/Talk

Abstract

Data are often at the basis of critical decisions in many different sectors, ranging from e-commerce to cyber-security: machine and search logs, user information, metrics over transactions, etc. Data in such domains are by nature voluminous and inter-connected. Analytics systems are expected not only to search and aggregate those data, but also to join them: join operations are often necessary in order to explore inter-connected data and get insights from them. Analysts often interact with such systems by following an explorative and iterative process that represents their train of thoughts. Such systems then must have fast response times to avoid impeding the mental process of the analysts. Whilst Elasticsearch is a fantastic high performance analytics engine, it presents some limitations in certain cases when it comes to joining data from different indices at query-time. In this talk, we will present Siren’s ten years-long effort in implementing distributed joins on top of Elasticsearch. We will introduce Siren Federate – our Elasticsearch plugin that provides query-time join capabilities over indices – and we will discuss some of the challenges we had to tackle during its development. We will begin by describing how joins are performed by Federate, from the reception of a query till the computation of its results. Then we will show the importance of caching join results for performance, and how a cache can be efficiently implemented. Talking about performance, we will explain the benefits of adopting a vectorized data processing model by showing some experimental results. To conclude, we will discuss the importance of the expressiveness of a query language by illustrating the Federate DSL and how it integrates with some advanced features of Elasticsearch such as runtime fields.