We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Deep dive into an Elasticsearch plugin for query-time joins

Formal Metadata

Title
Deep dive into an Elasticsearch plugin for query-time joins
Title of Series
Number of Parts
60
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Data are often at the basis of critical decisions in many different sectors, ranging from e-commerce to cyber-security: machine and search logs, user information, metrics over transactions, etc. Data in such domains are by nature voluminous and inter-connected. Analytics systems are expected not only to search and aggregate those data, but also to join them: join operations are often necessary in order to explore inter-connected data and get insights from them. Analysts often interact with such systems by following an explorative and iterative process that represents their train of thoughts. Such systems then must have fast response times to avoid impeding the mental process of the analysts. Whilst Elasticsearch is a fantastic high performance analytics engine, it presents some limitations in certain cases when it comes to joining data from different indices at query-time. In this talk, we will present Siren’s ten years-long effort in implementing distributed joins on top of Elasticsearch. We will introduce Siren Federate – our Elasticsearch plugin that provides query-time join capabilities over indices – and we will discuss some of the challenges we had to tackle during its development. We will begin by describing how joins are performed by Federate, from the reception of a query till the computation of its results. Then we will show the importance of caching join results for performance, and how a cache can be efficiently implemented. Talking about performance, we will explain the benefits of adopting a vectorized data processing model by showing some experimental results. To conclude, we will discuss the importance of the expressiveness of a query language by illustrating the Federate DSL and how it integrates with some advanced features of Elasticsearch such as runtime fields.