Blazing-Fast Serverless MapReduce Indexer for Apache Solr

Zitieren

Zugehöriges Material

Plain Schwarz

Antuzi, Daniele

Formale Metadaten

Titel

Blazing-Fast Serverless MapReduce Indexer for Apache Solr

Serientitel

Berlin Buzzwords 2024

Anzahl der Teile

Autor

Antuzi, Daniele

Mitwirkende

N. N. (Moderation)

Lizenz

CC-Namensnennung 3.0 Unported:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen.

Identifikatoren

10.5446/70219 (DOI)

Herausgeber

Plain Schwarz

Erscheinungsjahr

2024

Sprache

Englisch

Inhaltliche Metadaten

Fachgebiet

Informatik

Genre

Konferenz/Talk

Abstract

Indexing data from databases to Apache Solr has always been an open problem: for a while, the data import handler was used even if it was not recommended for production environments. Traditional indexing processes often encounter scalability challenges, especially with large datasets. In this talk, we explore the architecture and implementation of a serverless MapReduce indexer designed for Apache Solr but extendable to any search engine. By embracing a serverless approach, we can take advantage of the elasticity and scalability offered by cloud services like AWS Lambda, enabling efficient indexing without needing to manage infrastructure. We dig into the principles of MapReduce, a programming model for processing large datasets, and discuss how it can be adapted for indexing documents into Apache Solr. Using AWS Step Functions to orchestrate multiple Lambdas, we demonstrate how to distribute indexing tasks across multiple resources, achieving parallel processing and significantly reducing indexing times. Through practical examples, we address key considerations such as data partitioning, fault tolerance, concurrency, and cost. We also cover integration points with other AWS services such as Amazon S3 for data storage and retrieval, as well as DynamoDB for distributed lock between the lambda instances.