Blazing-Fast Serverless MapReduce Indexer for Apache Solr

Plain Schwarz

Antuzi, Daniele

Formal Metadata

Title

Blazing-Fast Serverless MapReduce Indexer for Apache Solr

Title of Series

Berlin Buzzwords 2024

Number of Parts

Author

Antuzi, Daniele

Contributors

N. N. (Moderation)

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/70219 (DOI)

Publisher

Plain Schwarz

Release Date

2024

Language

English

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

Indexing data from databases into Apache Solr has long been an open problem: for a while, the Data Import Handler was widely used even though it was not recommended for production environments. Traditional indexing processes often run into scalability limits, especially with large datasets. In this talk, we explore the architecture and implementation of a serverless MapReduce indexer designed for Apache Solr but extensible to any search engine. By embracing a serverless approach, we can take advantage of the elasticity and scalability of cloud services such as AWS Lambda, enabling efficient indexing without managing infrastructure. We dig into the principles of MapReduce, a programming model for processing large datasets, and discuss how it can be adapted to indexing documents into Apache Solr. Using AWS Step Functions to orchestrate multiple Lambda functions, we demonstrate how to distribute indexing tasks across multiple resources, achieving parallel processing and significantly reducing indexing times. Through practical examples, we address key considerations such as data partitioning, fault tolerance, concurrency, and cost. We also cover integration points with other AWS services, such as Amazon S3 for data storage and retrieval and DynamoDB for distributed locking between Lambda instances.
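
The MapReduce pattern described in the abstract can be illustrated with a small local simulation. This is only a sketch under stated assumptions, not the speaker's implementation: threads stand in for Lambda invocations, the map step counts documents instead of sending them to Solr, and all function names and the `title_s` field are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List


def partition(records: List[dict], size: int) -> List[List[dict]]:
    """Split the input into fixed-size partitions, one per simulated worker."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_partition(rows: List[dict]) -> Dict[str, int]:
    """Map step: turn rows into search documents and 'index' them.

    A real worker would send `docs` to the search engine here; this
    sketch only counts them so it runs without any external service.
    """
    docs = [{"id": str(row["id"]), "title_s": row["title"].strip()} for row in rows]
    return {"indexed": len(docs)}


def reduce_results(results: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Reduce step: merge the per-partition summaries into one report."""
    return {"indexed": sum(r["indexed"] for r in results)}


def run_job(records: List[dict], partition_size: int = 2, max_workers: int = 4) -> Dict[str, int]:
    """Fan the partitions out to parallel workers, then combine the results."""
    parts = partition(records, partition_size)
    # Threads are an assumption standing in for parallel Lambda invocations.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(map_partition, parts))
    return reduce_results(results)


if __name__ == "__main__":
    rows = [{"id": i, "title": f" doc {i} "} for i in range(5)]
    print(run_job(rows))
```

In a deployment like the one the abstract outlines, the fan-out would presumably be handled by the orchestrator rather than a thread pool, and partition size would trade per-invocation overhead against parallelism and cost.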