Blazing-Fast Serverless MapReduce Indexer for Apache Solr

Formal Metadata

Title
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Title of Series
Number of Parts
64
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Indexing data from databases into Apache Solr has long been an open problem: for a while, the Data Import Handler was widely used even though it was not recommended for production environments. Traditional indexing processes often run into scalability limits, especially with large datasets. In this talk, we explore the architecture and implementation of a serverless MapReduce indexer designed for Apache Solr but extensible to any search engine. By embracing a serverless approach, we can take advantage of the elasticity and scalability of cloud services such as AWS Lambda and index efficiently without managing infrastructure. We dig into the principles of MapReduce, a programming model for processing large datasets, and discuss how it can be adapted to index documents into Apache Solr. Using AWS Step Functions to orchestrate multiple Lambda functions, we demonstrate how to distribute indexing tasks across multiple resources, processing them in parallel and significantly reducing indexing times. Through practical examples, we address key considerations such as data partitioning, fault tolerance, concurrency, and cost. We also cover integration points with other AWS services, such as Amazon S3 for data storage and retrieval and DynamoDB for distributed locking between Lambda instances.
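
The MapReduce shape described in the abstract can be illustrated with a minimal local sketch. This is not the speakers' implementation: the function names (`partition`, `map_chunk`, `reduce_results`, `run_job`), the document fields, and the use of a thread pool to stand in for parallel Lambda invocations are all assumptions made for illustration. No AWS or Solr calls are made; a real mapper would read rows from the database and post documents to Solr.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List


def partition(ids: List[int], size: int) -> List[List[int]]:
    """Split the record IDs into fixed-size chunks, one per mapper."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def map_chunk(chunk: List[int]) -> Dict[str, int]:
    """Hypothetical mapper: build Solr-style documents for one chunk.

    Here the documents are only built in memory and counted; sending
    them to a search engine is deliberately left out of this sketch.
    """
    docs = [{"id": str(i), "title_s": f"record {i}"} for i in chunk]
    return {"indexed": len(docs), "failed": 0}


def reduce_results(results: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Reducer: merge the per-chunk counters into a single summary."""
    total = {"indexed": 0, "failed": 0}
    for result in results:
        total["indexed"] += result["indexed"]
        total["failed"] += result["failed"]
    return total


def run_job(ids: List[int], chunk_size: int, max_workers: int = 4) -> Dict[str, int]:
    """Run mappers in parallel (threads stand in for Lambdas), then reduce."""
    chunks = partition(ids, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return reduce_results(pool.map(map_chunk, chunks))


if __name__ == "__main__":
    print(run_job(list(range(1, 1001)), chunk_size=100))
```

In an orchestrated setup, the partition step would produce the list of work items, each mapper would run as its own function invocation, and the reducer would aggregate the results once all mappers finish; the chunk size then becomes the main knob for trading off concurrency against per-invocation cost.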