Running petabyte-scale columnar stores has become routine, but operating a petabyte-scale search system is still challenging. Enter Kaldb, an open-source, serverless Lucene serving system designed specifically for petabyte-scale Lucene workloads. We designed Kaldb to automate away operational toil without sacrificing performance or reliability. Designing a serverless Lucene system at this scale poses several unique challenges, such as ensuring data durability, modifying replication and caching protocols for high availability, serving high-fanout reads, and managing ephemeral nodes. In this talk, we'll delve into how our redesigned Kaldb system overcomes these challenges. We separated durability of the data from storage, separated compute from storage, modified replication algorithms to handle ephemeral nodes, used Kafka as a write-ahead log, and developed a novel query execution layer to handle high-fanout queries. Our implementation not only reduces operational toil but also adds several self-healing properties to the system. Kaldb currently runs on Kubernetes at petabyte scale with improved reliability and performance. Join us to learn how Kaldb can help you overcome the challenges of running a petabyte-scale Lucene serving system. We'll share our experiences, best practices, and lessons learned in designing and operating a serverless Lucene serving system at this scale, along with practical techniques you can use to optimize your own search systems.