
#bbuzz: Building large scale, transactional data lakes using Apache Hudi

Formal Metadata

Title
#bbuzz: Building large scale, transactional data lakes using Apache Hudi
Number of Parts
48
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
With the proliferation of data over the past years, most business-critical decisions are heavily influenced by deep data analysis. As companies rely more on data for their functioning, storing, managing, and accessing data intelligently and efficiently is more important than ever before. And as more business decisions are driven by data in real time, we require strong guarantees such as acceptable latencies, high data quality, and system reliability. Moving from a full-reload model to a delta model of ingestion quickly became the primary way to ingest large amounts of data at scale, and a number of such ingestion patterns showed how transaction support on these datasets could benefit use cases immensely. Hudi, an Apache project, aims to introduce uniform data lake standards. Hudi is a storage abstraction library that uses Spark as an execution framework.

In this talk, we will discuss how Hudi can provide ACID semantics to a data lake. We will cover basic primitives such as upsert and delete, which are required to achieve acceptable ingestion latencies while providing high-quality data by enforcing schematization on datasets. We will also discuss more advanced primitives such as restore, delta pull, compaction, and file sizing, which are required for reliability, efficient storage management, and building incremental ETL pipelines. We will dig deeper into Hudi's metadata model, which allows for O(1) query planning, and show how it supports time-travel queries to facilitate building feature stores for machine-learning use cases.

Apache Hudi builds on open-source file formats; we will discuss how to easily onboard your existing dataset to the Hudi format while keeping those same open formats, so you can start using all the features Hudi provides without making drastic changes to your data lake. We will talk about the challenges faced in productionizing large Spark-based Hudi jobs at scale at Uber and how we addressed them. Finally, we will make the case for the future, discussing other primitives that will facilitate building rich and portable data applications.
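To give a flavor of the upsert and incremental-pull primitives mentioned above, the sketch below uses Hudi's Spark datasource API in Scala. It is a minimal, illustrative example rather than code from the talk: the table name (trips), base path, input path, and the record-key and precombine fields (uuid, ts) are all assumptions.

    // Minimal sketch of a Hudi upsert followed by an incremental pull.
    // All paths, table names, and field names are illustrative.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("hudi-sketch")
      // Hudi requires Spark's Kryo serializer.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    val basePath = "/tmp/hudi/trips"
    // Assumed input: JSON records with uuid (record key) and ts (event time) columns.
    val df = spark.read.json("/tmp/input/trips")

    // Upsert: records whose key already exists are updated, others are inserted.
    df.write.format("hudi")
      .option("hoodie.table.name", "trips")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.operation", "upsert")
      .mode(SaveMode.Append)
      .save(basePath)

    // Incremental pull: read only records committed after a given instant,
    // the building block for incremental ETL pipelines.
    val incremental = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20210101000000")
      .load(basePath)
    incremental.show()

Because every commit is recorded as an instant on Hudi's timeline, the same begin-instant mechanism shown here is what enables downstream consumers to process only the changes since their last run instead of re-scanning the full dataset.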