
How to avoid columnar calamities: what no one told you about Apache Parquet

Formal Metadata

Title
How to avoid columnar calamities: what no one told you about Apache Parquet
Title of Series
Number of Parts
69
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
If you're dealing with structured data at scale, it's a safe bet that you're depending on Apache Parquet in at least a few parts of your pipeline. Parquet is a sensible default choice for storing structured data at rest because of two major advantages: its efficiency and its ubiquity. While Parquet's storage efficiency enables dramatically improved time and space performance for query jobs, its ubiquity may be even more valuable. Since Parquet readers and writers are available in a wide range of languages and ecosystems, the Parquet format can support a range of applications across the data lifecycle, including data engineering and ETL jobs, query engines, and machine learning pipelines.

However, the ubiquity of Parquet readers and writers hides some complexity: if you don't take care, some of the advantages of Parquet can be lost in translation as you move tables from Hadoop, Flink, or Spark jobs to Python machine learning code.

This talk will help you understand Parquet more fully in order to use it more effectively, with an eye towards the special challenges that might arise in polyglot environments. We'll level-set with a quick overview of how Parquet works and why it's so efficient. We'll then dive into the type, encoding, and compression options available and discuss when each is most appropriate. You'll learn how to interrogate and understand Parquet metadata, and you'll learn about some of the challenges you'll run into when sharing data between JVM-based data engineering pipelines and Python-based machine learning pipelines. You'll leave this talk with a better understanding of Parquet and a roadmap pointing you away from some interoperability and performance pitfalls.
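
As a taste of the kind of metadata interrogation the abstract mentions, here is a minimal sketch using the Python pyarrow library (the file name example.parquet is a placeholder, not from the talk) that reads only a Parquet file's footer and prints the compression codec, encodings, and sizes of each column chunk in the first row group:

import pyarrow.parquet as pq

# Opening the file parses only the footer metadata, not the data pages.
pf = pq.ParquetFile("example.parquet")
meta = pf.metadata
print(f"rows={meta.num_rows}, row groups={meta.num_row_groups}, created by={meta.created_by}")

# Inspect each column chunk of the first row group: codec, encodings, and sizes.
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.compression, col.encodings,
          col.total_compressed_size, col.total_uncompressed_size)

Comparing total_compressed_size against total_uncompressed_size per column is one quick way to see which codec and encoding choices are actually paying off for a given dataset.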