
Using Pandas and Dask to work with large columnar datasets in Apache Parquet

Formal Metadata

Title
Using Pandas and Dask to work with large columnar datasets in Apache Parquet
Title of Series
Number of Parts
132
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Apache Parquet Data Format

Apache Parquet is a binary, efficient columnar data format. It uses various techniques to store data in a CPU- and I/O-efficient way, such as row groups, compression of pages within column chunks, and dictionary encoding for columns. Index hints and statistics that allow chunks of irrelevant data to be skipped quickly enable efficient queries on large amounts of data.

Apache Parquet with Pandas & Dask

Apache Parquet files can be read into Pandas DataFrames with two libraries: fastparquet and Apache Arrow. While Pandas is mostly used to work with data that fits into memory, Apache Dask allows us to work with data larger than memory and even larger than local disk space. Data can be split up into partitions and stored in cloud object storage systems such as Amazon S3 or Azure Storage. Using metadata from the partition filenames, Parquet column statistics, and dictionary filtering allows faster performance for selective queries without reading all of the data. This talk will show how to use partitioning, row group skipping, and general data layout to speed up queries on large amounts of data.
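
As a minimal sketch of the workflow the abstract describes: reading a single Parquet file into Pandas and reading a partitioned dataset lazily with Dask, where column selection and row filters let the reader skip partitions and row groups. The paths and column names (events.parquet, the S3 bucket, "country", "value") are hypothetical; pandas.read_parquet and dask.dataframe.read_parquet are the standard entry points of the two libraries.

```python
import pandas as pd
import dask.dataframe as dd

# Read one Parquet file fully into memory as a Pandas DataFrame.
# The engine can be "pyarrow" (Apache Arrow) or "fastparquet".
df = pd.read_parquet("events.parquet", engine="pyarrow")

# Read a partitioned dataset (e.g. in cloud object storage) lazily with Dask.
# Only the requested columns are loaded, and the filter allows partitions and
# row groups whose statistics rule out matching rows to be skipped entirely.
ddf = dd.read_parquet(
    "s3://my-bucket/events/",          # hypothetical dataset location
    columns=["country", "value"],      # column projection
    filters=[("country", "==", "DE")], # partition / row-group pruning
)

# Computation is deferred until .compute() is called.
result = ddf.groupby("country")["value"].mean().compute()
print(result)
```

Selective queries like this benefit from a data layout where the filter columns are used as partition keys or are sorted within files, so that the Parquet statistics can exclude as much data as possible before any bytes are read.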