
Out-of-Core Columnar Datasets


Formal Metadata

Title: Out-of-Core Columnar Datasets
Part Number: 79
Number of Parts: 119
Author: Francesc Alted
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Place: Berlin

Content Metadata

Abstract:
Tables are a very handy data structure to store datasets for performing data analysis (filters, groupings, sortings, alignments...). But it turns out that how tables are actually implemented has a large impact on how they perform. Learn what you can expect from the current tabular offerings in the Python ecosystem.

It is a fact: we have just entered the Big Data era. More sensors and more computers, more evenly distributed throughout space and time than ever, are forcing data analysts to navigate through oceans of data before getting insights into what the data means. Tables are a very handy and widely used data structure for storing datasets so as to perform data analysis (filters, groupings, sortings, alignments...). However, the actual table implementation, and especially whether data in tables is stored row-wise or column-wise, whether the data is chunked or sequential, and whether the data is compressed or not, among other factors, can make a lot of difference depending on the analytic operations to be done. My talk will provide an overview of different libraries/systems in the Python ecosystem that are designed to cope with tabular data, and how the different implementations perform for different operations. The libraries or systems discussed are designed to operate either with on-disk data (PyTables, relational databases, BLZ, Blaze...) or with in-memory data containers (NumPy, DyND, Pandas, BLZ, Blaze...). A special emphasis will be put on the on-disk (also called out-of-core) databases, which are the most commonly used for handling extremely large tables. The hope is that, after this lecture, the audience will have a better insight and a more informed opinion on the different solutions for handling tabular data in the Python world, and most especially, which ones adapt better to their needs.
Transcript: English (auto-generated)
Francesc Alted will talk about out-of-core columnar databases.
He is the creator of PyTables, a developer of Blaze, and a performance enthusiast. Give him a warm welcome, please. So, thank you very much, Oliver, for the introduction.
So, in my talk today, I am going to introduce you to out-of-core columnar datasets. And in particular, I will be introducing bcolz, which is a new data container that supports in-memory and on-disk columnar, chunked, compressed data.
bcolz may seem like a strange name, but you can think of it as "b" plus "columnar". And the final "lz" stands for the Lempel-Ziv codecs, which bcolz uses a lot internally. Okay, so, just a plug about me.
I am the creator of tools like PyTables, Blosc, and now bcolz, and I am the long-term maintainer of numexpr, which is a package for evaluating NumPy expressions very quickly. I am an experienced developer and trainer in Python, with almost 15 years of experience coding full-time in Python.
And I love high-performance computing and storage as well. So, I am also available for consulting. So, what? We have another data container, right? So, yeah. In my opinion, we are bound to live in a world
of wildly different data containers. The NoSQL movement is an example of that. We have a wide range of different databases and data containers, even in Python. And why? This is mainly because of the increasing gap between CPU and memory speeds.
If you understand this fact, you will understand why this is so important. Looking at the evolution of CPUs, it's clear that CPUs are getting much faster while memory speeds lag behind. This is creating a gap between memory access and computation,
and the CPU ends up doing nothing most of the time. And that has a huge effect on how you access your data containers. If you want more details, you can see my article, "Why Modern CPUs Are Starving and What Can Be Done About It".
So, why columnar? Well, when you are querying tabular data, only the interesting data is accessed. That basically means less input/output is required, and this is very important when you are trying to get maximum speed.
So, let me show you an example of that. Let's suppose that we have an in-memory row-wise table; this is the typical structured array in NumPy, stored row by row. So, for example, if you are doing a query, the interesting column is the second one, the int32 one.
Due to how computers work with memory, you are not accessing only the interesting column; you are also accessing the bytes next to this column. This is for architectural reasons. Typically, if this is in memory,
you are not bringing to the CPU just n rows multiplied by 4 bytes, but you are bringing into the caches n multiplied by 64 bytes. And 64 is because it's the typical size of the cache line in modern CPUs.
So, we are bringing roughly 10 times more data than is strictly necessary. In the column-wise approach, where you store the data of the same column sequentially, you will only be bringing into the cache the exact amount of information that you need. So, this is the rationale behind why column-wise tables are interesting.
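A minimal NumPy sketch of the layout difference described here (the field names and sizes are invented for illustration):

```python
import numpy as np

# Row-wise: a structured array interleaves all fields in memory, so
# scanning one column drags the neighboring bytes of every row along.
rows = np.zeros(1_000_000, dtype=[('f0', 'i4'), ('f1', 'i4'), ('f2', 'f8')])
print(rows.dtype.itemsize)   # 16 bytes per row
print(rows['f1'].strides)    # (16,): 16-byte jumps to read 4-byte values

# Column-wise: one contiguous array per column; a scan over 'f1' touches
# only the 4 bytes per row that the query actually needs.
cols = {name: np.ascontiguousarray(rows[name]) for name in rows.dtype.names}
print(cols['f1'].strides)    # (4,): fully sequential access
```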
Now, why chunking? Chunking means that you store your data in different chunks rather than in one monolithic container. That means more difficulty in handling the data, right? So, why bother? Well, the fact is that chunking allows efficient enlarging and shrinking of your datasets, and also makes on-the-fly compression possible. So, let me give you an example.
When we want to append data to a NumPy container, we need to reserve new memory, doing a malloc for a new location, then copy the original array into it, and finally copy the data to append at the end of the new area.
This is extremely inefficient because of the gap between the CPU and memory. Now, the way to append data in bcolz is different, because bcolz is chunked. If we want to append data, we only have to compress it,
because bcolz containers are compressed by default, and then you don't need the additional copy: basically, what you are doing is adding the new chunk (or chunks) to the existing list of chunks. So, it's very efficient.
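A small sketch of the two append strategies (assuming bcolz is installed; the array sizes are arbitrary):

```python
import numpy as np
import bcolz

a = np.arange(10_000_000)

# NumPy: appending reallocates a new array and copies everything over.
a2 = np.append(a, np.arange(10))

# bcolz: the carray compresses the new data into one more chunk and
# simply adds it to the list of chunks; no full copy of existing data.
ca = bcolz.carray(a)
ca.append(np.arange(10))
print(len(ca))   # 10_000_010
```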
And finally, why compression? Well, the first reason for compression is that more data can be stored in the same amount of media. So, if your original dataset is compressible, and let's say you can reach a compression ratio of 3x, you can store three times more data using the same resources,
which is great, but this is not the only reason. Another reason is that if you deal with compressed datasets, in memory or on disk, and you have to do computations on them, which typically execute in the CPU cache,
you will need to transfer less information if your data is compressed. And that can be a huge advantage. Now, if the time to transmit the compressed data from memory or disk to the cache, plus the decompression time,
is less than the time it takes for the original uncompressed dataset to be transferred to the cache, then we can accelerate computations as well. This is the second goal. And for that you need an extremely fast compressor.
So, Blosc is one of these compressors. The goal of Blosc is bringing data much faster than a plain memcpy memory copy can. Here is an example where memcpy reaches a speed of seven gigabytes per second,
while Blosc can reach a performance of 35 gigabytes per second. So Blosc is interesting for use in bcolz, and in fact it is part of bcolz.
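A minimal sketch of the kind of round trip Blosc performs, using the python-blosc bindings (the array contents are arbitrary):

```python
import numpy as np
import blosc

a = np.linspace(0, 100, 10_000_000)

# typesize tells Blosc the element size so its shuffle filter can
# reorder bytes for a better compression ratio.
packed = blosc.compress(a.tobytes(), typesize=a.itemsize)
print(a.nbytes / len(packed))   # achieved compression ratio

restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)
assert np.array_equal(a, restored)
```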
So, goals and implementation. One important thing, an essential thing I would say, about bcolz is that it is driven by the "keep it simple, stupid" principle, in the sense that we don't want to put a lot of functionality on top of it; we just want to create a very simple container
with very simple iterators on top of it. So what is bcolz exactly? As I said before, it's a columnar, chunked, compressed data container for Python. It offers two flavors of containers: the first one is carray and the other is ctable. It uses the powerful Blosc compression library
for on-the-fly compression and decompression. And it's 100% written in Python, with Cython for accelerating the performance-critical parts. So, for example, the carray container, which is one of the flavors of bcolz, is just a multidimensional data container for homogeneous data.
So it's basically the same concept as NumPy, but all the data is split into chunks, to allow those easy appends and also to allow compression. The ctable object, in turn, is basically a dictionary of carrays.
It's very simple, but as you can see, the chunks follow the column order, so queries over a few columns will fetch only the necessary information. Also, adding and removing columns is very cheap,
because it's just a matter of inserting and deleting entries in a Python dictionary.
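A short sketch of a ctable and the cheap column operations (the column names are invented for illustration):

```python
import numpy as np
import bcolz

# A ctable is essentially a dictionary of carrays, one per column.
ct = bcolz.ctable(
    columns=[np.arange(1_000_000), np.random.rand(1_000_000)],
    names=['id', 'score'],
)

# Adding or removing a column just touches one entry in that mapping.
ct.addcol(np.zeros(1_000_000, dtype='i4'), name='flags')
ct.delcol('flags')
print(ct.names)   # ['id', 'score']
```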
So, persistency. carray and ctable can live not only in memory, but also on disk. And for doing that, the format that has been chosen by default is heavily based on Bloscpack, a library for compressing and serializing large datasets that Valentin Haenel has been working on for the past years. And tomorrow, on Sunday, he will be giving a talk at the PyData conference.
So, the goal of bcolz here is to allow every operation to be executed entirely on disk. This persistency allows bcolz operations to run completely out of core. And that means that all the operations
that you can do with objects in memory can also be done on disk. So you can handle very large datasets that cannot fit in memory: you can do these operations on disk, or even queries.
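A minimal sketch of an on-disk container (the rootdir path is made up):

```python
import numpy as np
import bcolz

# rootdir puts the container on disk; every chunk is persisted there.
ca = bcolz.carray(np.arange(1_000_000), rootdir='mydata.bcolz', mode='w')
ca.append(np.arange(1_000_000))
ca.flush()                         # make sure pending chunks reach disk

ca2 = bcolz.open('mydata.bcolz')   # reopen the same container later
print(len(ca2))                    # 2_000_000
```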
So, the way to do analytics with bcolz is, as I said before, simple: bcolz strives to be simple. Basically, bcolz is a data container with some iterators on top of it. And there are two flavors of iterators, iter and where,
where where is the way to filter data, for example. And there are blocked versions of the iterators, where instead of receiving one single element, you receive a block of elements, because, in general, it's much more efficient to receive and work with blocks. On top of that, the idea is that you use itertools,
from the standard Python library, as building blocks. Or, if you need more machinery, you can use the excellent PyToolz and CyToolz packages in order to apply maps, filters,
group-bys, sort-bys, reduce-bys, joins, whatever, on top of that. This is the philosophy of bcolz.
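A sketch of the iterator style just described, combining where with itertools (the filter expression and column names are invented):

```python
import numpy as np
import bcolz
from itertools import islice

ct = bcolz.ctable(
    columns=[np.random.randint(0, 100, 1_000_000),
             np.random.rand(1_000_000)],
    names=['user_id', 'rating'],
)

# `where` evaluates the filter (via numexpr) and yields only the columns
# asked for; plain itertools machinery composes on top of the iterator.
hits = ct.where('(rating > 0.9) & (user_id < 10)', outcols=['user_id'])
print(list(islice(hits, 5)))   # first five matching rows
```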
Also, I recently implemented interoperability for bcolz, because if you cannot create bcolz containers from existing data containers, then you are lost. So, I created interfaces with the most important packages when you are talking about big data. For example, by default, bcolz has always been based on NumPy, but there is also support for PyTables. So, for example, you can do indexed queries
using PyTables, and you can store bcolz data into HDF5 files. But also, you can import and export data frames very easily from pandas. That gives you access to all these backends as well.
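A short sketch of the pandas round trip and the HDF5 hand-off (assuming the fromdataframe/todataframe/tohdf5 helpers in recent bcolz, with PyTables installed):

```python
import pandas as pd
import bcolz

df = pd.DataFrame({'user_id': [1, 2, 3], 'rating': [4.0, 3.5, 5.0]})

ct = bcolz.ctable.fromdataframe(df)   # pandas -> bcolz
df2 = ct.todataframe()                # bcolz -> pandas

# Hand the data over to HDF5 via PyTables:
ct.tohdf5('ratings.h5', nodepath='/ratings')
```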
Okay, so let me finish my talk with some benchmarks on real data. In particular, I will be using the MovieLens dataset. And you can find all the materials for the plots that I am going to show in this repository.
So, let me show you the notebook. Basically, what I did is a notebook; this is the notebook that you can find in the repo. And here is all the parsing, the processing, everything, and here are the results.
So, you can go to this repository and reproduce the results by yourself, if you like. Reproducibility is very important, as you know. So, the MovieLens dataset:
it's basically people rating movies, and there is a group of people that collected these ratings and created different datasets. There are three interesting datasets: one with 100,000 ratings, one with one million, and one with ten million. The numbers that I am going to show
are for the biggest one, the ten million ratings. So, this is the way to query the MovieLens dataset. Typically, what I am doing here is using pandas for reading the CSV files, and then producing a huge data frame
containing all the information from the data files. Then, to query in pandas, in recent versions you can use .query(), which gives you a simple way to query the data frame.
And then, with the bcolz ctable.fromdataframe, I import the data frame and create a new container, which is a bcolz container, a ctable container. Then, this ctable container
is queried through the where iterator, as I said before. You can pass exactly the same query as in pandas; in fact, these queries use numexpr behind the scenes, so they are very fast. And then, with the output column selection, you are telling the iterator that we are interested just in the user ID field for the query.
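A sketch of the two query paths just described (the file layout follows the MovieLens 10M distribution; the concrete filter is invented):

```python
import pandas as pd
import bcolz

# MovieLens 10M layout: user_id::movie_id::rating::timestamp
ratings = pd.read_csv('ml-10M100K/ratings.dat', sep='::', engine='python',
                      names=['user_id', 'movie_id', 'rating', 'timestamp'])

# pandas: query the data frame directly.
hits_pd = ratings.query('(rating >= 4) & (movie_id == 1)')['user_id']

# bcolz: import the frame, then filter with the where iterator
# (the expression is evaluated by numexpr behind the scenes).
ct = bcolz.ctable.fromdataframe(ratings)
hits_bc = [r.user_id for r in
           ct.where('(rating >= 4) & (movie_id == 1)', outcols=['user_id'])]
```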
So, here we have a view of the sizes of the datasets, and it turns out that this dataset is highly compressible. We can see that pandas takes a bit more than a gigabyte and a half, while the bcolz container for the same data frame is in fact a bit larger without compression. But if you apply compression, the size of the dataset is reduced to less than 100 megabytes.
So, that's a factor of almost 20, which is very interesting. But perhaps the most interesting thing is the query times. You know pandas is extremely
fine-tuned for getting high-performance queries; in fact, the pandas data frame is a column-oriented, column-wise container, in memory as well. So, it's a perfect match for a comparison. The time that it takes pandas for doing this operation,
this query, is a little bit more than half a second. For bcolz without compression, we can see that the time is maybe 60% less, or something like that. And the most compelling thing, in my opinion,
is that when you are doing the same query using the compressed container, it takes less time than using the uncompressed container. And this is essentially because the time that it takes to bring the compressed data into the CPU is much less than the time that it takes
to bring the data uncompressed. So, the upper bar means that bcolz is on disk, but using compression. It is a little bit slower than the in-memory case, but it's still faster than pandas.
And this is probably due to the fact that, although the bcolz container is stored on disk, the operating system probably has it already cached in memory. So, it has a little bit more overhead
because of the file system, but the speed is very nice as well. Now, this has not always been the case. For example, when I rerun the benchmark on a laptop which is three years old,
which is the one that I am using for the presentation, a MacBook Air, we can see that pandas is the fastest. bcolz is a little bit slower, and when you use the compressed container, it has an overhead.
This is because Blosc is not as efficient on all architectures; I mean, new CPUs are very fast compared with older ones. And that speedup that we are seeing here on my other laptop, my Linux box: we are going to see this kind of speedup more and more in the future.
So, compression will be very important, in my opinion, in the future. Let me finish with some status and an overview of bcolz. I released version 0.7.0 this week,
so you should check it out. We are focused on refining the API and tweaking knobs to make things even faster. We are not so interested in developing new features right now, but rather in making the containers, and also the iterators, much faster.
Also, we need to address better integration with Bloscpack. I am in contact with Valentin in order to implement what we call superchunks. Right now, every chunk is a file on the file system when you are using persistency, and when you have a lot of chunks, that means that you are wasting a lot of inodes.
So, the idea is to tie different chunks together into these superchunks in order to avoid this overhead. And the main goal of bcolz is to demonstrate that compression can help performance, even for in-memory data containers.
And that's very important because, I mean, I produced Blosc like five years ago, and although my perception was that compression would help in this area, it is only now, five years later, that I am starting to see actual results with real data
showing this promise is fulfilled. So, we would like you to tell us about your experience. If you are using bcolz, tell us about your scenario. If you are not getting the expected speedup or compression ratio, please tell us.
You can write to the mailing list, or you can always send bugs and patches; please file them in the GitHub repository. You can have a look at the manual, which is online at bcolz.blosc.org.
Then, you can have a look at the format that bcolz uses by default. And the whole Blosc ecosystem lives at blosc.org. So, thank you. And, if you have any questions, I will be glad to answer.