We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Out-of-Core Columnar Datasets

Formal Metadata

Title
Out-of-Core Columnar Datasets
Title of Series
Part Number
79
Number of Parts
119
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production PlaceBerlin

Content Metadata

Subject Area
Genre
Abstract
Francesc Alted - Out-of-Core Columnar Datasets Tables are a very handy data structure to store datasets to perform data analysis (filters, groupings, sortings, alignments...). But it turns out that how the tables are actually implemented makes a large impact on how they perform. Learn what you can expect from the current tabular offerings in the Python ecosystem. ----- It is a fact: we just entered in the Big Data era. More sensors, more computers, and being more evenly distributed throughout space and time than ever, are forcing data analyists to navigate through oceans of data before getting insights on what this data means. Tables are a very handy and spreadly used data structure to store datasets so as to perform data analysis (filters, groupings, sortings, alignments...). However, the actual table implementation, and especially, whether data in tables is stored row-wise or column-wise, whether the data is chunked or sequential, whether data is compressed or not, among other factors, can make a lot of difference depending on the analytic operations to be done. My talk will provide an overview of different libraries/systems in the Python ecosystem that are designed to cope with tabular data, and how the different implementations perform for different operations. The libraries or systems discussed are designed to operate either with on-disk data ([PyTables], [relational databases], [BLZ], [Blaze]...) as well as in-memory data containers ([NumPy], [DyND], [Pandas], [BLZ], [Blaze]...). A special emphasis will be put in the on-disk (also called out-of-core) databases, which are the most commonly used ones for handling extremely large tables. The hope is that, after this lecture, the audience will get a better insight and a more informed opinion on the different solutions for handling tabular data in the Python world, and most especially, which ones adapts better to their needs.
Keywords