Francesc Alted - Out-of-Core Columnar Datasets
Tables are a very handy data structure for storing
datasets and performing data analysis (filters, groupings, sortings,
alignments...).
But it turns out that how the tables are actually implemented has a large impact on how they perform.
Learn what you can expect from the current tabular offerings in the Python ecosystem.
-----
It is a fact: we have just entered the Big Data era. More sensors and
more computers, more evenly distributed throughout space and time
than ever, are forcing data analysts to navigate through oceans of
data before getting insights into what this data means.
Tables are a very handy and widely used data structure for storing
datasets in order to perform data analysis (filters, groupings, sortings,
alignments...). However, the actual table implementation, and
especially whether data in tables is stored row-wise or column-wise,
whether the data is chunked or sequential, and whether it is compressed
or not, among other factors, can make a big difference depending on the
analytic operations to be performed.
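
As a rough illustration of the row-wise versus column-wise distinction, here is a minimal sketch using plain NumPy. The field names (`id`, `price`, `qty`) and the table size are made up for the example, and the sketch only covers in-memory layout, not chunking or compression. It shows that scanning one column of a row-wise table has to step over every whole record, while a column-wise table keeps each column contiguous.

```python
import numpy as np

n = 1_000  # hypothetical number of rows

# Row-wise layout: a single structured array, so all fields of one
# record sit next to each other in memory.
rows = np.zeros(n, dtype=[("id", "i8"), ("price", "f8"), ("qty", "i4")])
rows["price"] = np.arange(n, dtype="f8")

# Column-wise layout: one contiguous array per column.
cols = {
    "id": np.zeros(n, dtype="i8"),
    "price": np.arange(n, dtype="f8"),
    "qty": np.zeros(n, dtype="i4"),
}

# Distance in bytes between two consecutive "price" values.
row_stride = rows["price"].strides[0]  # size of a whole record
col_stride = cols["price"].strides[0]  # size of a single float64

# Both layouts hold the same data, so an aggregation gives the same answer;
# only the amount of memory touched to compute it differs.
same_result = rows["price"].sum() == cols["price"].sum()

print(row_stride, col_stride, same_result)
```

In this sketch, a query that touches only `price` reads every byte of every record in the row-wise case, which is why column-wise storage tends to suit analytic operations over a few columns, while row-wise storage tends to suit operations that fetch complete records.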
My talk will provide an overview of different libraries/systems in the
Python ecosystem that are designed to cope with tabular data, and how
the different implementations perform for different operations. The
libraries or systems discussed are designed to operate either with
on-disk data ([PyTables], [relational databases], [BLZ],
[Blaze]...) or with in-memory data containers ([NumPy],
[DyND], [Pandas], [BLZ], [Blaze]...).
Special emphasis will be placed on the on-disk (also called
out-of-core) databases, which are the most commonly used ones for
handling extremely large tables.
The hope is that, after this talk, the audience will have a better
insight into, and a more informed opinion on, the different solutions for
handling tabular data in the Python world, and most especially, which
ones best fit their needs.