
Out-of-Core Columnar Datasets


Formal Metadata

Title: Out-of-Core Columnar Datasets
Part Number: 79
Number of Parts: 119
Author: Francesc Alted
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Place: Berlin

Content Metadata

Abstract:
Tables are a very handy data structure to store datasets for performing data analysis (filters, groupings, sortings, alignments...). But it turns out that how tables are actually implemented has a large impact on how they perform. Learn what you can expect from the current tabular offerings in the Python ecosystem.

It is a fact: we have just entered the Big Data era. More sensors and more computers, more evenly distributed throughout space and time than ever, are forcing data analysts to navigate through oceans of data before getting insights into what the data means. Tables are a very handy and widely used data structure for storing datasets so as to perform data analysis (filters, groupings, sortings, alignments...). However, the actual table implementation, and especially whether data in tables is stored row-wise or column-wise, whether the data is chunked or sequential, and whether the data is compressed or not, among other factors, can make a lot of difference depending on the analytic operations to be done. My talk will provide an overview of different libraries/systems in the Python ecosystem that are designed to cope with tabular data, and how the different implementations perform for different operations. The libraries or systems discussed are designed to operate either with on-disk data (PyTables, relational databases, BLZ, Blaze...) or with in-memory data containers (NumPy, DyND, Pandas, BLZ, Blaze...). A special emphasis will be put on the on-disk (also called out-of-core) databases, which are the most commonly used for handling extremely large tables. The hope is that, after this lecture, the audience will have a better insight and a more informed opinion on the different solutions for handling tabular data in the Python world, and most especially, which ones adapt better to their needs.
Transcript: English (auto-generated)
Francesc Alted will talk about out-of-core columnar databases.
He is the creator of PyTables, a developer of Blaze, and a performance enthusiast. Give him a warm welcome, please. So, thank you very much, Oliver, for the introduction.
So, in my talk today, I am going to introduce you to out-of-core columnar datasets. And in particular, I will be introducing bcolz, which is a new data container that supports in-memory and on-disk columnar, chunked, compressed data.
bcolz may seem like a strange name, but you can think of it as "b" plus "columnar". And the final "lz" stands for the Lempel-Ziv codecs, which bcolz uses a lot internally. Okay, so, just a plug about me.
I am the creator of tools like PyTables, Blosc, and now bcolz, and I am the long-term maintainer of numexpr, which is a package for evaluating NumPy expressions very quickly. I am an experienced developer and trainer in Python, with almost 15 years of experience coding full-time in Python.
And I love high-performance computing and storage as well. So, I am also available for consulting. So, what? We have another data container, right? So, yeah. In my opinion, we are bound to live in a world
of wildly different data containers. The NoSQL movement is an example of that. We have a wide range of different databases and data containers, even in Python. And why? This is mainly because of the increasing gap between CPU and memory speeds.
If you understand this fact, you will understand why this is so important. Looking at the evolution of CPUs, it's clear that CPUs are getting much faster while memory speeds lag behind. This is creating a gap between memory access and computation,
and the CPU ends up doing nothing most of the time. And that has a huge effect on how you access your data containers. If you want more details, you can see my article, "Why Modern CPUs Are Starving and What Can Be Done About It".
So, why columnar? Well, when you are querying tabular data, only the interesting data is accessed. That basically means less input/output is required, and this is very important when you are trying to get maximum speed.
So, let me show you an example of that. Let's suppose that we have an in-memory row-wise table; this is the typical structured array in NumPy, stored row by row. So, for example, if you are doing a query, the interesting column is the second one, the int32 one.
Due to how computers work with memory, you are not accessing only the interesting column; you are also accessing the bytes next to this column. This is for architectural reasons. Typically, if this is in memory,
you are not bringing to the CPU just n rows multiplied by 4 bytes, but you are bringing into the caches n multiplied by 64 bytes. And 64 is because it's the typical size of the cache line in modern CPUs.
So, we are bringing roughly 10 times more data than is strictly necessary. In the column-wise approach, where you store the data of the same column sequentially, you will only be bringing into the cache the exact amount of information that you need. So, this is the rationale behind why column-wise tables are interesting.
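A minimal NumPy sketch of the layout difference described here (the field names and sizes are invented for illustration):

```python
import numpy as np

# Row-wise: a structured array interleaves all fields in memory, so
# scanning one column drags the neighboring bytes of every row along.
rows = np.zeros(1_000_000, dtype=[('f0', 'i4'), ('f1', 'i4'), ('f2', 'f8')])
print(rows.dtype.itemsize)   # 16 bytes per row
print(rows['f1'].strides)    # (16,): 16-byte jumps to read 4-byte values

# Column-wise: one contiguous array per column; a scan over 'f1' touches
# only the 4 bytes per row that the query actually needs.
cols = {name: np.ascontiguousarray(rows[name]) for name in rows.dtype.names}
print(cols['f1'].strides)    # (4,): fully sequential access
```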
Now, why chunking? Chunking means that you store your data in different chunks rather than in one monolithic container. That means more difficulty in handling the data, right? So, why bother? Well, the fact is that chunking allows efficient enlarging and shrinking of your datasets, and also makes on-the-fly compression possible. So, let me give you an example.
When we want to append data to a NumPy container, we need to reserve new memory, doing a malloc for a new location, then copy the original array into it, and finally copy the data to append at the end of the new area.
This is extremely inefficient because of the gap between the CPU and memory. Now, the way to append data in bcolz is different, because bcolz is chunked. If we want to append data, we only have to compress it,
because bcolz containers are compressed by default, and then you don't need the additional copy: basically, what you are doing is adding the new chunk (or chunks) to the existing list of chunks. So, it's very efficient.
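A small sketch of the two append strategies (assuming bcolz is installed; the array sizes are arbitrary):

```python
import numpy as np
import bcolz

a = np.arange(10_000_000)

# NumPy: appending reallocates a new array and copies everything over.
a2 = np.append(a, np.arange(10))

# bcolz: the carray compresses the new data into one more chunk and
# simply adds it to the list of chunks; no full copy of existing data.
ca = bcolz.carray(a)
ca.append(np.arange(10))
print(len(ca))   # 10_000_010
```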
And finally, why compression? Well, the first reason for compression is that more data can be stored in the same amount of media. So, if your original dataset is compressible, and let's say you can reach a compression ratio of 3x, you can store three times more data using the same resources,
which is great, but this is not the only reason. Another reason is that if you deal with compressed datasets, in memory or on disk, and you have to do computations on them, which typically execute in the CPU cache,
you will need to transfer less information if your data is compressed. And that can be a huge advantage. Now, if the time to transmit the compressed data from memory or disk to the cache, plus the decompression time,
is less than the time it takes for the original uncompressed dataset to be transferred to the cache, then we can accelerate computations as well. This is the second goal. And for that you need an extremely fast compressor.
So, Blosc is one of these compressors. The goal of Blosc is bringing data much faster than a plain memcpy memory copy can. Here is an example where memcpy reaches a speed of seven gigabytes per second,
while Blosc can reach a performance of 35 gigabytes per second. So Blosc is interesting for use in bcolz, and in fact it is part of bcolz.
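A minimal sketch of the kind of round trip Blosc performs, using the python-blosc bindings (the array contents are arbitrary):

```python
import numpy as np
import blosc

a = np.linspace(0, 100, 10_000_000)

# typesize tells Blosc the element size so its shuffle filter can
# reorder bytes for a better compression ratio.
packed = blosc.compress(a.tobytes(), typesize=a.itemsize)
print(a.nbytes / len(packed))   # achieved compression ratio

restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)
assert np.array_equal(a, restored)
```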
So, goals and implementation. One important thing, an essential thing I would say, about bcolz is that it is driven by the "keep it simple, stupid" principle, in the sense that we don't want to put a lot of functionality on top of it; we just want to create a very simple container
with very simple iterators on top of it. So what is bcolz exactly? As I said before, it's a columnar, chunked, compressed data container for Python. It offers two flavors of containers: the first one is carray and the other is ctable. It uses the powerful Blosc compression library
for on-the-fly compression and decompression. And it's 100% written in Python, with Cython for accelerating the performance-critical parts. So, for example, the carray container, which is one of the flavors of bcolz, is just a multidimensional data container for homogeneous data.
So it's basically the same concept as NumPy, but all the data is split into chunks, to allow those easy appends and also to allow compression. The ctable object, in turn, is basically a dictionary of carrays.
It's very simple, but as you can see, the chunks follow the column order, so queries over a few columns will fetch only the necessary information. Also, adding and removing columns is very cheap,
because it's just a matter of inserting and deleting entries in a Python dictionary.
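A short sketch of a ctable and the cheap column operations (the column names are invented for illustration):

```python
import numpy as np
import bcolz

# A ctable is essentially a dictionary of carrays, one per column.
ct = bcolz.ctable(
    columns=[np.arange(1_000_000), np.random.rand(1_000_000)],
    names=['id', 'score'],
)

# Adding or removing a column just touches one entry in that mapping.
ct.addcol(np.zeros(1_000_000, dtype='i4'), name='flags')
ct.delcol('flags')
print(ct.names)   # ['id', 'score']
```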
So, persistency. carray and ctable can live not only in memory, but also on disk. And for doing that, the format that has been chosen by default is heavily based on Bloscpack, a library for compressing and serializing large datasets that Valentin Haenel has been working on for the past years. And tomorrow, on Sunday, he will be giving a talk at the PyData conference.
So, the goal of bcolz here is to allow every operation to be executed entirely on disk. This persistency allows bcolz operations to run completely out of core. And that means that all the operations
that you can do with objects in memory can also be done on disk. So you can handle very large datasets that cannot fit in memory: you can do these operations on disk, or even queries.
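A minimal sketch of an on-disk container (the rootdir path is made up):

```python
import numpy as np
import bcolz

# rootdir puts the container on disk; every chunk is persisted there.
ca = bcolz.carray(np.arange(1_000_000), rootdir='mydata.bcolz', mode='w')
ca.append(np.arange(1_000_000))
ca.flush()                         # make sure pending chunks reach disk

ca2 = bcolz.open('mydata.bcolz')   # reopen the same container later
print(len(ca2))                    # 2_000_000
```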
So, the way to do analytics with bcolz is, as I said before, simple: bcolz strives to be simple. Basically, bcolz is a data container with some iterators on top of it. And there are two flavors of iterators, iter and where,
where where is the way to filter data, for example. And there are blocked versions of the iterators, where instead of receiving one single element, you receive a block of elements, because, in general, it's much more efficient to receive and work with blocks. On top of that, the idea is that you use itertools,
from the standard Python library, as building blocks. Or, if you need more machinery, you can use the excellent PyToolz and CyToolz packages in order to apply maps, filters,
group-bys, sort-bys, reduce-bys, joins, whatever, on top of that. This is the philosophy of bcolz.
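A sketch of the iterator style just described, combining where with itertools (the filter expression and column names are invented):

```python
import numpy as np
import bcolz
from itertools import islice

ct = bcolz.ctable(
    columns=[np.random.randint(0, 100, 1_000_000),
             np.random.rand(1_000_000)],
    names=['user_id', 'rating'],
)

# `where` evaluates the filter (via numexpr) and yields only the columns
# asked for; plain itertools machinery composes on top of the iterator.
hits = ct.where('(rating > 0.9) & (user_id < 10)', outcols=['user_id'])
print(list(islice(hits, 5)))   # first five matching rows
```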
Also, I recently implemented interoperability for bcolz, because if you cannot create bcolz containers from existing data containers, then you are lost. So, I created interfaces with the most important packages when you are talking about big data. For example, by default, bcolz has always been based on NumPy, but there is also support for PyTables. So, for example, you can do indexed queries
using PyTables, and you can store bcolz data into HDF5 files. But also, you can import and export data frames very easily from pandas. That gives you access to all these backends as well.
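A short sketch of the pandas round trip and the HDF5 hand-off (assuming the fromdataframe/todataframe/tohdf5 helpers in recent bcolz, with PyTables installed):

```python
import pandas as pd
import bcolz

df = pd.DataFrame({'user_id': [1, 2, 3], 'rating': [4.0, 3.5, 5.0]})

ct = bcolz.ctable.fromdataframe(df)   # pandas -> bcolz
df2 = ct.todataframe()                # bcolz -> pandas

# Hand the data over to HDF5 via PyTables:
ct.tohdf5('ratings.h5', nodepath='/ratings')
```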
Okay, so let me finish my talk with some benchmarks on real data. In particular, I will be using the MovieLens dataset. And you can find all the materials for the plots that I am going to show in this repository.
So, let me show you the notebook. Basically, what I did is a notebook; this is the notebook that you can find in the repo. And here is all the parsing, the processing, everything, and here are the results.
So, you can go to this repository and reproduce the results by yourself, if you like. Reproducibility is very important, as you know. So, the MovieLens dataset:
it's basically people rating movies, and there is a group of people that collected these ratings and created different datasets. There are three interesting datasets: one with 100,000 ratings, one with one million, and one with ten million. The numbers that I am going to show
are for the biggest one, the ten million ratings. So, this is the way to query the MovieLens dataset. Typically, what I am doing here is using pandas for reading the CSV files, and then producing a huge data frame
containing all the information from the data files. Then, to query in pandas, in recent versions you can use .query(), which gives you a simple way to query the data frame.
And then, with the bcolz ctable.fromdataframe, I import the data frame and create a new container, which is a bcolz container, a ctable container. Then, this ctable container
is queried through the where iterator, as I said before. You can pass exactly the same query as in pandas; in fact, these queries use numexpr behind the scenes, so they are very fast. And then, with the output column selection, you are telling the iterator that we are interested just in the user ID field for the query.
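A sketch of the two query paths just described (the file layout follows the MovieLens 10M distribution; the concrete filter is invented):

```python
import pandas as pd
import bcolz

# MovieLens 10M layout: user_id::movie_id::rating::timestamp
ratings = pd.read_csv('ml-10M100K/ratings.dat', sep='::', engine='python',
                      names=['user_id', 'movie_id', 'rating', 'timestamp'])

# pandas: query the data frame directly.
hits_pd = ratings.query('(rating >= 4) & (movie_id == 1)')['user_id']

# bcolz: import the frame, then filter with the where iterator
# (the expression is evaluated by numexpr behind the scenes).
ct = bcolz.ctable.fromdataframe(ratings)
hits_bc = [r.user_id for r in
           ct.where('(rating >= 4) & (movie_id == 1)', outcols=['user_id'])]
```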
So, here we have a view of the sizes of the datasets, and it turns out that this dataset is highly compressible. We can see that pandas takes a bit more than a gigabyte and a half, while the bcolz container for the same data frame is in fact a bit larger without compression. But if you apply compression, the size of the dataset is reduced to less than 100 megabytes.
So, that's a factor of almost 20, which is very interesting. But perhaps the most interesting thing is the query times. You know pandas is extremely
fine-tuned for getting high-performance queries; in fact, the pandas data frame is a column-oriented, column-wise container, in memory as well. So, it's a perfect match for a comparison. The time that it takes pandas for doing this operation,
this query, is a little bit more than half a second. For bcolz without compression, we can see that the time is maybe 60% less, or something like that. And the most compelling thing, in my opinion,
is that when you are doing the same query using the compressed container, it takes less time than using the uncompressed container. And this is essentially because the time that it takes to bring the compressed data into the CPU is much less than the time that it takes
to bring the data uncompressed. So, the upper bar means that bcolz is on disk, but using compression. It is a little bit slower than the in-memory case, but it's still faster than pandas.
And this is probably due to the fact that, although the bcolz container is stored on disk, the operating system probably has it already cached in memory. So, it has a little bit more overhead
because of the file system, but the speed is very nice as well. Now, this has not always been the case. For example, when I rerun the benchmark on a laptop which is three years old,
which is the one that I am using for the presentation, a MacBook Air, we can see that pandas is the fastest. bcolz is a little bit slower, and when you use the compressed container, it has an overhead.
This is because Blosc is not as efficient on all architectures; I mean, new CPUs are very fast compared with older ones. And that speedup that we are seeing here on my other laptop, my Linux box: we are going to see this kind of speedup more and more in the future.
So, compression will be very important, in my opinion, in the future. Let me finish with some status and an overview of bcolz. I released version 0.7.0 this week,
so you should check it out. We are focused on refining the API and tweaking knobs to make things even faster. We are not so interested in developing new features right now, but rather in making the containers, and also the iterators, much faster.
Also, we need to address better integration with Bloscpack. I am in contact with Valentin in order to implement what we call superchunks. Right now, every chunk is a file on the file system when you are using persistency, and when you have a lot of chunks, that means that you are wasting a lot of inodes.
So, the idea is to tie different chunks together into these superchunks in order to avoid this overhead. And the main goal of bcolz is to demonstrate that compression can help performance, even for in-memory data containers.
And that's very important because, I mean, I produced Blosc like five years ago, and although my perception was that compression would help in this area, it is only now, five years later, that I am starting to see actual results with real data
showing this promise is fulfilled. So, we would like you to tell us about your experience. If you are using bcolz, tell us about your scenario. If you are not getting the expected speedup or compression ratio, please tell us.
You can write to the mailing list, or you can always send bugs and patches; please file them in the GitHub repository. You can have a look at the manual, which is online at bcolz.blosc.org.
Then, you can have a look at the format that bcolz uses by default. And the whole Blosc ecosystem lives at blosc.org. So, thank you. And, if you have any questions, I will be glad to answer.