Out-of-Core Columnar Datasets

Formal Metadata

Title Out-of-Core Columnar Datasets
Title of Series EuroPython 2014
Part Number 79
Number of Parts 120
Author Alted, Francesc
License CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI 10.5446/19974
Publisher EuroPython
Release Date 2014
Language English
Production Place Berlin

Content Metadata

Subject Area Computer Science
Abstract Francesc Alted - Out-of-Core Columnar Datasets Tables are a very handy data structure for storing datasets on which to perform data analysis (filters, groupings, sortings, alignments...). But it turns out that how tables are actually implemented has a large impact on how they perform. Learn what you can expect from the current tabular offerings in the Python ecosystem. ----- It is a fact: we have just entered the Big Data era. More sensors and more computers, more evenly distributed throughout space and time than ever, are forcing data analysts to navigate through oceans of data before getting insights into what this data means. Tables are a very handy and widely used data structure for storing datasets on which to perform data analysis (filters, groupings, sortings, alignments...). However, the actual table implementation, and especially whether data in tables is stored row-wise or column-wise, whether the data is chunked or sequential, and whether the data is compressed or not, among other factors, can make a lot of difference depending on the analytic operations to be done. My talk will provide an overview of different libraries/systems in the Python ecosystem that are designed to cope with tabular data, and how the different implementations perform for different operations. The libraries or systems discussed are designed to operate either with on-disk data ([PyTables], [relational databases], [BLZ], [Blaze]...) or with in-memory data containers ([NumPy], [DyND], [Pandas], [BLZ], [Blaze]...). A special emphasis will be put on the on-disk (also called out-of-core) databases, which are the most commonly used ones for handling extremely large tables. The hope is that, after this lecture, the audience will get a better insight and a more informed opinion on the different solutions for handling tabular data in the Python world, and most especially, on which ones adapt better to their needs.
Keywords EuroPython Conference
EP 2014
EuroPython 2014
Transcript
[Chair] Francesc will talk about out-of-core columnar datasets. He is the author of PyTables, a developer of Blosc, and the maintainer of numexpr, so please give him a warm welcome.

Thank you very much for the kind introduction. My talk today is going to introduce you to out-of-core columnar datasets, and in particular I will be introducing bcolz, which is a new data container that supports keeping data on disk, column-wise and compressed. The name may seem strange, but you can think of it like this: "col" stands for columnar, and the final "lz" stands for Lempel-Ziv, the family of compression that bcolz uses a lot internally. Just a bit about me: I am the creator of tools like PyTables and now bcolz, and I am the long-time maintainer of numexpr, which is a package for evaluating expressions efficiently. I have been programming in Python for almost 15 years, mostly around high-performance computing and storage, and I also do consulting in these areas.
So, why yet another data container? We live in a world with widely different kinds of data containers; the NoSQL movement is an example of that, and we have a wide range of different databases and data containers even in Python. Why is that? Mainly because of the increasing gap between CPU and memory speed: if you understand this factor, you will understand why all of this is so important. Looking at the evolution of CPUs and memory, it is clear that CPUs have been getting faster much more quickly than memory, and this growing gap between memory access time and CPU speed means that the CPU is mostly doing nothing, just waiting for data, most of the time. That has a huge effect on how you should access your data.

First, columnar: when you are doing a query on a table, only the data in the columns that take part in the query needs to be accessed, and that basically means less time is required, which is very important when you are trying to get maximum speed. Let me show you an example. Suppose we have an in-memory row-wise table, the typical layout that an ordinary relational database uses. If you are evaluating an expression that uses just one column, then, due to how computers work with memory, you are not accessing only the column you are interested in: you are also bringing in the bytes next to it, for architectural reasons. So if this is an in-memory table, you are not bringing just nrows multiplied by 4 bytes into the CPU for a 4-byte column; you are bringing nrows multiplied by 64 bytes, because 64 bytes is the typical cache-line size in modern CPUs. That means you are bringing in ten times or more data than is necessary. With the column-wise approach, the data for each column is stored sequentially, so you bring into the cache essentially the exact amount of information that you need. That is the rationale behind why bcolz containers are column-wise.
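To make the cache-line effect concrete, here is a minimal sketch (not from the talk) contrasting the two layouts in NumPy; a structured array plays the role of the row-wise table, a plain array the columnar one, and exact timings depend on the machine:

    import numpy as np
    from timeit import timeit

    N = 10 * 1000 * 1000

    # Row-wise layout: all fields of a row are interleaved in memory, so
    # scanning one 4-byte field drags in whole 64-byte cache lines.
    rowwise = np.zeros(N, dtype=[('a', 'i4'), ('b', 'i4'),
                                 ('c', 'i4'), ('d', 'i4')])

    # Column-wise layout: the field is a separate, contiguous array.
    colwise = np.zeros(N, dtype='i4')

    print(timeit(lambda: rowwise['a'].sum(), number=10))  # strided scan
    print(timeit(lambda: colwise.sum(), number=10))       # sequential scan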
bcolz containers are also chunked. Chunking means that you store your data in different chunks rather than in one monolithic area. That is nothing new, but it does mean more difficulty in accessing the data, so why bother? The fact is that chunking allows efficient growing and shrinking of your datasets, and it also makes on-the-flight compression and decompression possible. Let me give you an example.
When we want to append data to an in-memory container like a NumPy array, we need to do a copy: we make a new memory allocation in a new location, then copy the original data over from the original area, and finally copy the data block to append at the end of the new area. This is extremely inefficient, precisely because of this gap between the CPU and memory. The way to append data to a chunked container is different: you only have to compress the new data (bcolz compresses by default), and you do not need all these copies, because basically what you are doing is adding a new chunk to the existing chunk list, which is very efficient.
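As a minimal sketch of the two append patterns just described (assuming the bcolz >= 0.7 API):

    import numpy as np
    import bcolz

    # NumPy: appending allocates a new area and copies everything.
    a = np.arange(1000 * 1000)
    a = np.append(a, np.arange(1000))   # copies the old million elements

    # bcolz: the new data is compressed into a fresh chunk that is simply
    # added to the chunk list; the existing chunks are never touched.
    ca = bcolz.carray(np.arange(1000 * 1000))
    ca.append(np.arange(1000))          # no copy of the old data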
And finally, why compression? The first reason for compression is that more data can be stored in the same amount of memory. Say your dataset is compressible with a compression ratio of 3x: you can then store three times more data using the same resources, which is very nice, but it is not the only reason. The other reason is that if you deal with compressed data in memory, you may even accelerate the computation itself. When executing a query, you need to transfer less information if the data is compressed in memory, and that can be a huge benefit: if the transmission time of the compressed data from memory to the CPU cache, plus the decompression time, is less than the time it takes for the uncompressed data to be transferred to the cache, then we can accelerate the computation as well. For that you need an extremely fast compressor, and Blosc is one of these compressors; in fact, Blosc can decompress data much faster than a plain memcpy memory copy. In the benchmark shown here, memcpy moves data at around 7 GB/s, while Blosc reaches a throughput of 35 GB/s.
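A small sketch of this with the python-blosc bindings (the GB/s figures above come from the slides and are hardware-dependent):

    import numpy as np
    import blosc

    data = np.linspace(0, 100, 10 * 1000 * 1000)  # compressible payload
    raw = data.tobytes()

    # Compress; typesize tells Blosc the item width so shuffling works.
    packed = blosc.compress(raw, typesize=data.dtype.itemsize)
    print(len(raw) / float(len(packed)))          # compression ratio

    restored = np.frombuffer(blosc.decompress(packed), dtype=data.dtype)
    assert np.array_equal(data, restored)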
That makes Blosc very interesting for bcolz, and in fact it is the compressor that bcolz uses. Regarding the implementation, one important thing about bcolz is that it is driven by the keep-it-simple-stupid principle, in the sense that we do not want to put a lot of functionality into it; we just want to create a very simple container, a very simple library, and nothing above that. So, what is bcolz exactly? As I said before, it is a columnar and compressed data container for Python. It offers two containers: the first one is carray and the other is ctable. It uses the Blosc compression library for the on-the-flight compression and decompression, and it is written in Python and Cython, with Cython used for accelerating the performance-critical parts.
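Creating the two containers looks roughly like this (a sketch, assuming the bcolz >= 0.7 API; the data is made up):

    import numpy as np
    import bcolz

    # carray: a chunked, compressed array container.
    ca = bcolz.carray(np.arange(1000 * 1000),
                      cparams=bcolz.cparams(clevel=5))
    print(repr(ca))   # the repr shows nbytes vs. cbytes, i.e. the ratio

    # ctable: a collection of carrays, one per column.
    ct = bcolz.ctable([np.arange(1000), np.linspace(0, 1, 1000)],
                      names=['f0', 'f1'])
    print(repr(ct))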
For example, the carray container, which is one of the flagship objects in bcolz, is for multidimensional data that can benefit from chunking and compression: it is basically the same concept as a NumPy array, but with the data split into chunks and with on-the-flight compression as well. The ctable object is basically a dictionary of carrays, which is very simple, but, as you can see, the data is chunked column-wise, so a query that uses only one column out of, say, seven only has to read the data it actually needs; and adding or removing columns is very cheap as well, because it is just a matter of inserting or deleting entries in a dictionary.
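A sketch of these dictionary-like column operations (hypothetical data, bcolz >= 0.7 API):

    import numpy as np
    import bcolz

    ct = bcolz.ctable([np.arange(1000)], names=['f0'])
    ct.addcol(np.random.rand(1000), name='f1')  # add a column: cheap
    ct.delcol('f0')                             # remove a column: cheap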
Now, persistence: carrays and ctables can live not only in memory but also on disk. The on-disk format that has been chosen by default is heavily based on Bloscpack, which is a library for compressing large datasets that Valentin Haenel has been working on for the past two years; he will be giving a talk about it at this conference.
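A sketch of the persistent mode (the 'mydata' directory name is just for illustration):

    import numpy as np
    import bcolz

    # Pass a rootdir and the carray lives on disk in the Bloscpack-based
    # format; mode='w' overwrites any previous container in that directory.
    ca = bcolz.carray(np.arange(1000 * 1000), rootdir='mydata', mode='w')
    ca.flush()                    # make sure all chunks hit the disk

    ca2 = bcolz.open('mydata')    # reopen later, operating out of core
    print(ca2[:10])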
Because of that, bcolz allows every operation to be executed directly on disk: bcolz operations work out of core, entirely on disk, and that means that everything you can do with the in-memory objects can also be done on disk. So you can address very large datasets that cannot fit in memory and run operations or queries on them. As for the way to do analytics with bcolz: as I said before, bcolz strives to be simple, so it is basically the data container with some iterators on top of it. There are two flavors of iterators: the plain one, which takes a `where` clause as the way to filter the data, and the blocked version, where instead of receiving one single element at a time you receive a block of elements, because in general it is much more efficient to work block by block. On top of that, the idea is that you use the itertools module in the standard library to consume these iterators and blocks, or, if you need more machinery, you can use the toolz or cytoolz packages in order to apply maps, filters, reductions, group-bys, joins, whatever, on top of that. This is the philosophy of bcolz. Also, if you cannot create bcolz containers from existing data containers, you are lost, so we created interfaces with the most important packages for handling data. bcolz has always been based on NumPy by default, but there is also support for PyTables, so you can, for example, index carrays and produce HDF5 files with them; and you can also import and export pandas DataFrames very easily, which, through pandas, gives you access to a whole collection of other formats as well.
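A sketch of the two iterator flavors combined with itertools, as just described (assuming the bcolz >= 0.7 API; the column names are made up):

    import itertools
    import numpy as np
    import bcolz

    ct = bcolz.ctable([np.arange(10000), np.random.rand(10000)],
                      names=['f0', 'f1'])

    # Plain iterator with a `where` filter (evaluated via numexpr):
    hits = ct.where('(f0 < 100) & (f1 > 0.5)')
    print(list(itertools.islice(hits, 5)))   # itertools consumes it

    # Block iterator: whole blocks at a time, much more efficient:
    total = 0.0
    for block in bcolz.iterblocks(ct['f1'], blen=1000):
        total += block.sum()
    print(total)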
So let me finish my talk with some benchmarks on real data. In particular, I will be using the MovieLens dataset, and you can find all the materials for the benchmarks I am going to show in a public repository, so let me show you the notebook. This is the notebook; you can find it in the repository, together with all the pre-processing, and here are the results, so you can get access to it and reproduce the results by yourself if you like, which, as you know, is very important. The MovieLens dataset is basically a collection of ratings: people rate movies, and a group of people collected these ratings and created different datasets. There are three of them: one with 100 thousand ratings, one with 1 million, and one with 10 million; the numbers that I am going to show are for the one with 10 million ratings. This is the way we load the MovieLens dataset: the code you see here uses pandas' read_csv for retrieving the CSV files and then produces one big DataFrame containing all the information from the data files. Pandas' query method allows you to query the DataFrame in a simple way. Then, with `ctable.fromdataframe`, we import the DataFrame and create a new container, a bcolz ctable, and this ctable container is where the data processing happens afterwards: you can pass exactly the same query string, and in fact this query uses numexpr behind the scenes, so it runs very fast; you are telling the container that you are interested just in the fields used for the query.
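The workflow in the notebook looks roughly like this (a sketch; the file name and the query are illustrative, not the exact ones from the materials):

    import pandas as pd
    import bcolz

    # pandas reads the CSV data into one big DataFrame...
    df = pd.read_csv('ml-10M/ratings.csv')

    # ...which is turned into a compressed, column-wise ctable:
    ct = bcolz.ctable.fromdataframe(df)

    # The same query string works on both; bcolz evaluates it with
    # numexpr behind the scenes and touches only the needed columns:
    sel_df = df.query('rating >= 4')
    sel_ct = [row for row in ct.where('rating >= 4')]
    print(len(sel_df), len(sel_ct))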
Here we have a view of the sizes of the datasets, and it turns out that this dataset is highly compressible. We can see that the pandas DataFrame takes a bit more than one and a half gigabytes, and the bcolz container for the same data, without compression, is a bit larger. But if you apply compression, the size of the dataset is reduced to less than 100 megabytes; that is a factor of almost 20x, which is very interesting.
But the most interesting thing is the query times. Pandas, and I do not need to tell you this, is extremely well regarded for getting high-performance queries; in fact the DataFrame is a column-oriented, column-wise container in memory as well, so it is a perfect match for a comparison. The time pandas spends doing this query is a little bit more than half a second; for bcolz without compression, the time is maybe 60 percent more than that, or something like that. And the most compelling thing, in my opinion, is that when you run the same query against the compressed container, it takes less time than with the uncompressed one, essentially because the time it takes to bring the compressed data into the CPU caches is much less than the time it takes to bring it in uncompressed. In the last rows you see the out-of-core case, bcolz operating from disk: with compression it is a little bit slower than in memory, but it is still faster than pandas in many cases. This is probably due to the fact that, although the bcolz container is stored on the filesystem, the system has probably already cached it in memory, so there is only a little extra overhead coming from the filesystem. This speed is very, very nice.
This has not always been the case, though. For example, when I ran the same benchmark on my laptop, which is three years old (the MacBook Air I am using for this presentation), we can see that pandas is the fastest again, with bcolz a little bit slower, and there the query against the compressed container is actually slower than against the uncompressed one. This is because Blosc decompression is not as efficient on older architectures: new CPUs are very fast compared with the old ones, and the gap that we are seeing here on a modern laptop keeps growing, so we are going to see this kind of speedup more and more in the future. Compression will be very important, in my opinion, in the future. So let me finish with a summary.
This has been an overview of bcolz; we released version 0.7.0 this week, so you may want to check it out. We have focused on refining the API and tweaking knobs to make things even faster; we have not invested in developing new features for now, just in making the containers much faster and more solid. One thing we still need to address is the work required to implement what we call super-chunks: right now, when you use persistence, every chunk ends up as a file on disk, and when you have a lot of chunks, that means you are wasting a lot of inodes; the idea is to tie different chunks together into these super-chunks to avoid that. The main goal of bcolz is to demonstrate that compression can help performance even when using in-memory data containers, and that is really important to me: I created Blosc something like five years ago, and back then my perception was that compression would help in this area; just five years later is when I am starting to see, with actual results on real data, that this promise is fulfilled.
We would like you to tell us about your experience: if you are using bcolz, tell us about your scenario, and if you are not getting the expected speedups or compression ratios, please tell us too. You can write to the mailing list, or you can always send bugs; please file them in the tracker. You may also want to have a look at the manual, which is online on the Blosc site, and you can have a look at the Bloscpack format that bcolz uses by default, and at Blosc itself. So thank you, and if you have any questions, I will be happy to take them.