
Out-of-Core Columnar Datasets


Speech transcript
[Chair] Francesc is going to talk about out-of-core columnar datasets; he is a developer of PyTables and knows how to get performance out of the largest datasets.

Thank you very much for the kind introduction. My talk today is about the out-of-core problem for datasets, and in particular I will be introducing bcolz, which is a new data container that supports keeping data both in memory and on disk, and that compresses the data by default. About the name: the final "lz" stands for Lempel-Ziv, the family of compressors behind Blosc, which bcolz uses a lot internally. As for my background, I am a long-time maintainer of numexpr, which is a package for evaluating expressions quickly; I have almost 15 years of experience coding in Python, mainly around high-performance computing and storage; and I also do consulting.
So, why yet another data container? We are bound to live in a world with widely different kinds of data containers; the NoSQL movement is an example of that, and we have a wide range of different databases and data containers even within Python. Why is that? Mainly because of the increasing gap between CPU and memory speed; once you understand this factor, you will understand why all of this is so important. The evolution is clear: CPUs have been getting faster at a much higher rate than memory, and this growing gap between memory access time and CPU speed means the CPU is mostly doing nothing, waiting for data, most of the time. That has a huge effect on how you should access your data. Now, columnar storage: when you are doing a query on a table stored column-wise, you only access the data you actually need, and that basically means less memory bandwidth is required. This is very important when you are trying to get maximum speed. So let me show you an example. Suppose we have a row-wise table in memory; this is the typical record-oriented layout, and in memory it looks like this.
For example, if you are doing a query that involves just one column, with the interesting field being, say, 4 bytes wide, then due to how computers work with memory you are not accessing only the column you are interested in: for architectural reasons you are also fetching the bytes that sit next to it. So if this is the memory layout, you are not bringing in rows times 4 bytes, but rows times 64 bytes, and 64 bytes is because that is the typical cache-line size in modern CPUs. That means you are moving around sixteen (64/4) times more data than is necessary. In the column-wise approach, the data of each column is stored sequentially, so you bring into the cache exactly the amount of information that you need. This is the rationale behind why column-wise tables are faster.

Now, why chunking? Chunking means that you store your data in different chunks instead of in one monolithic buffer, and that means more difficulty in handling that data, so why bother? Well, the fact is that chunking allows efficient enlarging and shrinking of your datasets, and it also makes on-the-flight compression and decompression possible. Let me give you an example.
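As a minimal illustration of that effect (my own sketch, not from the talk), compare summing one field of a NumPy structured array, which is stored row-wise, against summing the same values stored as one contiguous column:

```python
import numpy as np
from timeit import timeit

n = 10_000_000
# Row-wise table (array of structures): each record occupies 16 bytes.
rows = np.zeros(n, dtype=[('a', 'f4'), ('b', 'f4'), ('c', 'f4'), ('d', 'f4')])
# Column-wise layout: the field of interest is one contiguous buffer.
col = np.ascontiguousarray(rows['a'])

# The strided access drags whole cache lines of unrelated bytes through
# the CPU; the contiguous column only moves the bytes that are needed.
print("row-wise:", timeit(lambda: rows['a'].sum(), number=10))
print("col-wise:", timeit(lambda: col.sum(), number=10))
```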
When we want to append data to a NumPy container, for example, we need to reserve a new, larger area of memory in a new location, then copy the original data into it, and finally copy the data to be appended at the end of the new area. This is extremely inefficient, again because of the gap between the CPU and memory. Now, the way bcolz appends data is different, because the container is chunked: to append, you only have to compress the incoming data (the container is compressed by default), and you do not need any copies of the existing data, because all you are doing is adding a new chunk to the list of chunks. That is very efficient. And finally, why compression?
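A quick sketch of that difference (assuming the bcolz 0.7-era API; np.append always reallocates and copies, while carray.append only touches the last chunk):

```python
import numpy as np
import bcolz

a = np.arange(1_000_000)
ca = bcolz.carray(a)            # chunked and compressed by default

# NumPy: every append allocates a fresh array and copies all old data.
a = np.append(a, np.arange(10))

# bcolz: the new data is compressed and attached as a new chunk;
# the existing chunks are never copied.
ca.append(np.arange(10))
print(len(ca))                  # 1000010
```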
The first reason for compression is that more data can be stored in the same amount of memory. You have your original dataset, and if it is compressible, let's say with a compression ratio of 3x, you can store three times more data using the same resources, which is very convenient. But this is not the only reason. The second one is that if you keep the data compressed in memory, then when you run a computation you need to transfer less information from memory to the CPU. If the transmission time of the compressed data from memory to the cache, plus the decompression time, is less than the time it takes the uncompressed data to be transferred to the cache, then we can actually accelerate the computation as well.
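Stated as a back-of-the-envelope condition (my notation, not from the slides): for N bytes of data, compression ratio c, memory bandwidth B and decompression throughput D, keeping the data compressed pays off whenever

$$\frac{N}{cB} + \frac{N}{D} \;<\; \frac{N}{B} \quad\Longleftrightarrow\quad D \;>\; \frac{B}{1 - 1/c}$$

so the larger the compression ratio, the slower the decompressor is allowed to be while still coming out ahead.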
For that you need an extremely fast compressor, and Blosc is one of them; in fact, Blosc can bring in data faster than memcpy(), a plain memory copy. On the slide there is an example where memcpy() tops out at about 7 GB/s, while Blosc reaches a throughput of 35 GB/s. So Blosc is very interesting to use here, and in fact it is an integral part of bcolz. One important thing about the design of Blosc inside bcolz is that it is driven by the keep-it-simple-stupid principle, in the sense that we do not want to put a lot of functionality into it; we just want a very simple container that does not get in your way.
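A rough way to see these figures from Python, via the python-blosc bindings (a sketch; the exact numbers depend heavily on the CPU and the number of cores):

```python
import numpy as np
import blosc
from timeit import timeit

# ~200 MB of nicely compressible doubles.
buf = np.linspace(0, 100, 25_000_000).tobytes()

packed = blosc.compress(buf, typesize=8)
print("ratio: %.1fx" % (len(buf) / len(packed)))

t = timeit(lambda: blosc.decompress(packed), number=5) / 5
print("decompression: %.1f GB/s" % (len(buf) / t / 2**30))
```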
So what is bcolz, exactly? As I said before, it is a columnar, compressed data container for Python. It offers two containers: the first one is carray and the other is ctable. It uses the Blosc compression library for on-the-flight compression and decompression, and it is written in Python and Cython, with Cython used for accelerating the interesting parts.
The carray container, which is one of the flagships of bcolz, holds homogeneous, multidimensional data that can benefit from being chunked: it is basically the same concept as a NumPy array, but the data is split into chunks, and it supports compression as well. The ctable object is basically a dictionary of carrays, which is a very simple design. As you can see, the data is chunked per column, so if you query only one out of, say, seven columns, only the chunks holding the necessary information for that column are fetched. Also, adding and removing columns is very cheap, because it is just a matter of inserting and deleting entries in a Python list.
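In code, the two containers look roughly like this (a sketch against the bcolz 0.7-era API):

```python
import numpy as np
import bcolz

# carray: a chunked, compressed, NumPy-like container.
ca = bcolz.carray(np.arange(10_000_000))
print(ca)   # the repr shows nbytes, cbytes and the compression ratio

# ctable: essentially a dictionary of carrays, one per column.
ct = bcolz.ctable([np.arange(5), np.linspace(0, 1, 5)], names=['i', 'x'])

# Adding or dropping a column only edits the list of columns; the data
# of the other columns is never moved.
ct.addcol(np.zeros(5), name='z')
ct.delcol('z')
```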
Regarding persistency, carray and ctable containers can live not only in memory but also on disk. The on-disk format that has been chosen by default is heavily based on Bloscpack, which is a library for compressing large datasets with Blosc that Valentin Haenel has been working on for the past two years; he will be giving a talk on it at this conference.
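Persistence is just an extra argument at creation time (a sketch; the directory name is arbitrary):

```python
import numpy as np
import bcolz

# Write: with a rootdir, the chunks are stored on disk instead of memory.
ca = bcolz.carray(np.arange(1_000_000), rootdir='mydata.bcolz', mode='w')
ca.flush()

# Read back later: data is decompressed lazily, on access.
ca2 = bcolz.open('mydata.bcolz')
print(ca2[:5])
```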
And the goal of bcolz is to allow operations to be executed directly on disk. This is one of the distinctive things about bcolz: the operations, which include queries in general, that you can do with in-memory objects can also be done on disk. So you can have very large datasets that cannot possibly fit in memory and still run your operations and queries on them, out-of-core.

As for the way to do analytics with bcolz: as I said before, bcolz strives to be simple, so it is basically a data container with some iterators on top of it. There are two flavours of iterators, iter() and where(), where where() is the way to filter your data; and then there are blocked versions of these iterators, where instead of receiving one single element at a time you receive a block of elements, because in general it is much more efficient to produce and consume blocks. On top of that, the idea is that you use the itertools module in the standard library to work with these iterators, or, if you need more machinery, you can use cytoolz, a Cython-accelerated package, in order to apply maps, filters, group-bys, joins, whatever, on top of them. This is the philosophy of bcolz.

I also recently implemented interfaces with other containers, because if you cannot create bcolz objects from existing data containers, you are lost. So there are interfaces with the most important packages when you are talking about data: bcolz has always been based on NumPy, but there is also support for PyTables, so you can for example move a ctable out-of-core into an HDF5 file; and you can import and export pandas DataFrames very easily, which gives you access to those ecosystems as well.
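A sketch of that iterator style (the expression syntax is the numexpr one, and itertools consumes the lazy stream):

```python
import itertools
import numpy as np
import bcolz

n = 1_000_000
ct = bcolz.ctable([np.random.rand(n), np.random.rand(n)], names=['x', 'y'])

# `where` lazily yields only the matching rows, so the full result never
# has to be materialized in memory at once.
hits = ct.where('(x > 0.99) & (y < 0.01)', outcols=['x', 'y'])
for row in itertools.islice(hits, 5):
    print(row.x, row.y)
```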
OK, so let me finish my talk with some benchmarks on real data. In particular, I will be using the MovieLens dataset, and you can find all the materials for the benchmarks I am going to show in a public repository, including the notebook I will walk through now. This is the notebook: you can find it in the repository together with all the pre-processing, and the results are included, so you can get access to it and reproduce the results by yourself if you like. Reproducibility, as you know, is very important.
The MovieLens dataset is basically built from people rating movies: a group of researchers collected these ratings and created several datasets out of them. There are three of them: one with 100 thousand ratings, one with 1 million, and one with 10 million. The numbers that I am going to show are for the one with 10 million ratings. This is the way to load the MovieLens dataset: the loader here uses pandas for reading the CSV files and then produces one big dataframe containing all the information from the data files. The natural way to query in pandas is the query() method, which allows you to express queries over the dataframe in a simple string form. For bcolz, ctable.fromdataframe() imports the dataframe and creates a new container, a bcolz ctable, and this ctable is where the data processing happens afterwards: you can pass exactly the same query string, and in fact the query is evaluated using numexpr behind the scenes. Here you are also telling the container that you are interested just in a few fields for the output.
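Condensed, the workflow looks like this (a sketch: the file path and column names are illustrative, not the exact ones from the notebook):

```python
import pandas as pd
import bcolz

df = pd.read_csv('ml-10M/ratings.csv')   # hypothetical path and columns

# pandas: query() evaluates the expression over the dataframe.
sel_pd = df.query('(rating >= 4) & (movie_id == 1)')

# bcolz: import the dataframe, then run the very same expression;
# the ctable evaluates it with numexpr behind the scenes.
ct = bcolz.ctable.fromdataframe(df)
sel_bc = list(ct.where('(rating >= 4) & (movie_id == 1)'))
```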
Here we have a view of the sizes of the datasets, and it turns out that MovieLens is highly compressible. We can see that the pandas dataframe takes a bit more than one and a half gigabytes, and the bcolz container for the same data, without compression, is a bit larger; but if you apply compression, the size of the dataset is reduced to less than 100 megabytes. That is a factor of almost 20 times, which is very interesting.
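Both sizes are easy to inspect on any container, which is how such a comparison can be reproduced (sketch):

```python
import numpy as np
import bcolz

ct = bcolz.ctable([np.arange(10_000_000), np.zeros(10_000_000)],
                  names=['a', 'b'])
# nbytes is the uncompressed footprint; cbytes is what is actually stored.
print(ct.nbytes, ct.cbytes, "%.1fx" % (ct.nbytes / ct.cbytes))
```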
But the most interesting thing is the query times. By the way, do not underestimate pandas here: it is extremely well tuned for high-performance queries, and in fact the dataframe is a column-oriented, in-memory container as well, so it is a perfect match for a comparison. The time pandas spends on this query is a little more than half a second. For bcolz without compression, the time is something like 60 percent of that. And the most compelling thing, in my opinion, is that running the same query against the compressed container takes less time than against the uncompressed one. That is essentially because the time it takes to bring the compressed data into the CPU caches is much less than the time it takes to bring the same data uncompressed. The last bar shows bcolz operating from disk, with compression: it is a little bit slower than the in-memory case, but it is still faster than pandas. This is probably due to the fact that, although the container is stored on the filesystem, the operating system has most likely cached the files in memory already, so there is only a little extra overhead from the filesystem. Anyway, this speed is very, very nice.
This has not always been the case, though. For example, when I ran the same benchmark on my own laptop, which is three years old (the MacBook Air I am using for this presentation), pandas is the fastest again, bcolz is a little bit slower, and the compressed container has a noticeable overhead. This is because Blosc is not as efficient on older architectures: new CPUs are very fast compared with the old ones, and given the gap that we are seeing here between my old laptop and a modern machine, we are going to see this kind of compression speedup more and more in the future. So compression will, in my opinion, be very important in the future.
So let me finish with a summary and an overview of where bcolz is going. We released version 0.7.0 this week, so you may want to check it out. We have focused on refining the API and on tweaking knobs for making things even faster; we have deliberately not invested in developing new features for now, just in making the containers faster and more solid. We also need to address better integration with Bloscpack in order to implement what we call super-chunks: right now, when you use persistency, every chunk is a file on disk, and when you have a lot of chunks that means you are wasting a lot of inodes, so the idea is to tie different chunks together into a single super-chunk to avoid that. And the main goal of bcolz is to demonstrate that compression can help performance even when your dataset lives in memory.
That is really important to me, because I produced Blosc something like five years ago, and back then my perception was that compression could help in exactly this way; only now, five years later, am I starting to see actual results with real data showing that this promise is fulfilled. So we would like you to tell us about your experience: if you are using bcolz, tell us about your scenario, and if you are not getting the expected speedups or compression ratios, please tell us too. You can write to the mailing list, or you can always report bugs; please do file them in the issue tracker. Also, have a look at the manual, which is online on the Blosc website, and at the description of the on-disk format that bcolz uses by default, which is based on Bloscpack and Blosc blocks.
So, thank you, and if you have any questions, I would be happy to take them.

Metadata

Formal metadata

Title Out-of-Core Columnar Datasets
Series title EuroPython 2014
Part 79
Number of parts 120
Author Alted, Francesc
License CC Attribution 3.0 Unported:
You may use, modify, copy, distribute and make the work or its content publicly available, in unchanged or modified form, for any legal purpose, provided that you credit the author/rights holder in the manner specified by them.
DOI 10.5446/19974
Publisher EuroPython
Publication year 2014
Language English
Production place Berlin

Content metadata

Subject area Computer science
Abstract Francesc Alted - Out-of-Core Columnar Datasets Tables are a very handy data structure to store datasets and to perform data analysis (filters, groupings, sortings, alignments...). But it turns out that how the tables are actually implemented makes a large impact on how they perform. Learn what you can expect from the current tabular offerings in the Python ecosystem. ----- It is a fact: we have just entered the Big Data era. More sensors and more computers, more evenly distributed throughout space and time than ever, are forcing data analysts to navigate through oceans of data before getting insights on what this data means. Tables are a very handy and widely used data structure to store datasets so as to perform data analysis (filters, groupings, sortings, alignments...). However, the actual table implementation, and especially whether data in tables is stored row-wise or column-wise, whether the data is chunked or sequential, and whether it is compressed or not, among other factors, can make a lot of difference depending on the analytic operations to be done. My talk will provide an overview of different libraries/systems in the Python ecosystem that are designed to cope with tabular data, and of how the different implementations perform for different operations. The libraries or systems discussed are designed to operate either with on-disk data ([PyTables], [relational databases], [BLZ], [Blaze]...) or with in-memory data containers ([NumPy], [DyND], [Pandas], [BLZ], [Blaze]...). A special emphasis will be put on the on-disk (also called out-of-core) databases, which are the most commonly used ones for handling extremely large tables. The hope is that, after this lecture, the audience will get a better insight and a more informed opinion on the different solutions for handling tabular data in the Python world, and most especially, on which ones adapt better to their needs.
Keywords EuroPython Conference
EP 2014
EuroPython 2014
