
Data Formats for Data Science

Transcript
Please give a big welcome to Valerio.

Good morning everyone, and thank you very much for coming to "Data Formats for Data Science". A very quick slide about me: I currently work in a research unit doing data analytics, and I'm interested in machine learning and text data processing; recently I've also been working on deep learning. I'm a fellow Pythonista and one of the main organizers of our local Python community, so if you're interested please check out our Twitter account. We have run a couple of conferences over the last two years, one of them together with Python Italia, so please check that out if you're interested. It is also worth mentioning that the next edition will be held at the end of August, and the early-bird tickets are almost gone, so it is definitely worthwhile.

So, jumping back to the talk. This is a talk about data science, and the main goal is to point out some very interesting libraries for processing data in Python, according to the different formats the data may have, and to see what should, or could, be the most Pythonic way to deal with them. Data formats come into play first of all in the data processing step, where the question is: what is the best way to process the data? And since we are here, the better question is: what is the most Pythonic way? Data formats are also involved in data sharing, where the question is: what is the best way to share our data? And finally they matter for the presentation of data, that is, data visualization; for instance, one possible way to share results with humans is an interactive chart. Unfortunately we are not going into visualization today; I strongly suggest you follow the next talk, which is about interactive visualization. By the way, one of the most common ways to share data plus code plus documentation is the Jupyter notebook, and I'm quite sure most of
you have heard of Jupyter notebooks; if you haven't, please check out this great project.

So, data processing. The very first example of a data format we're going to look at is the textual data format, because it is the most common format we work with in our data processing steps. Let's consider a textual file containing numbers, a huge sequence of numbers, and let's see what is the best way to parse that kind of format in Python. The very simplest, most naive solution is to open the file, read it line by line, put the content into a list, and that's it. A slightly more Pythonic solution is to use a context manager rather than opening and closing the file by hand; either way, we end up with all the values stored in a Python list. Of course this is not very efficient: we are dealing with numbers, and Python lists are not very good at that, so a better way to do it is to use NumPy.
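As a quick illustration of the naive approach, here is a minimal sketch that reads a whitespace-separated text file of numbers inside a context manager; the file name is just an example.

```python
# Naive approach: read a text file of numbers line by line into a Python list.
numbers = []
with open("numbers.txt") as infile:          # the context manager closes the file for us
    for line in infile:
        # each line may hold one or more whitespace-separated values
        numbers.extend(float(token) for token in line.split())

print(len(numbers), numbers[:5])
```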
NumPy provides a very useful function for this out of the box: in case you have a textual file containing numbers that form a multidimensional array, you can use the loadtxt function, and in basically one line you get what you need, without worrying about the possible formatting issues your file may have. As output, loadtxt returns a NumPy array rather than a Python list, which is of course much more efficient for processing numbers. If we take a look at the documentation of the loadtxt function, we see it has many parameters: we may specify the type of the numbers we want, how comments are marked, converters for specific columns, or the number of dimensions; it is very simple to use. There is another function in the NumPy package, genfromtxt, which is basically the same, with one important difference: genfromtxt is able to load data from a textual file even when it has missing values. loadtxt expects a full matrix, so the number of rows and columns must match, while genfromtxt gives you a strategy to deal with the missing values that are very common in textual data.
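A minimal sketch of the two NumPy readers mentioned above; the file names, delimiter and dtype are illustrative assumptions.

```python
import numpy as np

# Dense file of numbers: loadtxt expects a complete matrix.
data = np.loadtxt("numbers.txt", dtype=np.float64, comments="#")
print(data.shape, data.dtype)

# File with missing entries: genfromtxt fills the holes
# (here with NaN, via filling_values).
sparse = np.genfromtxt("sparse.txt", delimiter=",", filling_values=np.nan)
print(sparse.shape)
```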
The next textual format you are likely to come across is of course the CSV file. CSV stands for comma-separated values, but in general you may have values in this format separated by different delimiters, not only commas: for instance tab characters, semicolons, spaces, or a combination of them. In this particular case we have a CSV file whose very first row is a header carrying the column names. So how do we process a CSV file? A very simple solution is in the Python standard library: the csv module, which is specifically devoted to processing these files. We open the file, create a reader, and then iterate over the file line by line; it is up to us to decide how to properly handle the information.
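A small sketch of the standard-library approach; the file name and delimiter are assumptions.

```python
import csv

with open("measurements.csv", newline="") as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    header = next(reader)            # the first row carries the column names
    for row in reader:
        # each row is a plain list of strings: converting the types is up to us
        print(dict(zip(header, row)))
```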
If you are more into the scientific ecosystem of Python, the very first solution that comes to mind when you think of CSV is pandas, with its read_csv function. This is very powerful and reads a CSV file in, again, just one line of code: you give it the path of the file and you get back a pandas DataFrame, packed and ready to use for data processing. If we look at the documentation of read_csv, we see it has many, many options, because when you process a CSV file you can come across very different situations in handling NaN or null values, unknown values, and so on. In this particular case we are not dealing with a file containing only numbers but with data of different types, so a DataFrame is the best structure for it. As you can see in the corner of the slide, pandas already provides many functions to process many data formats with just one line of code, in particular read_csv, read_excel, read_hdf, read_html and read_json, some of which we are going to see in a few minutes.
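A sketch of the pandas one-liner described above, with a few of the other readers listed for reference; the file names are placeholders.

```python
import pandas as pd

# One line: path in, DataFrame out. Column names are taken from the header row.
df = pd.read_csv("measurements.csv")
print(df.dtypes)
print(df.head())

# The same one-liner style exists for many other formats, for example:
# pd.read_excel("data.xlsx"), pd.read_json("data.json"), pd.read_html("page.html")
```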
Now let's take a slightly more complicated example (actually not that complicated): a CSV file where, unlike the first example, the first ten lines of the file are metadata rather than actual data. The idea is that we want to skip those lines when we load the data into the DataFrame. That is very simple: in pandas we just need one additional parameter, skiprows, saying how many rows we want to skip, and again we have a one-line solution for this kind of problem.
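A sketch of the skiprows trick, assuming a hypothetical file whose first ten lines are metadata.

```python
import pandas as pd

# The first 10 lines of this (hypothetical) file are free-form metadata,
# followed by the header row and the actual records.
df = pd.read_csv("measurements_with_metadata.csv", skiprows=10)
print(df.head())
```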
To sum up a bit on textual data formats: for the very first, simple example, to be Pythonic use context managers; NumPy handles files containing mostly (or only) numerical data, with the loadtxt and genfromtxt functions we saw; and pandas read_csv covers the rest. The format has some advantages: it is very easy to create and share, and very easy to process, as we saw. Of course it is not storage-friendly, but it is highly compressible. The other drawback, maybe the main disadvantage the format has, is that it does not support structured, hierarchical information, which we often need.

That brings us to the second example: binary data formats. If we think about the amount of space we need to represent numbers, first as native integers and floats and then as strings,
we can see that the storage required for numbers stored as strings grows with the number of characters, while the storage for numbers stored as numbers is basically constant. So the idea is to use the native representation and store the data in binary format. Space is not the only concern: there is also parsing time. When numbers are stored as text we waste time converting that text back into numbers, and a simple cast to float is not enough, because underneath there are C conversion functions running for every single value. A very simple way to store binary data in Python is the pickle module, which is included in the standard library. Here we have an array of ten thousand numbers, with shape 10 by 1000; we dump it to a binary file with the pickle dump function, and we can load it back from the binary file with pickle load. We don't need anything beyond the standard library: it's just Python.
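A minimal sketch of the pickle round trip described above, using the same 10 by 1000 array shape mentioned in the talk; the output file name is arbitrary.

```python
import pickle
import numpy as np

data = np.random.random((10, 1000))      # 10,000 numbers, shape 10 x 1000

# Dump the array to a binary file...
with open("data.pkl", "wb") as out:
    pickle.dump(data, out)

# ...and load it back: nothing beyond the standard library is needed.
with open("data.pkl", "rb") as inp:
    restored = pickle.load(inp)

assert np.array_equal(data, restored)
```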
Of course, the problem is that most of the time binary data is not just numbers: we also need metadata, some description of the binary data we want to use. In that case the option is to turn to a richer format, and the first one is the so-called HDF5 format, the Hierarchical Data Format. It is free and open source, it works great with both big and tiny data, and it is storage-friendly because it supports compression, which is a very nice feature. It is also development-friendly: it has a domain-specific language to query the data in your structure, and it is portable across multiple languages, which means you can use the same file regardless of whether the person you share it with uses Python, Java or any other language. So it has very
interesting features. In Python we have many libraries for it; the two most famous are PyTables and h5py, and I'm going to show a couple of examples with both of them, just to see the differences. If you want to create a new HDF5 file with h5py, you import the h5py module, create a new file, and create a new dataset, specifying the number of elements you want (in this case one hundred) and their type. You get back a dataset object which, as you can see at the bottom of the slide, behaves basically like a NumPy array, so it is very comfortable to work with, and we can also leverage the slicing feature: here we take a slice with a step of ten, and what we get back is again a NumPy array (note the integer division used to compute its size). h5py and NumPy are tightly integrated.
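A small h5py sketch along the lines of the example described above; the file and dataset names are made up.

```python
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    # Create a dataset of 100 integers inside the file.
    dset = f.create_dataset("counts", shape=(100,), dtype="i8")

    # The dataset behaves very much like a NumPy array...
    dset[...] = np.arange(100)

    # ...including slicing: every tenth element comes back as an ndarray.
    every_tenth = dset[::10]
    print(type(every_tenth), every_tenth)
```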
If we use the other library I mentioned, PyTables, it provides out of the box a series of built-in data structures for your HDF5 files: Arrays, CArrays, EArrays, VLArrays (which stands for variable-length arrays) and Tables, that is, structured records. In this particular case, at the bottom of the slide, we create a NumPy range array, then we create a new array node in the file (together with a title string that documents it, which is very useful) and we append the values we created before; then we create a table, specifying its records with a description that has an integer first field and a ten-character string second field. It is all very easy to use.
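A hedged PyTables sketch of the array-plus-table example described above; the record description (an integer field plus a 10-character string field) follows the talk, while the node names and values are illustrative.

```python
import numpy as np
import tables as tb

# Record description: an integer field plus a 10-character string field.
class Record(tb.IsDescription):
    identity = tb.Int32Col()
    name = tb.StringCol(10)

with tb.open_file("tables_example.h5", mode="w", title="PyTables demo") as h5:
    # A plain array node, with its own human-readable title.
    h5.create_array("/", "range10", np.arange(10), title="A small range")

    # A structured table, filled through its row iterator.
    table = h5.create_table("/", "records", Record, title="Some records")
    row = table.row
    for i in range(5):
        row["identity"] = i
        row["name"] = f"item-{i}"
        row.append()
    table.flush()
```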
Another important feature of HDF5 files is that we may define groups, so we can structure the information in the file hierarchically. Starting from the root we can create groups, create datasets, and append those datasets to the groups we created, so there is a specific path to follow when we want to access the data in the structure we built. Alternatively, we can create a new dataset by specifying its full path directly, and then access the dataset through that path rather than through the objects we created. Very easy.
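A short sketch of the two ways of building the hierarchy described above; group, dataset and file names are invented for illustration.

```python
import h5py
import numpy as np

with h5py.File("grouped.h5", "w") as f:
    # Build the hierarchy explicitly: a group, then a dataset inside it...
    sensors = f.create_group("sensors")
    sensors.create_dataset("temperature", data=np.random.random(1000))

    # ...or create a dataset by giving its full path directly.
    f.create_dataset("sensors/pressure", data=np.random.random(1000))

    # Either way, the data is addressed by its path inside the file.
    print(f["sensors/temperature"][:5])
```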
Finally, the last feature I want to show you regards data chunking, which is useful when you want to do out-of-core rather than in-core analytics. By default datasets are stored contiguously, but you can tell the HDF5 file that you want chunked storage, so that the data can be processed chunk by chunk. This is also useful if you want to leverage parallel data processing, which HDF5 actually supports.
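A minimal sketch contrasting the default contiguous layout with a chunked (and compressed) one; shapes and chunk sizes are arbitrary.

```python
import h5py

with h5py.File("chunked.h5", "w") as f:
    # Contiguous layout (the default): the whole dataset is one block on disk.
    f.create_dataset("contiguous", shape=(10000, 1000), dtype="f8")

    # Chunked layout: data is stored in fixed-size blocks, which is what makes
    # compression and block-wise (out-of-core) processing possible.
    f.create_dataset("chunked", shape=(10000, 1000), dtype="f8",
                     chunks=(100, 1000), compression="gzip")
```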
To show an example of that: MPI, through mpi4py, is integrated out of the box with the h5py library. In this piece of code we are modifying the same file from multiple processes: we write into the dataset, indexed by the process rank, an array of integers, and every process accesses and writes only its own slice of the dataset.
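A hedged sketch of the parallel pattern described above. It assumes an MPI-enabled build of h5py (the mpio driver) and mpi4py, and it would be launched through mpiexec; the file and dataset names are placeholders.

```python
# Run with e.g.:  mpiexec -n 4 python parallel_write.py
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("data", shape=(comm.Get_size(), 1000), dtype="i8")
    # Every process writes only its own row of the shared dataset.
    dset[rank, :] = np.full(1000, rank, dtype="i8")
```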
If you want to learn more about HDF5, I highly recommend this book, and there is also another talk about HDF5, going into much more detail, which is going to be on Friday.

The next binary format I want to show you is one I came across very recently: the so-called ROOT data format. How many of you here already knew about ROOT? OK, thank you very much. ROOT is a framework, a set of tools and also a data format, all included in one environment. It is used for data processing in general, but mostly in physics, and especially in particle physics, where ROOT is pretty much the standard for data analysis. It is really great. It is written natively in C++, but it ships with an extension in Python, sometimes referred to as PyROOT, and by the way ROOT 6, which is the latest version, ships with a Jupyter kernel, so you can leverage the ROOT functionality from inside the notebook. It defines a new binary format, which is well documented, and the basic idea is that it is based on the serialization of C++ objects.
At a glance: ROOT ships with an interactive shell, just like the Python one, which is very useful; you can write a sort of C++ code interactively in that shell, so you basically have a C++ interpreter, which is interesting from some points of view. And there is the graphical browser: you see a very long list of leaves in the file, and every time you open a leaf, which is a data container, you see a histogram of the data, its distribution, because most of the time that is what you want when you open ROOT files. But in case you want to go more into detail, so you want to extract the data from these files, it turns out that you have to write long and boring C++-style code even to perform very common operations: you basically have to access trees and leaves. The idea is that ROOT, rather than talking about datasets and groups like HDF5, talks about trees, branches and leaves, but the general idea is just the same; that is why I'm showing it here.
Here, for example, we access a tree, then we get the data out of the tree with a draw expression (these values with respect to those other values), we forward the output of the draw call into a C++ object, an anonymous histogram, and then we iterate over the entries and the bins of the histogram to get the content out. So to extract data from this format we basically have to write this spaghetti of code. Fortunately we have the Python bindings I already mentioned, and that is the general syntax to do it in Python, but as you can see it is not a very Pythonic style at all: there are new naming conventions, unlike the ones we are already used to, and it basically feels like writing C++ directly. There are a couple of projects I want to show you
and point you to; they are named rootpy and root_numpy. Here are a couple of examples; they are very nice projects. Take the same example we just saw, written with plain PyROOT: using the Get function with the name of the tree we want, we basically have to remember which method to call; with rootpy we can access the tree directly as an attribute, just like with a Python object. Another quirk ROOT has is that when you want to draw a histogram you have to define the y axis with respect to the x axis, which is sort of counterintuitive; they fixed that in the rootpy project, where you specify, as you would intuitively expect, the x axis with respect to the y axis, together with a where clause for the selection. And instead of moving the data into a weird anonymous object by passing a name string, with rootpy you ask for the result to be stored in a histogram object you defined yourself (here of a floating-point type), rather than in the anonymous object of the original example. Then there is root_numpy,
which is very useful when you just want to get the data and avoid processing those files with PyROOT at all: you just want to say, I want this tree, I want all the values of this branch, and give me back a NumPy array. That is exactly the goal of the root2array function: we pass the file, the name of the tree and the branch we want, and we get back a NumPy array. The nice thing is that these libraries are tightly integrated with PyROOT: here, for instance, we create a histogram using the rootpy library, we fill that object using the fill function from root_numpy, and then we draw it again using rootpy. So you can use the two libraries at the same time without worrying about the details, because that is up to the libraries.
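A hedged sketch of the root_numpy/rootpy workflow described above. It assumes a working ROOT installation plus the root_numpy and rootpy packages; the file, tree and branch names and the histogram binning are purely illustrative.

```python
from root_numpy import root2array, fill_hist
from rootpy.plotting import Hist

# One call: file name, tree name, branches -> NumPy (structured) array.
energies = root2array("events.root", treename="events", branches=["energy"])

# The libraries cooperate: build a rootpy histogram and fill it
# from the NumPy array we just extracted.
hist = Hist(50, 0.0, 500.0)
fill_hist(hist, energies["energy"])
```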
Finally, another interesting feature: these projects also ship with a root2hdf5 command-line utility to convert from the binary ROOT format to HDF5. OK, that's it for binary formats.

Now I want to go very quickly over another format, JSON, because it is very common, and I want to talk about it from a data-processing point of view rather than for API-specific reasons. When you have to deal with web APIs, JSON is by now the format of choice rather than XML, and the reasons are many; one of them is that it is less verbose. From the Python point of view it is also very easy to process, since we basically end up dealing with dictionaries and lists. In case you were wondering in which context JSON shows up here: JSON is the format under the hood of the Jupyter notebook, it is basically how the state of the notebook is stored. But the main reason I want to talk about JSON is that it is the format of choice for document databases, the so-called NoSQL databases.
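A tiny sketch of why JSON is so comfortable from Python: it maps directly onto dictionaries and lists via the standard-library json module. The record shown is invented.

```python
import json

# JSON maps naturally onto Python dictionaries and lists, which is what
# makes it so convenient to process (and what document databases store).
record = {"title": "Data Formats for Data Science",
          "tags": ["python", "hdf5", "csv"],
          "stats": {"views": 123, "likes": 7}}

text = json.dumps(record, indent=2)     # serialize to a JSON string
restored = json.loads(text)             # ...and back to Python objects
assert restored["stats"]["views"] == 123
```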
I will show you a couple of slides about a small benchmark I made, comparing the performance of HDF5 files with MongoDB, one of the most popular document databases. In this test we had a hundred thousand documents; those documents were structured, I mean they were textual documents, and the basic idea was to build a sort of information retrieval index: for each document I wanted to store all the terms and the frequencies of the terms appearing in the documents, and more specifically I also wanted to store the particular positions in the text where the terms occurred. So it is a sort of structured, hierarchical index I wanted to build, and since this is a hierarchical structure I was trying to test whether HDF5 could be a possible solution. What I found was that, from the processing point of view, the HDF5 format is not appropriate, because it takes much more time than MongoDB. For MongoDB I implemented two different versions, a flat storage and a compact storage, which differ in how the hierarchy is structured in the JSON documents used in the queries: one stores the position information explicitly in a nested object, the other encodes it inside the terms.
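A hedged sketch of the two document layouts compared in the benchmark, written with pymongo; the collection names and the exact schema are illustrative guesses, not the ones used in the original experiment.

```python
from pymongo import MongoClient

db = MongoClient()["ir_index"]

# "Nested" layout: frequency and positions stored explicitly per term.
db.nested.insert_one({
    "doc_id": 42,
    "terms": [{"term": "python", "freq": 3, "positions": [4, 18, 77]}],
})

# "Flat/compact" layout: the same information keyed by the term itself.
db.flat.insert_one({
    "doc_id": 42,
    "terms": {"python": {"freq": 3, "positions": [4, 18, 77]}},
})
```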
In terms of timing, the two MongoDB variants were basically the same; it was just a matter of deciding which one is easier to deal with programmatically. But if we look at storage, HDF5, with the very simple blosc filter it provides out of the box (a compression algorithm you can leverage), is definitely the solution to choose in case you have storage constraints, given the space the MongoDB database requires. Of course this is not a fair comparison in terms of efficiency, because with MongoDB you have a real database with indices in place, and there are many, many things we could optimize; that is not the point of this example. There is also, for instance, the possibility of distributing the data over multiple machines, and so on.

OK, another format of interest for this talk is HDFS, the data format
for Big Data. I'll show you a couple of slides taken from a notebook by Matthew Rocklin, which is quite interesting and worth reading, and which points out that there is a library called hdfs3 in the Python ecosystem. HDFS of course stands for Hadoop Distributed File System: it is the distributed file system Hadoop runs on top of, the data are divided into shards and distributed over several machines, and it is the de facto standard for Big Data. In Python we have this very nice hdfs3 library: it works very well on Linux machines, but it has some issues on Mac OS X machines (I only managed to make it work with some effort). It wraps a native C++ implementation of an HDFS client, so there is no Java along the way and you get a nice, fast Python interface. As an example, let's see how we might analyze CSV files distributed over the cluster.
Here we create a new file system object pointing to the cluster, and we list all the CSV files we have there. We can read just one file taken from the file system, using read_csv on it, and we get a pandas DataFrame back. More interestingly, we can read all the CSV files at once using a wildcard, so we open all the files matching the pattern and access the data through an executor, the client that gives you distributed computation. A funny thing is that if you execute this interactively you do not get your results immediately: the computation is lazy, so basically you get a handle on the data and you iterate over it just like a pandas DataFrame. It is very easy to use and very nice, and definitely worth looking at when you have to deal with HDFS. Finally, we can also operate on the resulting data frame, filter the data we have, and then go on with the rest of the processing.
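A hedged sketch of the hdfs3 plus dask pattern described above; the namenode host, port, paths and column name are hypothetical, and the exact lazy-execution API has varied across dask versions.

```python
from hdfs3 import HDFileSystem
import dask.dataframe as dd
import pandas as pd

hdfs = HDFileSystem(host="namenode", port=8020)
print(hdfs.glob("/data/csv/*.csv"))               # list the distributed CSV files

# Read a single file eagerly into a pandas DataFrame...
with hdfs.open("/data/csv/part-0001.csv", "rb") as f:
    df = pd.read_csv(f)

# ...or read them all lazily through dask, using a wildcard; nothing is
# computed until a result is actually requested.
ddf = dd.read_csv("hdfs:///data/csv/*.csv")
filtered = ddf[ddf["value"] > 0]
print(filtered.head())                            # this triggers the computation
```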
That's it for Big Data. One last mention I'd like to make is about columnar databases, because that is the direction in which the Big Data world is shifting these days: we are moving from the row-based, relational databases to the columnar ones. So far there are two families of tools in this space: the BigTable approach, like HBase or Cassandra, which is a data model based on a multidimensional map, and the MonetDB approach, which keeps the relational data model. The basic difference is that the data are organized by columns rather than by rows, which is very useful for analytics, because most of the time you end up analyzing data by going through columns rather than rows, and that is very efficient. The tool I want to show you is called MonetDB, and the reason I'm showing it is that it ships with built-in Python support: there is a Python interpreter embedded in it, so you can write Python (or R) code for your analytics right inside the database. In fact, MonetDB types are directly mapped to NumPy arrays, so the values of the columns in your database are transformed into NumPy arrays out of the box, and you leverage NumPy's fast processing very nicely. For instance, here we execute a query that calls a function defined directly inside the database, running in the database process: it returns a table with just one float column, the language of choice is Python, and in the body we simply create a random array of values and return it; that's it, we get a NumPy array back as a database table.
To see it working in a more concrete example, say we have two functions defined in MonetDB in which we leverage the functions of scikit-learn, so we are basically writing Python: here we compute a confusion matrix for some classification, and then a second function computes more statistics on top of the confusion matrix, creating a new table with all the information we want to report: accuracy, precision, sensitivity, specificity and F1. We store all this information, and note that we can even use a Python dictionary here, because it is Python running inside the database. We return the values, and the way we use it is simply embedded in a plain SQL query: we select the values from the two functions in a nested query, passing the values gathered from one as input to the other. Very easy to use. Of course this is a very quick example, so I highly suggest you check out the talk from one of the MonetDB guys, which covers in-database analytics with Python and MonetDB in much more detail.
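To make the in-database part concrete, here is the kind of computation such an embedded function would perform, written as plain Python with scikit-learn and NumPy; the labels are made up and the MonetDB SQL wrapper syntax is deliberately omitted.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Binary confusion matrix flattened into its four cells.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

metrics = {
    "accuracy":    (tp + tn) / (tp + tn + fp + fn),
    "precision":   tp / (tp + fp),
    "sensitivity": tp / (tp + fn),          # a.k.a. recall
    "specificity": tn / (tn + fp),
}
metrics["f1"] = (2 * metrics["precision"] * metrics["sensitivity"]
                 / (metrics["precision"] + metrics["sensitivity"]))
print(metrics)
```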
OK, so that is basically it, but there are a couple of things I want to show before closing: some tools we may find along the way, tools rather than formats, actually, and I want to point you to a couple of very interesting and very easy-to-use ones that now belong to the Python ecosystem. These tools are xarray and blaze. xarray provides a sort of common interface over the formats we have seen, and its DataArray can be thought of as an intermediate structure between the NumPy ndarray and the pandas DataFrame, because a DataArray is basically a labelled ndarray. The idea is: I want a NumPy-like multidimensional array, but I want to label the values of the rows and the columns, so that I can access rows and columns by name rather than just by integer index. That is the labelled array.
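A small xarray sketch of the labelled-array idea described above; the dimension names, coordinates and values are invented.

```python
import numpy as np
import xarray as xr

# A labelled 2-D array: rows and columns are addressed by name, not by index.
temps = xr.DataArray(
    np.random.random((3, 4)),
    dims=("city", "quarter"),
    coords={"city": ["Bilbao", "Trento", "Berlin"],
            "quarter": ["Q1", "Q2", "Q3", "Q4"]},
)

print(temps.sel(city="Trento", quarter="Q2"))   # label-based selection
print(temps.mean(dim="quarter"))                # reductions over named dimensions
```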
xarray is built around the so-called NetCDF format, which is quite popular in the geosciences, and it is based on a common data model that basically allows you to integrate HDF5 files, HDFS and other formats behind one single data format, which is very useful. Blaze, on the other hand, can be considered a sort of extension of NumPy, because it is targeted at out-of-core processing, which is basically a requirement when you have to deal with a lot of numerical data. In this couple of examples taken from the documentation, you can create a Data object with blaze which is talking to a database rather than to an in-memory array, and you work with it in basically the same way; and with xarray you can create a DataArray gathering data from a pandas DataFrame rather than from a NumPy array, and then operate over the data just as you would in NumPy. So, I think that's it.
To draw a conclusion, I would say: complicated data require complicated formats, and complicated formats require very good tools; but fortunately we have Python and all the tools in its ecosystem for that. Thank you very much for your attention.

Thank you very much, Valerio. Unfortunately we don't have any time for questions; the next session is coming up shortly, but Valerio will be happy to answer your questions outside. Thank you.

Metadata

Formal Metadata

Title: Data Formats for Data Science
Series Title: EuroPython 2016
Part: 84
Number of Parts: 169
Author: Maggio, Valerio
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, modify and reproduce, distribute and make the work or its content publicly available in unmodified or modified form for any legal, non-commercial purpose, provided that you credit the author/rights holder in the manner specified by them and pass on the work or this content, even in modified form, only under the terms of this license.
DOI: 10.5446/21234
Publisher: EuroPython
Publication Year: 2016
Language: English

Content Metadata

Subject Area: Computer Science
Abstract: Valerio Maggio - Data Formats for Data Science. The CSV is the most widely adopted data format. It is used to store and share *not-so-big* scientific data. However, this format is not particularly suited in case data require any sort of internal hierarchical structure, or if data are too big. To this end, other data formats must be considered. In this talk, the different data formats will be presented and compared w.r.t. their usage for scientific computations along with corresponding Python libraries. ----- The *plain text* is one of the simplest yet most intuitive formats in which data could be stored. It is easy to create, human and machine readable, *storage-friendly* (i.e. highly compressible), and quite fast to process. Textual data can also be easily *structured*; in fact to date the CSV (*Comma Separated Values*) is the most common data format among data scientists. However, this format is not properly suited in case data require any sort of internal hierarchical structure, or if data are too big to fit in a single disk. In these cases other formats must be considered, according to the shape of data, and the specific constraints imposed by the context. These formats may leverage *general purpose* solutions, e.g. [No]SQL databases, HDFS (Hadoop File System); or may be specifically designed for scientific data, e.g. hdf5, ROOT, NetCDF. In this talk, the strengths and flaws of each solution will be discussed, focusing on their usage for scientific computations. The goal is to provide some practical guidelines for data scientists, derived from the comparison of the different Pythonic solutions presented for the case study analysed. These will include `xarray`, `pyROOT` *vs* `rootpy`, `h5py` *vs* `PyTables`, `bcolz`, and `blaze`. Finally, a few notes about the new trends for **columnar databases** (e.g. *MonetDB*) will also be presented, for very fast in-memory analytics.
