
Data Formats for Data Science


Formal Metadata

Title
Data Formats for Data Science
Title of Series
Part Number
84
Number of Parts
169
Author
Valerio Maggio
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Valerio Maggio - Data Formats for Data Science

The CSV is the most widely adopted data format. It is used to store and share *not-so-big* scientific data. However, this format is not particularly suited when data require any sort of internal hierarchical structure, or when data are too big. To this end, other data formats must be considered. In this talk, the different data formats will be presented and compared w.r.t. their usage for scientific computations, along with the corresponding Python libraries.

-----

*Plain text* is one of the simplest yet most intuitive formats in which data can be stored. It is easy to create, human- and machine-readable, *storage-friendly* (i.e. highly compressible), and quite fast to process. Textual data can also be easily *structured*; in fact, to date the CSV (*Comma Separated Values*) is the most common data format among data scientists. However, this format is not properly suited when data require any sort of internal hierarchical structure, or when data are too big to fit on a single disk. In these cases, other formats must be considered, according to the shape of the data and the specific constraints imposed by the context. These formats may leverage *general purpose* solutions, e.g. [No]SQL databases or HDFS (Hadoop File System), or may be specifically designed for scientific data, e.g. hdf5, ROOT, NetCDF. In this talk, the strengths and flaws of each solution will be discussed, focusing on their usage for scientific computations. The goal is to provide some practical guidelines for data scientists, derived from the comparison of the different Pythonic solutions presented for the case study analysed. These will include `xarray`, `pyROOT` *vs* `rootpy`, `h5py` *vs* `PyTables`, `bcolz`, and `blaze`. Finally, a few notes about the new trends for **columnar databases** (e.g. *MonetDB*) will also be presented, for very fast in-memory analytics.
Transcript: English (auto-generated)
So, I'm happy to introduce the next speaker to you, Valerio Maggio; he's with FBK. And, yep, please give a big welcome to Valerio.
Good morning, everyone, and thank you very much for coming to today's talk on data formats for data science. A very quick slide about me: I'm a postdoc researcher at FBK, currently in the complex data analytics unit. I'm interested in machine learning and text data processing, and recently some divergences into deep learning and stuff like that. I've been a fellow Pythonista since 2006, and I'm one of the main organizers of PyData Italy, which I ask everyone interested here to check out.
We have a Twitter account, and we had a couple of conferences in the last two years, one this year in Florence, together with Python Italia. It was fun.
We had a lot of fun, so please check it out if you're interested. Another thing worth mentioning: there will be a EuroSciPy in London this year, at the end of August, and the early bird tickets actually end today, but it's definitely worthwhile. Since you're in the PyData track here, I think it's a great conference, and you should definitely think about coming. And that's basically it for the slide about me. So thank you.
Actually joking, yeah. So, back to the serious part of the talk: data formats for data science. The main goal of my talk is to point you to some very interesting libraries for processing data in Python, according to the different formats the data may have.
And moreover, let's try to see what should be, or could be, the most Pythonic way to do that. Data formats come into play in the data processing step, of course. So in that case the question is: what's the best way to process data?
And since we're Pythonistas here, the better question should be: what's the most Pythonic way to do that? We're gonna see some examples of that. Data formats are also involved in data sharing. For instance, what's the best way to share our data?
And that's basically the second part of the processing: the presentation of data, so data visualization. For instance, one possible answer to that is to share interactive charts for data visualization. Unfortunately, we're not going into this, but I strongly suggest you follow the next talk about Bokeh, which is a very great library for that. And by the way, the most common format to date to share data, and indeed data plus code plus documentation, is the Jupyter Notebook. I'm quite sure that many of you here already know what the Jupyter Notebook is, but in case you don't, please check out this very great project. So, back to data processing. The very first example of a data format we're gonna see is the textual data format,
because it's the most common data format we're gonna work with in our data processing step. Let's consider a textual file basically containing numbers, so a huge sequence of numbers, and let's see the best way to process that type of format in Python. Of course, the most trivial solution is to open the file, read it line by line, put the content in a list, and that's it.
Probably a more Pythonic solution is using a context manager, rather than explicitly opening and closing the file. That's more Pythonic, of course. And basically, we have what we need: we store all the information from the file.
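[Editor's note: a minimal sketch of that plain-Python approach, assuming a whitespace-separated file of numbers; the file name is hypothetical.]

```python
# Read a whitespace-separated file of numbers into a list of lists,
# using a context manager so the file is closed automatically.
with open("numbers.txt") as f:
    rows = [[float(x) for x in line.split()] for line in f]
```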
Of course, this is not so efficient, because we have to deal with numbers, and Python lists are not very good at that. So probably a better way is using NumPy, of course. NumPy to the rescue: NumPy provides, out of the box, a very useful function for that.
In case you have a textual file containing numbers that are basically matrices or multidimensional arrays, you may leverage the loadtxt function. In basically one line, you get what you need, without being worried or concerned about the possible format problems you may have in your file. And as output, of course, NumPy's loadtxt returns a NumPy array rather than a Python list, which is, of course, more efficient for processing numbers.
If we take a look at the loadtxt function in the documentation, we see we have many, many parameters here: we may specify the type of numbers we want in output, handle comments, convert specific columns, or specify the number of dimensions for the output. That's very simple to use.
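[Editor's note: a hedged sketch of that one-liner; the file name is hypothetical.]

```python
import numpy as np

# loadtxt reads a full numeric matrix in one call; dtype and comments
# are two of the many parameters mentioned above.
data = np.loadtxt("numbers.txt", dtype=np.float64, comments="#")
print(data.shape, data.dtype)
```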
There is another function in the NumPy library, genfromtxt, which is basically the same, with the key difference that it is able to load data from a textual file even when you have missing values in it. loadtxt expects you to have a full matrix, so the number of rows and columns should match; with genfromtxt, you have a way to specify a strategy to deal with the missing values in the file.
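[Editor's note: a minimal sketch of genfromtxt with a missing-value strategy; the file name and fill value are hypothetical.]

```python
import numpy as np

# genfromtxt tolerates holes in the matrix; filling_values is one way
# to specify the strategy for missing entries.
data = np.genfromtxt("sparse.csv", delimiter=",", filling_values=0.0)
```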
Another very common textual format you may come across is, of course, the CSV file. CSV stands for comma-separated values, but in general, you may have values in this format separated by different characters, not only commas: for instance, tab characters, or spaces, or a combination of those. In this particular case, we have a CSV file whose very first row is the header, so it carries the column information; that's quite often the case when we process CSV files.
If we take a look at a very simple solution in Python: in the standard library, we have the csv module, which is specifically devoted to processing CSV files. In this case, we open the file, we create the reader, and that's it. We iterate over the file line by line, and it's up to us to decide how to properly store the information we process from the file.
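[Editor's note: a minimal sketch of the standard-library approach; the file name is hypothetical.]

```python
import csv

# The first row is treated as the header; storing the remaining rows
# is up to the caller.
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)
```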
If you're more into the scientific ecosystem of Python, I think the very first solution that comes to mind when you think of CSV files is pandas, of course, because pandas is very good at that. Pandas ships with the read_csv function, and it's very simple: again, just one line of code. You pass the path of the file, and that's it. As output, you have a pandas DataFrame, packed and ready to use for data processing.
If we take a look at the documentation of read_csv, we see that we have many, many options, because when you process CSV files, you may come across big differences in the formats and in the handling of null values, non-number values, and stuff like that. So in this particular case, the basic idea is that we're not dealing with a file containing only numbers, but also data of different types, and the DataFrame is the best way to handle that. And of course, as you may see in the left corner of the slide, pandas provides many, many functions to process many data formats with just one line of code; in particular, we see read_csv, read_excel, read_hdf, read_html, and read_json, which cover some other formats we're gonna see in a very few minutes.
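[Editor's note: the one-liner in question, with a hypothetical path.]

```python
import pandas as pd

# One line of code, and the result is a DataFrame ready for processing.
df = pd.read_csv("data.csv")
```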
Let's have a more complicated, actually not so complicated, example of a CSV file. The difference from the first example is that here the first 10 lines of the file are metadata, not actual data. So the idea is that we want to skip those lines when we load the data into the DataFrame. And that's very simple: in pandas, as you may see over there, we just need an additional parameter, skiprows, where we say how many rows we want to skip, and that's it. So again, pandas is the solution for this kind of thing.
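[Editor's note: the same call with the extra parameter; the file name is hypothetical.]

```python
import pandas as pd

# Skip the ten metadata lines at the top of the file before parsing.
df = pd.read_csv("data_with_metadata.csv", skiprows=10)
```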
So, to sum up a bit on the textual data formats, from the first and simple examples we saw: to be Pythonic, of course, use context managers. NumPy and pandas are the solutions if you're doing data processing: NumPy mostly for numerical data, that is, data containing just numbers, and pandas for CSV; loadtxt and read_csv, respectively, were the functions we saw. The textual data format has some advantages: it is very easy to create, recreate, and share, and it's very easy to process, as we saw. But of course, it's not so storage-friendly, although it is highly compressible. And moreover, another drawback, let's say another disadvantage, of the format is that it does not support structured information: in case we need some hierarchy in our data, a textual format is not the proper format to use. So we come to the second example here, the binary data formats.
We start by thinking about how much space, that is, how many bytes, we need to represent numbers. We may see, for instance, integers and floats in native versus string representation in this example here. As you can see, while the storage required for numbers stored as strings increases with the number of characters, the number of bytes required for numbers stored as numbers is basically constant, according to the type, of course. So the idea is to use that representation and store the data in its original format, namely a binary format.
But of course, space is not the only concern; speed matters, too. When we have numbers stored in textual files, we basically lose time converting that text into numbers, because the conversion to int or float is not so efficient, due to the underlying C functions, atoi or atof. A very simple way to store binary data in Python is, for instance, using the pickle module, which is included in the standard library. We have an array here: an array of 10,000 numbers, reshaped to 10 × 1,000. And we store it in a binary file with the pickle dump function.
Then we may load it again from the binary file using pickle load. That's very simple to use, and we basically don't need anything extra, because it's the standard library, so it's just Python.
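[Editor's note: a sketch of that round trip, with a hypothetical file name.]

```python
import pickle
import numpy as np

# A 10 x 1000 array dumped to a binary file and loaded back.
a = np.arange(10000).reshape(10, 1000)
with open("array.pkl", "wb") as f:
    pickle.dump(a, f)
with open("array.pkl", "rb") as f:
    b = pickle.load(f)
assert (a == b).all()
```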
But of course, the problem in this case is that when we want to store binary data, it's not just numbers: most of the time, we also need metadata, or some description, in the binary format we want to leverage. In that particular case, the option is to think of another format, and there actually is one: the so-called HDF5 format, which is the hierarchical data format. It is a free and open-source file format, and it works very well with both big and tiny data. It's storage-friendly, because it supports compression, which is a very nice feature.
And it's also developer-friendly: it has a domain-specific language to query the data in your structure, basically. It has support for multiple languages, and that means you may use this format regardless of whether the person you're sharing your data with is using Python, Java, or any other language. So it's a very interesting feature. As for Python, we have many libraries; the two most famous are PyTables and h5py, and I'm gonna show you a couple of examples with both of these libraries, just to see the differences.
and I'm gonna show you a couple of examples with both of these libraries, just to see the very difference. So if you want to create a new HDF5 file, we just need to import the module H5Py,
and then we create a new file, and we create a new data set in it. We specify the numbers of elements we want, in that case it's 100 over there, and the type. So we have a new data set object in output, which is, you may see here at the bottom,
but when you have to deal with it, it's basically an MP array, so it's very development friendly. And we may also leverage on the slicing feature here, so we may get the 10th element, or slicing at step of 10.
So we get basically an output, an array, an MP array of the type we specified there. It was integer 32 bits. Actually with these file format, the MP array is tightly integrated. If we're gonna use the other library I mentioned, PyTables, actually PyTables provides you,
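[Editor's note: a minimal h5py sketch of what's described; file and dataset names are hypothetical.]

```python
import h5py

# A 100-element int32 dataset, sliced just like a NumPy array.
with h5py.File("demo.h5", "w") as f:
    dset = f.create_dataset("mydataset", shape=(100,), dtype="i4")
    dset[:] = range(100)
    print(dset[10])    # the 10th element
    print(dset[::10])  # slicing at steps of 10
```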
If we use the other library I mentioned, PyTables, it provides, out of the box, a series of built-in data structures for your HDF5 files: Array, CArray, EArray, VLArray (which stands for variable-length array), and Table. The syntax is quite the same.
In this particular case, at the bottom of the slide, we're creating a new NumPy array, then we're creating a new table, and then we're filling this table, accessing it through dot notation, so it's very convenient.
And we append the records here, which is the NumPy array we created before, specified as an array of records with those types over there: an integer as the first field, and strings of at most 10 characters as the second field. That's very useful, and very easy to use.
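[Editor's note: a hedged PyTables sketch under the same assumptions; file, table, and column names are hypothetical.]

```python
import numpy as np
import tables as tb

# A plain array plus a table of (int32, 10-char string) records.
with tb.open_file("demo-tables.h5", mode="w") as h5:
    h5.create_array(h5.root, "values", np.arange(10), "a NumPy array")
    table = h5.create_table(h5.root, "records",
                            {"id": tb.Int32Col(), "name": tb.StringCol(10)})
    table.append([(1, b"alice"), (2, b"bob")])
```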
The other important feature of HDF5 files is that we may have hierarchies and groups, so we may structure the information in our file. Basically, we start from the root here, and then we may create groups, create datasets, and append those datasets to the groups we created. So we have a specific path to follow when we want to access the data in the structured file we created in the HDF5. Moreover, starting from the file, we may also create a new dataset directly by specifying its path, and then we may access those datasets directly through the path, rather than passing through the group we created. It's very easy to use.
Finally, the last feature I want to show you regards data chunking, which is pretty useful in case you want to do out-of-core, rather than in-core, analytics. The basic idea is: with contiguous datasets, the storage is contiguous; with chunks, you tell the HDF5 file that you want the data stored in blocks, so you can process it chunk by chunk, and that's very useful in case you want to leverage that for processing data in parallel. That's a feature actually supported by HDF5.
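[Editor's note: a minimal chunking sketch; shapes and names are hypothetical.]

```python
import h5py

# The dataset is stored as 100 x 100 blocks instead of one contiguous
# buffer, which suits chunk-wise and out-of-core processing.
with h5py.File("chunked.h5", "w") as f:
    dset = f.create_dataset("data", shape=(1000, 1000),
                            chunks=(100, 100), dtype="f8")
```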
In fact, to show an example here: MPI, through the mpi4py library, is integrated out of the box with the h5py library. In this particular case, in the code, we are modifying the file from multiple processes: we are writing into the dataset, at the rank index, an array of 4 × 1,000 integers. We're basically modifying the dataset with this array, and every process accesses its own specific slice of the dataset in parallel.
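[Editor's note: a hedged parallel sketch; it requires an h5py build with MPI support, and the file name is hypothetical.]

```python
from mpi4py import MPI
import h5py

# Every MPI rank writes its own slice of the shared dataset.
comm = MPI.COMM_WORLD
with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("data", (comm.size, 1000), dtype="i4")
    dset[comm.rank, :] = comm.rank  # each process fills its own row
```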
That's very nice. If you want to learn more about HDF5, I highly recommend this book, and we're also gonna have another talk about HDF5 in more detail; that's gonna be on Friday, I guess, yes. It's gonna be very interesting.
Another binary format I want to show you is one I came across very recently: the so-called ROOT data format. I don't know how many of you here already know about ROOT, but yeah, thank you very much. Actually, ROOT is a framework, a tool, and also a data format; that's why I decided to include it here. Most of the time, it is used for data processing in general, but it's mostly used in physics, and especially if you are in particle physics, that's quite the standard case: you use ROOT for data analysis. It's a great tool, actually. It's written natively in C++, but it has a Python extension, which is sometimes referred to as PyROOT.
And by the way, ROOT 6, which is the latest version of ROOT, ships with a Jupyter kernel, so you may actually leverage the ROOT functionalities inside the Jupyter notebook. It defines a new binary format, the .root format, and the basic idea is that it is based on the serialization of C++ objects.
So that's, at a glance, what ROOT is. As you may see over here, ROOT ships with an interactive shell, just like the Python one, so it's very useful. And you may write a sort of C++ code in the interactive shell, so you basically have a sort of interactive C++, which is interesting from some point of view. And that's the browser, with the file open. Here you may see a very long list of leaves in this particular file, and every time you open a leaf, which would be a data container, you see a histogram, because most of the time, when you open ROOT files, you get histograms of your data, just to see the distribution.
But in case you want to go into more detail and extract the data from the ROOT files, it turns out that you have to write long and boring C++ code, actually, to perform very common operations; basically, you have to access a tree and a leaf.
The idea is that a ROOT file, rather than talking about datasets and groups, like HDF5, talks about trees, branches, and leaves. But the general idea is just the same; that's why I decided to show it to you. The other reason is this very weird syntax I want to show you: here we are accessing a tree, and this is a 2D histogram. We are getting the data from the tree with this expression here, these values with respect to those other values, and we're forwarding the output of this draw into this C++ object H, which is an anonymous C++ histogram, and then we iterate over the entries and the bins of this histogram to get the content, and that's it.
So originally, we would have to write this very awkward C++ code to extract data from this format. Fortunately, we have PyROOT, as already mentioned, and that's the general syntax to do the same in Python. But as you can see, the programming style lacks any Pythonic feature; it's very C++-style. You basically have no naming conventions like the ones we are used to from PEP 8; it seems like we're basically writing C++ code. Fortunately, there are a couple of projects I want to show you and point you to, named rootpy and root_numpy, and I'm gonna show you a couple of examples. They're very nice projects and very easy to use.
So, taking this example using PyROOT, you may leverage rootpy, and we end up writing more Pythonic code. First of all, instead of using the Get function on the TFile to get the tree we want, which was the Monte Carlo tree in that case, we may access the tree directly using dot notation, just like a Python object. It's very nice. Moreover, another very weird thing: in ROOT, when you're going to define a 2D histogram, you have to define the y-axis with respect to the x-axis, which is sort of counterintuitive. They fixed that in the rootpy project, so here you basically specify what's most intuitively expected: the x-axis with respect to the y-axis. And you avoid that weird syntax of moving the output into a weird anonymous object, by just passing an attribute here. You say: okay, I want this draw to be stored in this 2D histogram, which I define here with type 'F', meaning floating-point numbers, instead of the TH2F originally defined in ROOT.
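[Editor's note: a heavily hedged rootpy sketch; the file, tree, and branch names are hypothetical, and the hist keyword of Draw is rootpy's addition as described in its documentation.]

```python
from rootpy.io import root_open
from rootpy.plotting import Hist2D

with root_open("events.root") as f:
    tree = f.montecarlo                 # dot notation instead of Get()
    h = Hist2D(50, 0, 100, 50, 0, 100)  # x binning first, then y
    tree.Draw("x:y", hist=h)            # fill h directly from the tree
```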
Another example uses root_numpy, which is very useful when you want to get the data while avoiding processing those files bin by bin in each histogram. You just say: I want this histogram, I want this tree, and I want all the values in it, with the output as a NumPy array. That's the aim of the root2array function here: we pass the file, the name of the tree, and then the branch we want, and we get a NumPy array as output. And the funny thing is that this library is actually tightly integrated with the PyROOT ecosystem. In fact, once we get this NumPy array, we create a histogram using the original PyROOT library here, then we fill this object using the root_numpy function, and then we draw the histogram again using the original object. That's very nice to use: you basically use the two libraries at the same time without worrying about the details, because it's up to the libraries. And finally, another interesting feature: rootpy ships with a root2hdf5 command-line utility that allows you to convert from the binary ROOT format to the HDF5 format.
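[Editor's note: a hedged root_numpy sketch; file, tree, and branch names are hypothetical.]

```python
from root_numpy import root2array

# Pull one branch of a TTree straight into a NumPy array.
energies = root2array("events.root", treename="events",
                      branches=["energy"])
```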
Okay, that's it for the binary files. We're gonna see, yeah, thank you, we're gonna see another format. I'm gonna go very quickly over this one, because it's very common, and I want to talk about it more from a data-processing point of view, rather than the very specific reasons why, for instance, in web processing, JSON is the format of choice when you have to deal with APIs, rather than XML. The reasons are manifold; one of them is that it's less verbose, of course, and from the Python point of view, it's easier to process, since we're basically dealing with dictionaries and Python lists. In case you were wondering where JSON is used in our context: basically, JSON is the format under the hood of the IPython notebook.
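[Editor's note: a minimal illustration of why JSON is so convenient from Python.]

```python
import json

# JSON maps directly onto dictionaries and lists.
doc = {"terms": ["data", "format"], "count": 2}
text = json.dumps(doc)
assert json.loads(text) == doc
```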
So basically, a Jupyter notebook is a JSON file. But for this talk, I want to talk about JSON because it is the format of choice for document-oriented DBs, the so-called NoSQL DBs. And I want to show you a couple of slides from a test I made, comparing the performance of HDF5 files versus MongoDB, a NoSQL DB. Here, we basically had 100,000 documents,
and those documents were structured; I mean, they were textual documents. The basic idea was to build a sort of information-retrieval index: I wanted to store, for each document, all the terms and the frequencies of the terms appearing in it.
More specifically, I wanted to store the particular zone of the text where each term was gathered. So it's a sort of structured index I wanted to build. Given this idea of structure, I tried to test whether HDF5 could be a possible solution, and what I got was that, from a processing point of view, the HDF5 format is not so appropriate, because it takes more and more time compared to MongoDB, which I implemented in two different versions, actually: flat storage versus compact storage. The difference was in how I structured the JSON objects going through the queries in MongoDB, storing the zone information either explicitly in a nested object or encoded in the terms. Basically, the performance was just the same; it was just a matter of which was easier to deal with, programmatically, I mean.
But if we look at the storage performance, HDF5 with the out-of-the-box Blosc filter, which is a compression algorithm you may leverage, is definitely the solution to go for. So in case you have storage constraints, HDF5 is a great tool. Of course, it's not comparable in terms of efficiency to MongoDB, at least in this very tight case study. And of course, there are many, many things we might optimize that were not part of this example, for instance, the possibility to have data distributed over multiple clusters and stuff like that. Okay, another format of interest for this talk is HDFS; HDFS is a data format for big data. I'm gonna show you a couple of slides taken from a demo notebook by Matt Rocklin.
It's very interesting to notice that there is a library called hdfs3 in the Python ecosystem. HDFS, of course, stands for Hadoop Distributed File System; it's the distributed version of the file system built on top of Hadoop. The data can be organized in chunks and distributed among several machines, and it's basically the de facto standard for big data. In Python, we have this very great library, hdfs3. It works very well on Linux machines; I had some issues making it work on my OS X machine, but on Linux, it works very well. And it wraps a native C++ implementation of HDFS, so there is no Java along the way.
Yeah, very nice; that's the point. So the example is: let's see how we may leverage the analysis of CSV files distributed over the cluster. Here we create a new file system object, an HDFS, and we list this file system; we see all the CSV files we have here. We may read just one file taken from the file system, using read_csv here, and put the data in a DataFrame; in fact, that's it. But more interestingly, we may read all the CSV files with a wildcard here: basically, we're opening all the CSV files matching this pattern, and we're accessing the data using this executor here, which is what gives you the distributed computation. And the very funny thing is that if you execute this in the notebook, the interactivity of the notebook is still available; it's not blocking. That's very nice. When the computation ends, you basically have the data in a familiar format, just like a pandas DataFrame. So it's very easy to use and very nice, and definitely worthwhile looking at when you have to deal with HDFS. And finally, yes, we may also operate on the DataFrame here to filter the data we have, get another DataFrame, and further process our data. That's very nice.
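[Editor's note: a hedged hdfs3 sketch; host, port, and paths are hypothetical.]

```python
import pandas as pd
from hdfs3 import HDFileSystem

# List the CSV files on HDFS and read one into a pandas DataFrame.
hdfs = HDFileSystem(host="namenode", port=8020)
print(hdfs.ls("/data"))
with hdfs.open("/data/part-00000.csv") as f:
    df = pd.read_csv(f)
```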
Since we're dealing with big data here, another mention I would like to make is about columnar databases. Basically, that's the direction in which the big data world is shifting to date: we're moving from the so-called row-based databases, the relational databases, to the columnar ones. So far, there are two families, two kinds, of columnar database: the group A approach, that of Google BigTable, HBase, or Cassandra, which is a data model based on multi-dimensional mapping; and group B, the one chosen by these other tools here, which keeps the relational data model.
So basically, the difference is that you have data organized in columns rather than in rows, and that's very useful when you have to deal with analytics, because most of the time, you end up analyzing data going through columns rather than rows, and that's very efficient. And the tool I want to show you is this one.
It is called MonetDB, and the reason why I'm showing it to you is that it basically ships with built-in Python support; indeed, you have both Python and R built-in support in it. So you may write Python or R code for your analytics inside the database. In fact, the MonetDB types are directly mapped to NumPy arrays: when you have to process columns in your DB, they are transformed out of the box into NumPy arrays, so you leverage NumPy processing in it.
That's very nice to use. For instance, here we are executing a query that returns a table; that's a function directly included inside the DB, so it's working in the DB process. We are creating a new table here that has just one column of float, the language of choice is Python, of course, and we're basically creating a random array of NumPy values and returning the values, and that's it. So basically, you get a MonetDB table as output.
To see it working in a more concrete example: here we have two functions in MonetDB, and we're basically leveraging the functions of scikit-learn here, so we're basically writing Python code in it. Here we compute the confusion matrix for some processing, and then we derive more details, more statistics, from the confusion matrix. We're creating a new table with all the information we want to report: accuracy, precision, sensitivity, specificity, and F1. We are storing all this information in a very Pythonic way here, because it's Python working inside the DB, and that's it.
So we return the value here, and the way we use it is just inside a query, a simple SQL query: we select the values from the two tables in a nested query, passing the values gathered from another query. It's very easy to use.
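[Editor's note: a heavily hedged sketch of MonetDB's embedded Python, driven through the pymonetdb client; connection details and names are hypothetical, and the UDF syntax follows the MonetDB documentation.]

```python
import pymonetdb

conn = pymonetdb.connect(username="monetdb", password="monetdb",
                         hostname="localhost", database="demo")
cur = conn.cursor()
# The UDF body is plain Python; its result is mapped to NumPy arrays
# by the database.
cur.execute("""
    CREATE FUNCTION rvalues() RETURNS TABLE(value FLOAT)
    LANGUAGE PYTHON {
        import numpy
        return numpy.random.rand(100)
    };
""")
cur.execute("SELECT * FROM rvalues();")
print(cur.fetchmany(5))
```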
Of course, this is a very quick sample. I highly suggest you check out the talk from which this couple of slides was taken, about in-database analytics with Python and MonetDB. Yeah, thank you.
Okay, so that's basically the end. A couple of things before closing: some things were missed along the way, basically, and they're more tools than formats, actually, but I want to point you to a couple of tools that are very interesting, very easy to use, and that now belong to the PyData ecosystem. Those tools are xarray and Blaze. Blaze is fantastic; it's basically a sort of one tool for all the formats.
I'm gonna show you a couple of examples in the next slide. And xarray is a sort of extension; you can think of it as an intermediate step between the NumPy structure and the pandas DataFrame, because an xarray is basically a labeled ND array.
So the idea is: I want a multi-dimensional NumPy ND array, but I want to describe the values, the columns, and the rows I have in it, so I want to access the rows or the columns by name rather than just by index. That's the "labeled" part of the array. It's a library based on the so-called netCDF format, which is a quite popular format in case you're in physics, and it's based on what is called a common data model, which basically allows you to integrate HDF files, HDFS, or other formats into one single data format. That's very useful.
Okay, so Blaze: some people in the ecosystem consider Blaze a sort of extension of NumPy, sort of, I guess, because it allows you to do out-of-core processing, which is basically one of the limitations you have when you have to deal with NumPy. In this couple of examples taken from the documentation, you may create the data object from Blaze, which here is basically talking to a database rather than a pandas DataFrame, and for you, it's basically the same when you're dealing with the code.
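[Editor's note: a hedged Blaze sketch; the URI and table name are hypothetical.]

```python
from blaze import data

# The same data() object can wrap a CSV file, a DataFrame, or a
# database URI, and the expression code stays identical.
events = data("sqlite:///events.db::events")
print(events.fields)  # column names, regardless of the backend
```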
In the case of xarray, you may create a DataArray gathering data from a pandas DataFrame rather than a NumPy array, and you basically operate over the data just like a NumPy array.
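[Editor's note: a minimal xarray sketch; dimension and coordinate names are hypothetical.]

```python
import numpy as np
import xarray as xr

# A labeled ND array: dimensions and coordinates get names, so we
# select by label instead of positional index.
arr = xr.DataArray(np.random.rand(3, 4),
                   dims=("city", "month"),
                   coords={"city": ["rome", "berlin", "bilbao"]})
print(arr.sel(city="rome"))
```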
So I think that's it. In conclusion, I would say: complicated data require complicated formats, and complicated formats require very good tools. But fortunately, we have Python and the whole PyData ecosystem to tackle all these problems. So thank you very much for your attention.
Yeah, thank you very much, Valerio. Unfortunately, we don't have any time for questions; the next session's coming up, but I'm sure Valerio will be happy to answer outside. Thank you very much.