STAR: a Python Pandas dressing
Formal Metadata

Title of Series: FOSDEM 2013 (Part 17 of 90)
Number of Parts: 90
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/40324 (DOI)
Language: English
Transcript: English (auto-generated)
00:01
My name is Marco, and I'm presenting a small project I've been working on over the last month. The idea is to build a layer on top of Python Pandas; let me first spend a few words describing what Pandas is, for those who don't know it. Pandas is a Python library
00:24
for doing data analysis. It is, of course, written in Python, it is based on NumPy, and it is modeled on R. R is a quite popular language
00:45
for data analysis, and the problem with R is that it is very specific. So Python Pandas is a library that gives you a more general-purpose language to work with. It is heavily inspired
01:05
by R. The main data structure is the data frame, which is basically a heterogeneous matrix from the data type point of view: it can hold different data types in different columns.
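For instance, a minimal sketch of such a heterogeneous data frame (the column names and values here are invented for illustration):

    import pandas as pd

    # Each column holds its own data type: strings, integers, floats.
    df = pd.DataFrame({
        "city": ["Brussels", "Ghent", "Antwerp"],
        "population": [1191604, 248242, 511771],
        "area_km2": [161.4, 156.2, 204.5],
    })
    print(df.dtypes)  # object, int64, float64 -- one dtype per column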
01:22
Pandas also introduced some improvements to the data frame, namely multiple-level indexes and some optimizations for working with time series. There are many pros to using Pandas. Good performance, because the critical code is written in C, so it's very fast when
01:49
doing merge operations, group-by operations and, of course, all the numerical functions. It has intelligent data alignment, because different columns are actually held in different
02:08
arrays: each one has its own data type, and the alignment between them is kept consistent.
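To illustrate what data alignment means in practice, a small sketch (the labels are invented): arithmetic between two series is matched on index labels rather than on position:

    import pandas as pd

    a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
    b = pd.Series([10.0, 20.0], index=["y", "z"])

    # Values are matched by label, not by position; "x" has no
    # counterpart in b, so the result there is NaN.
    print(a + b)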
02:22
But it is still quite hard to explain to someone who's not a computer scientist, someone without a computer science background, how to work efficiently with such a tool. So this is the small vicious circle I found myself in: those who are interested in data analysis usually don't know how to do it.
02:45
These are journalists, company managers, and such. So they ask statisticians to do the analysis, but statisticians usually don't have the right tools to do so, and they have
03:01
to ask computer scientists to help them. If the communication fails, this is the result: a good analysis, but... what was the question? It actually happened to me. So the goal was to build a layer on top of Pandas to make things easier and to make
03:24
the analysis more scalable and reproducible. What do we have to do? We have to define which columns in a data frame should be used as dimensions, which columns are calculated
03:43
from other columns, and introduce do-what-I-mean behavior in some functions, specifically in merging operations and in aggregating operations, which we call roll-up.
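Star's own roll-up call is not shown here, but underneath, a roll-up of this kind corresponds to a plain Pandas group-by plus aggregation; a minimal sketch with invented data:

    import pandas as pd

    sales = pd.DataFrame({
        "city":    ["Brussels", "Ghent", "Amsterdam", "Utrecht"],
        "country": ["BE", "BE", "NL", "NL"],
        "amount":  [120, 80, 95, 60],
    })

    # Rolling up from the city level to the country level:
    # drop the finer dimension and aggregate the numeric column.
    by_country = sales.groupby("country", as_index=False)["amount"].sum()
    print(by_country)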
04:02
Then there are some statistical functions, mainly used with time series, that are almost always the same, even though they are quite complex to implement for someone without a computer science background.
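A typical example of such a function is a moving average, which is one line with today's Pandas API (a sketch with synthetic data):

    import numpy as np
    import pandas as pd

    ts = pd.Series(
        np.random.default_rng(0).normal(size=30),
        index=pd.date_range("2013-01-01", periods=30, freq="D"),
    )

    # A centered 7-day moving average: trivial here, but not obvious
    # to implement by hand without a programming background.
    smoothed = ts.rolling(window=7, center=True).mean()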
04:23
And finally, there is no point in data analysis if we cannot produce reports from it, if we cannot present the results in a human-readable way. So the project basically consists of a metadata structure around the Pandas data frame. This metadata describes what I just said:
04:46
it gives a role to each column of the data frame, and an engine tries to figure out what exactly the user wants to do with this data.
05:05
So this is the basic first step to use Star. With Pandas, I can load data from many different types of files or databases; this
05:22
is just one case, loading from a CSV file. Then I put my data frame inside a Star object, and the Star framework adds default metadata, or I can pass a metadata structure that I have built beforehand. This metadata is basically built with Python
05:48
dictionaries. In this dictionary I describe the type of each column. Dimensions are backed by Pandas indexes, but the user doesn't need to know anything about this: the system defines the indexes when it's
06:05
the right time to do it. Numeric columns, well, those are quite obvious. Immutable values are labels or other values that are always the same for a specific dimension. And finally, elaborations are columns that are calculated
06:24
as a function of other columns; when I do further elaborations on the raw data, these must not get messed up, so they should be re-evaluated when the original data changes.
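The exact Star interface is not spelled out in the talk, so the following is only a hypothetical sketch of what such a metadata dictionary might look like; the module name, constructor, role keywords, and column names are all invented for illustration:

    import pandas as pd
    # from star import Star  # hypothetical import; the real name may differ

    df = pd.read_csv("sales.csv")

    # One entry per column, assigning it a role (keywords invented):
    metadata = {
        "city":     {"role": "dimension"},
        "year":     {"role": "dimension"},
        "amount":   {"role": "numeric"},
        "currency": {"role": "immutable"},    # same label for a whole dimension
        "gross":    {"role": "elaboration",   # recomputed when inputs change
                     "func": lambda d: d["amount"] * 1.21},
    }

    # star = Star(df, metadata)  # or Star(df) to get default metadata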
06:43
This is the basic idea behind dimensions. It's a typical star structure, in which the data are in the center, and dimensions, definitions,
07:01
and levels are linked together in a star shape. Typically, this is used for aggregation operations and to change the data level. The typical case is city, country, and region,
07:21
which can be switched easily with just one function call. Elaborations: in Pandas, as I said, a new column can be defined as a function of other columns; this is just a basic example. With Star, the same thing is defined with a slightly different syntax, which
07:45
tells the system how the column should be re-evaluated every time the original values change, or when I do an aggregation that uses these values.
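The plain Pandas version of such an elaborated column looks like this; note that it is computed once, at assignment time, which is exactly the staleness problem the Star syntax is meant to solve (data invented):

    import pandas as pd

    df = pd.DataFrame({"price": [10.0, 20.0], "qty": [3, 5]})

    # Computed once: if "price" or "qty" change later, "total" silently
    # goes stale -- Star's elaborations re-evaluate it instead.
    df["total"] = df["price"] * df["qty"]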
08:04
And finally, there is a small reporting engine. This reporting engine lets you define templates in LaTeX or HTML, in which you leave placeholders for data representations. Data can be represented in tabular form
08:21
or as graphs, and there is a plan to build an automatic text generator. Graphs in particular are interesting, because I tried to leave room to build more modules based on Matplotlib and produce different types
08:44
of graphs. For now there are just simple kinds of graphs, like line plots, bar graphs, and scatter plots, but more can be added quite easily if you know the Matplotlib API. This is the basic usage.
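The kinds mentioned map directly onto basic Matplotlib-backed calls; for example, with today's Pandas plotting interface (data invented):

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.DataFrame({"year": [2010, 2011, 2012], "amount": [100, 140, 120]})

    # The three kinds mentioned in the talk:
    df.plot(x="year", y="amount", kind="line")
    df.plot(x="year", y="amount", kind="bar")
    df.plot(x="year", y="amount", kind="scatter")
    plt.show()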
09:06
You call a Python script, the Star reporting engine, pointing it at a folder that contains both the template, in LaTeX or HTML, and the data, in the form of Python pickles.
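A sketch of that workflow; the script name and its arguments are guesses, since the talk does not show the exact invocation, but the pickling part is standard Python:

    import pickle
    import pandas as pd

    # Save the object the report needs as a pickle inside the report folder.
    df = pd.read_csv("sales.csv")
    with open("report_folder/sales.pkl", "wb") as f:
        pickle.dump(df, f)

    # Then run the reporting engine on that folder, something like:
    #   python star_report.py report_folder/
    # (hypothetical command; the real entry point may differ)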
09:22
So you save your Star objects as Python pickles in a folder, and you can easily produce a report. Or, if you want to produce multiple reports with different data sets but always the same template, you can just
09:43
access the API of the Star reporting engine directly. What do I want to do from here? First of all, as I said, the automated text generator should be implemented, maybe
10:01
using scikit-learn or some other machine learning tools, I actually don't know yet. Then more plotters, many more plotters; elaborated record handling, which is much the same thing as elaborated columns, but should work on records instead of columns; and maybe integration with statistical modeling tools, to let the user define statistical
10:28
models to apply to the data. And that's pretty much it. Thank you very much for listening. Are there any questions?
10:57
The idea is something that analyzes the data and produces a descriptive text of it.
11:05
A short descriptive text of it. Actually, there is already a small feature that does this, but it's very simple: you define some ranges, and it tells you whether the data fall in a good range, a medium range, or a low range. That's it.
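That range-classification step can be sketched with plain Pandas (the bins and labels are invented):

    import pandas as pd

    values = pd.Series([3.2, 7.8, 5.1, 9.4])

    # Classify each value into user-defined ranges:
    labels = pd.cut(values, bins=[0, 4, 7, 10],
                    labels=["low", "medium", "good"])
    print(labels)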