Data Analysis and Visualization with Python
Formal Metadata
Part Number | 7
Number of Parts | 59
License | CC Attribution - NonCommercial 2.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/19628 (DOI)
FrOSCon 2014, 7 / 59
Transcript: English (auto-generated)
00:08
Welcome, everybody. Today I will talk about data analysis and visualization with Python. I work at the German Aerospace Center in the Institute of Simulation and Software
00:23
Technology. Before that, I did my PhD in theoretical physics at the University of Bonn. And during that time, I heard a lot of talks about Python and scientific computing with Python.
00:42
But I never really got the chance in my work to do it. So I decided to do a little private project on my own to learn this stuff because I think it's really beautiful and easy to use.
01:01
Yeah, and this is what I'm going to tell you about here today. So first of all, I will give a short introduction to NumPy, which is the basic building block of all scientific Python libraries I know.
01:24
And afterwards, I will show you how to do publication quality plotting with matplotlib. And then I will proceed to Pandas. Pandas is a library for data analysis,
01:41
which was written by Wes McKinney in order to analyze financial data. But you can also do other stuff with it. And in the end, I will show a Pandas use case,
02:00
which was the goal of my project, to analyze my personal expenses, so to find out how much money I spent on food and clothes or something like this. OK, so everything I will tell you here I got from this book.
02:23
It's from the author of Pandas, Wes McKinney. It's a really great book, so I can only recommend it if you're interested. And the next thing I recommend to you, if you haven't already heard about it, is the IPython notebook.
02:47
I will just briefly show you what it is. So as you see, my talk is web-based.
03:04
So this is a browser here, and I actually did this talk, the slides for the talk in IPython. So IPython is just a really nice interface to Python, where you can do nearly everything.
03:21
For example, slides for a talk. Yeah, and you can type in Python commands here. Type import numpy as np, and I will show you later what this is, and you get something out.
03:55
So this is just a brief stop here, and then I will proceed.
04:03
So as I said, I will begin with numpy. So numpy is very good for fast vectorized arithmetic operations, because under the hood, it's written in C.
04:22
And numpy also provides tools for integrating code which is written in C, C++, or Fortran. And as I already said, it's the basic building block of all the scientific libraries in Python. And as I said, since I will show you
04:44
a lot of code in my talk, I import here the NumPy library as np. And in all the following slides, whenever you see np, this is NumPy, of course, because really
05:00
all the code you see is a sequence of commands which are given to the Python interpreter. So the main object in numpy is the array.
05:21
So the array is a container for homogeneous data. As you can see here, you can create it with np.array, where np stands for NumPy. You give it, in this example, a list of lists, and you give it a data type. So you have homogeneous data, because under the hood,
05:41
it's a C array. This is the reason why it's fast. And you get a NumPy array. So what can you do with a NumPy array? For example, you can do vectorized arithmetic operations. So I want to multiply all the numbers in the array by 10.
06:03
And instead of writing a for loop, which runs over the rows and the columns, I just type data * 10. And then I have an array of the same size, in which every number was multiplied by 10.
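The array creation and the multiplication described here can be sketched as follows. The values are placeholders, since the numbers on the slide are not part of the transcript.

```python
import numpy as np

# Hypothetical sample values; the slide's actual numbers are not in the transcript.
data = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)

# Vectorized arithmetic: every element is multiplied, no explicit loop needed.
scaled = data * 10

print(scaled)
print(scaled.shape == data.shape)
```

The result keeps the shape and data type of the input array.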
06:21
You can also do more sophisticated operations, as you can see in the last command on this slide, where I applied the sine function to the data and something else. And yeah, these are vectorized operations. And they are really fast, because as I said,
06:41
already, under the hood, this is written in C. NumPy also provides easy creation and reshaping of arrays. For example, if you want to create an array with two rows and three columns, filled with ones,
07:04
then you can use this command. You can also use arange: you can easily create a range from 0 to some number, and you can reshape it. As you see here, I take this list from 0 to 5
07:23
and reshape it into a two-dimensional object with two rows and three columns. Numpy also provides random number generation, as you can see here.
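A minimal sketch of the creation and reshaping commands mentioned here. The seeded generator is an assumption made for reproducibility; it is a newer NumPy interface than the one available at the time of the talk.

```python
import numpy as np

ones = np.ones((2, 3))               # two rows, three columns, filled with 1.0
grid = np.arange(6).reshape(2, 3)    # the numbers 0..5 laid out as 2 x 3

# Assumption: a seeded generator, so repeated runs give the same numbers.
rng = np.random.default_rng(seed=0)
noise = rng.standard_normal(3)       # three normally distributed numbers

print(ones)
print(grid)
print(noise)
```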
07:41
And the good thing about Numpy is easy slicing and indexing. Here you have an array from 0 to 9. And like in C, you can access the elements of this array with this command. So I want the fourth element of this array,
08:02
and I get it like this. Here you can see slicing. If I want the elements 6, 7, and 8, I can do it like this. And you can also do more fancy slicing. With the last command,
08:23
I get every second element of the array. You can also do this for multi-dimensional arrays. Here I have an array with three rows and four columns.
08:41
And if I want, for example, to get this subset of the array, then I can do it like this. The first slice describes the subset of the rows and the second one that of the columns.
09:02
So I want row 1 and 2, and column 1, 2, and 3, and I get it like this. And you can even do more fancy indexing and slicing. For example, if I want the 0, the 10, and the 11,
09:21
I can do something like this. So I give the index pairs. Not the 0: I want the 1, which is at index (0, 1). And I want the 10, which is at index (2, 2). And I want the 11, which is at (2, 3). And I get it.
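The slicing and index-pair selection described above might look like this. The 3 x 4 array filled with 0 to 11 is an assumption, but it is consistent with the positions named in the talk.

```python
import numpy as np

arr = np.arange(10)
print(arr[3])      # a single element by position (counting starts at 0)
print(arr[6:9])    # elements 6, 7, 8 (the stop index is excluded)
print(arr[::2])    # every second element

# Assumed layout: 3 rows, 4 columns, filled with 0..11.
data = np.arange(12).reshape(3, 4)
sub = data[1:3, 1:4]                   # rows 1-2, columns 1-3
picked = data[[0, 2, 2], [1, 2, 3]]    # index pairs (0, 1), (2, 2), (2, 3)

print(sub)
print(picked)
```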
09:41
So you can get every subset of the array. You can do Boolean indexing. For example, if I want all the data in this array which is greater than 4, I just type data[data > 4].
10:02
And I get all the numbers which are greater than 4. And note that the shape of the array has changed, because the data I cut out cannot be pressed into the same size and cannot be converted to a matrix-shaped
10:28
data object like before. I can also do more math-like stuff. For example, I want to cut out the 0, the 3, the 9,
10:42
and the 6. And I can do it like this. So I take the data modulo 3, and if the result is not 0, the element is given back. So you get an array where the 0, the 3, the 6, and the 9
11:02
are cut out. I can also write to the data like this. As you can see here, I replace the 0, the 3, the 6, and the 9 by the value 100. I will skip this. You can also do linear algebra with NumPy.
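A short sketch of the Boolean indexing steps, again assuming the 3 x 4 array filled with 0 to 11:

```python
import numpy as np

data = np.arange(12).reshape(3, 4)

greater = data[data > 4]           # 1-D result: the 2-D shape cannot be kept
not_mult = data[data % 3 != 0]     # drops 0, 3, 6 and 9

# Assignment through a Boolean mask changes the array in place.
data[data % 3 == 0] = 100

print(greater)
print(not_mult)
print(data)
```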
11:23
So now we come to matplotlib. As I said, this is a library for publication quality plots. And it's highly configurable, because it's aimed at publications: for scientific publications, you want
11:42
to configure every bit of your plot, your figure in your publications. This makes it somewhat difficult to learn. But the idea of matplotlib is
12:00
that you have a MATLAB-like interface, and such an interface is provided. So if you know MATLAB, then you can easily switch to matplotlib. Because I don't know MATLAB, I will not show you that here. I will show you the hard way, in a sense.
12:27
And in the following, I will import matplotlib as plt, or rather the part of matplotlib that I need, I will import as plt. And the first thing I want to do in my IPython notebook
12:42
is the matplotlib magic. So if I type %matplotlib inline, then all the plots will show up inside of my notebook. So inside of the browser.
13:04
And plots in matplotlib all live inside a figure object. So you create it like this: plt.figure(). Then you get the figure object. And you have to put some plots into the figure
13:21
with the add_subplot command. And you have to give this command a grid of subplots. So you have to give the number of rows, the number of columns, and the position in the grid. In this example, I only want one subplot.
13:44
So I give one row, one column, and position one. And I get back an axes object. And this is the object where you plot things into. So if you have more subplots, then you have different axes. And you can specify where you plot your stuff
14:04
by using it like this. So you take the axes object and execute its plot command. And you can see here, again, some NumPy stuff. I created an array x, which runs from 0 to 3 pi
14:28
with 1,000 steps. This is this linspace command. And I want to calculate now the sine x squared of this. And again, I can apply the sine, the numpy sine function
14:44
on top of this array and get a new array y, which now contains the values of the function. And if I then take the axis and execute plot, I get such a thing.
15:02
And you can see this looks really nice, I think. But there is something missing, which is the labels. And the next cool thing about Matplotlib is that you can put LaTeX rendered labels in it.
15:24
So forget about all the stuff above. This is the same as in the last slide. So first of all, I will label this plot with this string here. And if I include these dollar signs here, I can put in LaTeX code.
15:43
And this will get rendered like this, as you can see here. I can also specify the ticks. So on the x-axis, I want to have 0, pi, 2 pi, and 3 pi. And I do it like this. So I set the x ticks by giving a list with 0, pi, 2 pi,
16:11
and 3 pi, as you can see here. And I want to label them with the appropriate labels. And this I do with set_xticklabels.
16:25
And here I also use LaTeX commands, as you can see here, inside this list of strings. I can set a title, "my plot". And I put a legend somewhere.
16:43
So there's this option "best". So there's an algorithm which finds the best location for the legend. And the last thing, I label the x-axis with a LaTeX-rendered x character.
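The labelled plot built up in this part of the talk could be sketched roughly as follows. Reading "sine x squared" as sin(x^2) is an assumption, and so is the legend label. The Agg backend is used only so that the sketch runs without a display; in a notebook, the %matplotlib inline magic would take its place.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 3 * np.pi, 1000)   # 1000 points from 0 to 3*pi
y = np.sin(x ** 2)                    # assumption: sin(x^2), not sin(x)^2

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)         # 1 row, 1 column, position 1
ax.plot(x, y, label=r"$\sin(x^2)$")   # text between dollar signs is rendered as math

ax.set_xticks([0, np.pi, 2 * np.pi, 3 * np.pi])
ax.set_xticklabels(["$0$", r"$\pi$", r"$2\pi$", r"$3\pi$"])
ax.set_title("my plot")
ax.set_xlabel("$x$")
ax.legend(loc="best")                 # let matplotlib choose the legend position
```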
17:03
And I think you can see that with a few commands you get a nice-looking plot for your publication. You can do more. Here I have an example with three subplots.
17:23
As you can see here, I again get the figure object and create three subplots with one row and three columns. And I plot now three different things in all the three subplots.
17:40
The first one is a histogram. Here I give the histogram an array. And this array is a normal distribution, 100 numbers of the normal distribution. I give it 20 bins. I can also specify the color easily and the transparency.
18:04
And the result is shown here. I can also do a scatter plot, where I give it the numbers from 0 to 29 for x and, for y, the numbers from 0 to 29 with a Gaussian distribution on top.
18:22
So I get a spread here. And the scatter plot makes these little dots here. And the last thing here on the right side is, again, a random distribution and the cumulative sum, which is also provided by NumPy.
18:41
And I want to plot it in red with a dashed line. And I can do this very conveniently by giving this string here, 'r--'. And you can also see that the plot command, in this case, only needs y values.
19:02
And the x values are just the numbers from the index of this array. So from 0 to 59. Yeah? Are the plots exported as PDFs?
19:20
You can export it in different formats. I'm not sure what's possible. But I think you can do PDF, SVG, all the things you need for scientific publications. What I'm concerned about is: are the fonts in the plot the same as in the paper? Or do I have to configure my Python
19:43
in the same way that I configure my LaTeX and replicate all that configuration code? You mean if you would do a publication in LaTeX? Or what do you mean? I'm writing a paper in LaTeX, right? And then I specify that I want a particular font. OK, I see. Yeah, as I said, I have never done this.
20:02
I'm not sure about that. Yes? So you replicate all of my books and packages
20:20
and use IOT for a long time, so I can read all my LaTeX preamble. Does it only accept LaTeX in?
20:47
That's a good question. I'm not sure about that. Yeah, but that's math mode. The question aims, can you include normal LaTeX text?
21:04
Yeah, I'm not sure. I didn't write it.
21:29
Yeah, but not the plot, not the labels in the plot. I mean, the question was, for example, here, if I want this my plot title written in LaTeX, right?
21:44
In LaTeX form, if I can render this. But I'm not sure if this is possible. We can try it out later, maybe, if there is time. I can just switch to the notebook. OK, so now I will come to Pandas.
22:06
So as I said, Pandas is a library for data analysis. And the first obvious thing that you will see is that the data structures are exactly like NumPy arrays,
22:25
but there is also an index label and a column label. And you have support integrated for time series, which is especially important for my problem,
22:40
because I have expenses at different dates, and I want to analyze them. Pandas can also handle missing data, which NumPy can't. And it has functions for sophisticated data transformation. In the following, I will import Pandas as pd, as shown here.
23:06
So the main objects in Pandas are series and data frames. If I have one dimensional data, I use a series. As you can see here, I create a Pandas series
23:20
by giving it a NumPy array of three random numbers. And I also provide an index, which is a list of strings in this case, but I can also use other types, not only strings, but also integers or floats.
23:40
And if I print it, you can see that there is this index, and the data, and the data type. So that's a series. And the other thing is the data frame for multi-dimensional data. You can imagine this as an Excel spreadsheet.
24:04
So you have two-dimensional data, and you have an index, and you have named columns. And you create it like this. You call pd.DataFrame and give it a NumPy array. So again, I create six random numbers,
24:23
reshape them into a two-dimensional object with two rows and three columns. Then I give the column names Alice, Bob, and Charles, and I give the index names one and two. So this is the data frame object, which is the main data
24:46
object in Pandas. And what can I do with it? First of all, I can select columns in this spreadsheet. For example, by giving a label inside these brackets.
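The two main objects and the column selection can be sketched like this. The Series index labels and the choice of the "Bob" column are placeholders, and the integer row labels 1 and 2 are an assumption about the slide.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # seeded only for reproducibility

# One-dimensional data with an explicit index (labels are placeholders).
s = pd.Series(rng.standard_normal(3), index=["a", "b", "c"])

# Two-dimensional data with named columns and a labelled index.
df = pd.DataFrame(
    rng.standard_normal(6).reshape(2, 3),
    columns=["Alice", "Bob", "Charles"],
    index=[1, 2],
)

bob = df["Bob"]     # selecting a single column gives back a Series
print(s)
print(df)
print(type(bob))
```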
25:05
So I want to go back for a moment. I want this column here, and I can do it like this. And you can see that this gives me back a series. So if I want to know what is the type of this object,
25:22
you can see that this is a series. It's a Pandas series. You can get all the subsets in this data frame with the ix command. And the good thing about it is that you can not only
25:43
use indices, but also labels. As you can see here, I will go back for a moment. I want to get this row here, the row one. And I can give the ix command in the first argument,
26:06
label one. And in the second argument, I give it indices. If you know Python, this symbol indicates that you get all the indices in the array.
26:21
And indeed, this gives me back this row. But now the column names have become the index labels. So now I have Alice, Bob, and Charles as index labels here. And there is also a name attribute
26:44
in the series, which is then, of course, the former index label one. I can also do, as I said, I can also do index-based slicing.
27:01
So here I get the zeroth row and the zeroth and second elements in the column. So I get a series of two elements. What I can also do in pandas is function application.
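The ix indexer used in the talk was later deprecated and then removed from pandas. A sketch of the same selections with the current indexers, loc for labels and iloc for positions, assuming a small frame with placeholder values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(2, 3),          # placeholder values 0..5
    columns=["Alice", "Bob", "Charles"],
    index=[1, 2],
)

row = df.loc[1, :]           # label-based: the row labelled 1, all columns
pair = df.iloc[0, [0, 2]]    # position-based: row 0, columns 0 and 2

print(row)
print(row.name)              # the former index label becomes the Series name
print(pair)
```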
27:20
So I want to apply some function on every element in the data frame. And I can do this with the apply function. So I define a function which adds 100 to all the elements. And I just apply it onto the data frame. And I get a data frame with all the numbers increased by 100.
27:45
There are also built-in functions for statistics, like sum or mean.
28:00
So sum adds up each column and gives you back a series with the index Alice, Bob, and Charles. And the data is now the sum of each column.
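A minimal sketch of function application and the column statistics, using placeholder values. Note that apply hands each column to the function as a Series; for an element-wise operation such as adding 100 the result is the same.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(2, 3),          # placeholder values 0..5
    columns=["Alice", "Bob", "Charles"],
    index=[1, 2],
)

def add_100(x):
    """Add 100 to whatever is passed in (here: one column at a time)."""
    return x + 100

shifted = df.apply(add_100)
totals = df.sum()     # one sum per column, returned as a Series
means = df.mean()     # one mean per column

print(shifted)
print(totals)
print(means)
```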
28:20
Yeah. The next thing you can do is merge data of different shapes. And this is a really cool thing about pandas. So here you can see two different data frames with different dimensions. But they share a key. Now here I have data one and keys, A, A, B, A, C.
28:44
And here I have data two also with keys. And if you want to merge them somehow, what would be the natural thing to do? So they share a key.
29:00
Therefore, if I want to merge them, I get something like this. I have now three columns, data two, data one, and a key, which was shared between these two data frames.
29:24
Sorry, I forgot something. The special thing about this data frame is that there is a one-to-one correspondence between the data and the key. So pandas will recognize this, and will recognize
29:45
this one-to-one correspondence. So whenever there is a key A, data two is three. And this is exactly what happens when you merge this: whenever the key is A,
30:01
then data two is three. So it's somehow intelligent merging. And there is a lot of stuff like this. So there are really powerful data merging mechanisms included in pandas.
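The merge on a shared key can be sketched as follows. The keys of the first frame and the pairing of key a with the value 3 follow the talk; all other values are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"data1": range(5), "key": ["a", "a", "b", "a", "c"]})
# Each key occurs once here; only "a" -> 3 is taken from the talk.
right = pd.DataFrame({"data2": [3, 4, 5], "key": ["a", "b", "d"]})

# Inner join by default: only keys present in both frames survive.
merged = pd.merge(left, right, on="key")
print(merged)
```

With these inputs, the rows with key c and key d disappear because each occurs in only one frame. Passing how="outer" would keep them and fill the gaps with NaN.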
30:21
You can also concatenate data. Here you can see two data frames of different shape. Here you have four columns, and here you have three columns. And if I want to concatenate them, I will glue them together like this.
30:43
And I get something like this. You can see here that despite the fact that these two source data frames had different dimensions, the concatenation works. But because there was no data in the fourth column
31:02
of the second data frame, it fills in NaN, not a number. I still have 25 minutes, right? OK, good. OK, and here you can see another cool thing
31:24
about pandas, which is the handling of missing data. So it will just infer that this is the right thing to do. So I want to concatenate two things which are not compatible at first sight, but you can do it with pandas.
31:40
And it will just fill in missing values with NaNs. Now maybe I have two rows with the same data, and I want to drop one of them.
32:00
You can also do it very easily. You can just use the drop_duplicates routine, and you give it as an argument the name of the column as a string, so in this case, A.
32:22
So it will check whether A is equal in two rows, and if so, it will drop one of them. And this is exactly what happens here. And also, you can see that it keeps the one with the number here in the fourth column.
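A sketch of concatenation with a missing column followed by duplicate removal. The column names and values are invented; only the shapes (four columns versus three) follow the talk.

```python
import pandas as pd

first = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
second = pd.DataFrame({"a": [2, 9], "b": [4, 10], "c": [6, 11]})   # no column "d"

# Rows are stacked; the missing column "d" is filled with NaN.
stacked = pd.concat([first, second], ignore_index=True)

# Rows with an already seen value in "a" are dropped; the first occurrence is kept.
deduped = stacked.drop_duplicates(subset="a")

print(stacked)
print(deduped)
```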
32:42
OK. For my problem, it was really important that I can analyze time series. And this is also completely included in pandas. Here, I take the data frame from the slide
33:01
before and rewrite the index. So the index now comes from the datetime library. So I import datetime and create a new index by giving a list of datetimes, as you can see here,
33:22
and passing it to the index of the data frame object. I also give a name for the index, dates, and a name for the columns. And you can see that it now appears here, the name for the index and the name for the columns.
33:42
And indeed, if I look at the 0th entry of the index array, you can see that this is a timestamp of the type datetime. Yeah.
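Replacing the index with dates might look like the sketch below. The frame contents, the dates, and the name given to the columns are placeholders.

```python
from datetime import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# A list of datetime objects is converted into a DatetimeIndex.
df.index = [datetime(2014, 1, 1), datetime(2014, 1, 2), datetime(2014, 1, 3)]
df.index.name = "dates"
df.columns.name = "columns"    # placeholder name for the column axis

print(df)
print(type(df.index[0]))       # a pandas Timestamp, which subclasses datetime
```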
34:02
The next thing is plotting this data. In contrast to NumPy, pandas provides plotting routines. So on top of matplotlib, pandas provides more convenient routines.
34:22
If you want to plot this data, you have already an index of dates. You have a name for this date, for this axis, and a name for this axis. And it would be nice if you can plot it and don't have to give the axis names and the labels.
34:42
This is exactly what pandas does. So if I take this data frame, df3 now, and type plot, then it automatically does all the labeling for me. As you can see here, I had four columns. I had even one column with NaNs in it, but it works.
35:05
So you have a plot. You have the dates as labels on the x-axis. You have the x-axis label. And you also have the label of the columns here
35:24
with just one command plot. And this is really nice about pandas. So you do not have to, if you are lazy, you can just hit plot and everything is there. Okay. And now I will come to the use case.
35:42
So why I did all this stuff. So I wanted to analyze my personal expenses. And they look like this. If you get an account statement from your bank,
36:01
you get a date, the amount of money which was transferred, and some description. And I anonymized my data here. So this is my data, but somehow anonymized.
36:21
And the first thing you have to do is categorize all this stuff, because I want to know how much money I spent on different things. And for this I've written a small tool which is called PyAccount, which does this.
36:41
So it reads in your account statements from your bank and asks you, it prints out the description and asks you which category is it. And you can filter it. And then you can create filter lists. So every time there is ID in the string here,
37:04
then it's food, for example. I will not show you how this works in detail, but the result is this. So on top of the data from the bank, which are these three columns, I get the fourth column,
37:23
which is the category. So all the items in this data are now categorized. And the first thing I want to know is how much money did I spend on some category.
37:42
And Pandas provides a very useful tool for this. And this is the groupby object. So I have this data frame from the slide before. And I call groupby on category and calculate the sum. And I get exactly what I want in one step.
38:01
So the row index is now the category. It's named correctly. And I get the sum of all these categories in the first column.
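The grouping step can be sketched with a few invented bookings. The column names value, description and category are assumptions based on the description of the account statement; the talk's real data is not in the transcript.

```python
import pandas as pd

# Invented bookings standing in for the categorized account statement.
expenses = pd.DataFrame(
    {
        "value": [12.5, 40.0, 7.5, 30.0],
        "description": ["bakery", "fuel", "supermarket", "repair"],
        "category": ["food", "car", "food", "car"],
    },
    index=pd.to_datetime(["2014-01-03", "2014-01-10", "2014-02-02", "2014-03-15"]),
)

# Sum the numeric column per category; the category becomes the row index.
group_sums = expenses.groupby("category")[["value"]].sum()
print(group_sums)
```

Selecting the value column before summing keeps the text column out of the aggregation.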
38:21
And again, I can plot this really easily with Pandas. So first, what I have to do now: I don't need this description anymore. So I have to get the subset, the Series subset, which is the first column.
38:42
And I do this with the ix command. So I have this data frame, group sums from the slide before. I use this ix command and give it a list of categories in the first argument.
39:02
So I only want the categories car, cash, food, kids, media, restaurant, sports. And the second one is value, because I only need the value column. So this pie data on the left-hand side is just a Series, which now contains the data I want.
39:23
And I can plot it just with plot. And I want to make a pie chart. So I say kind='pie'. And I give it a title. And it does everything for me. So I have a nice picture with one command. And I can see that I eat a lot.
39:47
Yeah, as I said, this is anonymized. It's not really my data. And it changes every time I create the slides,
40:00
because it's random. OK. But the data here shows the expenses on food over the whole time span. Of course, I want to analyze how much did I
40:23
pay for food in January, or did it change over the months, or something like this. And therefore, I can restrict the data to a specific time span. And this is also a very cool feature of pandas. I can do it like this.
40:41
I can give it just a string, which gets recognized as a date. And I want to slice out from 1st February to 1st April, and do the same groupby category and sum routine. And now I get the same kind of data.
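Restricting the data to a time span with date strings could look like this. The bookings are invented, and the sketch uses loc for the slice, which is an assumption about how one would write it today rather than the exact command from the slide.

```python
import pandas as pd

# Invented bookings with a sorted date index.
expenses = pd.DataFrame(
    {
        "value": [10.0, 20.0, 50.0, 5.0, 30.0],
        "category": ["food", "food", "car", "food", "car"],
    },
    index=pd.to_datetime(
        ["2014-01-15", "2014-02-10", "2014-03-05", "2014-04-01", "2014-04-20"]
    ),
)

# Strings are parsed as dates; both ends of the slice are included.
window = expenses.loc["2014-02-01":"2014-04-01"]
window_sums = window.groupby("category")["value"].sum()

print(window)
print(window_sums)
```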
41:01
But now you can see the numbers are smaller, because it's only for two months instead of six months. The data before was for six months. I can also look at the expenses over the time
41:21
with the group by mechanism. So I take the data frame, group by category, and get now only one group, the food group in this example. And I get a data frame, which now only contains the category food. And again, I can plot it in a nice way.
41:40
Here you can see, again, this convenient plotting style argument. I want to have dots and lines, and it's just 'o-'. And I get a nice plot with the right labels.
42:01
OK, this plot doesn't say that much, but you can do it for other stuff. And this can be interesting. OK, so as I said, I want to get the monthly expenses
42:22
of food or car or something like this. And therefore, I have to sum up all the expenses in a given month and aggregate it. And this can be done in this way. So I go back here.
42:42
So I have food in this case. And now I want, for example, this is both December, so I want to sum this up here. And this can be done by this resample routine. So I say resample, this M stands for month.
43:01
And how='sum'. You can also do mean or something else. But I want sum, so I do it like this. And I get a series where now I always have the end of the month in the date row. And I have the sum for the monthly expenses
43:22
on food in the first column. But I don't want to have specific dates here; I want to have time spans. So the data corresponds not to this day, 30th of November, 2013, but to November 2013.
43:48
And therefore, in pandas, you can easily convert time stamps to periods. And this is just done with this single command.
44:00
So you take the data frame and call to_period. And then you get the same data, but now here you have another data type, which is not a time stamp anymore, but a time period.
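A rough sketch of the monthly aggregation and the conversion to periods follows. The talk passes the aggregation as a `how` argument to `resample`, which matches older pandas releases; newer releases chain `.sum()` after `resample` instead, which is what this sketch does. It also passes an offset object rather than the string alias for months, since that alias has changed between versions. The numbers are invented.

```python
import pandas as pd

# Hypothetical food expenses indexed by purchase date.
food = pd.Series(
    [30.0, 25.0, 15.0, 40.0, 35.0],
    index=pd.to_datetime(
        ["2013-11-10", "2013-12-05", "2013-12-20",
         "2014-01-15", "2014-02-20"]
    ),
    name="amount",
)

# Monthly sums, each labelled with the last day of its month.
monthly = food.resample(pd.offsets.MonthEnd()).sum()

# Relabel the month-end timestamps as monthly periods.
monthly_periods = monthly.to_period("M")

print(monthly)
print(monthly_periods)
```

After the conversion the index is a `PeriodIndex`, so each value belongs to a whole month rather than to one specific day.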
44:21
And now I have what I want. I have the monthly expenses of food over six months. I only show four here. OK, and again, I can plot it. So I just type plot.
44:41
And I get a nice plot, and I can see what I did. Yeah, you can do fancier stuff. For example, calculate the relative change in the monthly food expenses by taking the monthly food
45:02
expenses and dividing them by the data shifted one month earlier with this shift(-1). And then you subtract one to get a relative change with just one command.
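The relative-change calculation can be sketched as below, on invented monthly sums. The sign passed to `shift` decides the direction of the comparison: `shift(1)` lines each month up with the previous one, while `shift(-1)`, as mentioned in the talk, lines it up with the following one. Both variants are shown so the difference is visible.

```python
import pandas as pd

# Hypothetical monthly food expenses.
monthly = pd.Series(
    [30.0, 40.0, 40.0, 35.0],
    index=pd.period_range(start="2013-11", periods=4, freq="M"),
)

# Change relative to the previous month (first entry has no predecessor).
vs_previous = monthly / monthly.shift(1) - 1

# Change relative to the following month (last entry has no successor).
vs_next = monthly / monthly.shift(-1) - 1

# pandas also ships a shortcut for the previous-month variant.
shortcut = monthly.pct_change()

print(vs_previous)
print(vs_next)
```

The first variant is the one usually meant by "relative change over time"; which one the slide used cannot be checked from the transcript alone.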
45:21
And you can see that it gives you a series, and you can plot it. And you can do much more with pandas, of course. And the last thing: I don't want this just for food. I want this for a lot of categories, to see how they changed over time.
45:42
And so I want something like this. Here you can see that with these commands, I get what I want. So I create a new data frame, me.
46:02
And now I loop over all these categories that I am interested in. Then I use this group by mechanism to get the data for the specific group. And then I resample it in the same way with month and sum.
46:22
And again, here I do this transformation to periods in one step with this kind argument. And here I append each column to this whole data frame in the last command.
46:41
And if I now show the data frame, I have exactly what I want. I have the expenses for car in November, December, January, February, and so on. And again, this can be plotted with a single command in a nice way. This time I type plot with kind 'barh', a horizontal bar plot.
47:07
I want it stacked, so on top of each other. And I give a transparency value. And I don't have to label anything. Everything is there. And I now see where my money went in a really nice way.
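The loop over categories might be sketched like this. The data, the column names, and the selected categories are assumptions. Instead of appending columns one by one, as in the talk, the sketch collects the per-category series in a dict and builds the data frame once, so that months missing from one category are not dropped.

```python
import pandas as pd

# Hypothetical expense records; column names and values are assumptions.
df = pd.DataFrame(
    {
        "category": ["food", "car", "food", "food", "car", "food"],
        "amount": [30.0, 120.0, 25.0, 15.0, 80.0, 40.0],
    },
    index=pd.to_datetime(
        ["2013-11-10", "2013-11-20", "2013-12-05",
         "2013-12-20", "2014-01-15", "2014-01-25"]
    ),
)

categories = ["food", "car"]  # hypothetical selection of interest
grouped = df.groupby("category")

columns = {}
for cat in categories:
    amounts = grouped.get_group(cat)["amount"]
    # Monthly sum per category, relabelled as monthly periods.
    per_month = amounts.resample(pd.offsets.MonthEnd()).sum()
    columns[cat] = per_month.to_period("M")

# One column per category, one row per month.
monthly = pd.DataFrame(columns).fillna(0.0)

print(monthly)
```

With matplotlib installed, `monthly.plot(kind="barh", stacked=True, alpha=0.5)` would give a stacked horizontal bar plot; the transparency value 0.5 is an arbitrary choice here.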
47:27
OK, I hope I've shown you that pandas is a really nice tool for data analysis.
47:41
And it's really accessible and easy to use. Questions?
48:10
The question was how the performance is for big data. I heard it's quite good. But I'm no expert in this field.
48:20
Because as I said, I did this as a private project for small data. But I heard that this is a topic at the PyData conferences or something like this. I think it's also useful for big data. Yes?
48:51
Yeah. OK, so the performance is good. Yeah.
49:00
Yes. I don't know R, so I cannot answer the question. I just know Python, and I love Python. I cannot answer it. It's just a matter of taste, I think.
49:25
I mean, the good thing about Python and this IPython notebook is that it aims to be a tool for the whole scientific workflow, so that you can do everything. You can do programming.
49:40
You can do scientific publications. You can write a paper in it. This is the goal of the IPython notebook. I think it is one of the goals. And yeah, it looks really good. It's not there yet. But in my experience, maybe this
50:01
is a reason to use Python for data analysis. Yeah. You don't have to do it.
50:21
I just. OK, it's still a library, right? And so Python doesn't really understand that all of these operations are vector operations. And so as an interpreter, it's unable to go ahead and do useful operations such as combining things or anything that's typical of parallel operations like strength reduction and pulling things up
50:42
and pushing them down in a hierarchy in order to optimize it. And so your argument that it's so fast, it's like C, doesn't seem to make sense to me, right? It's fast like C as long as you're talking about very simple operations.
51:01
Yeah, OK. I mean, pandas and NumPy do not cover everything, of course. But I think then you can use Cython or something like this. Yeah, but it's still in there.
51:25
Yeah, maybe. I'm going to respond to what you said. Yeah, yeah. OK.
51:49
Yes? I have a big thought.
52:03
You have to control the process for how the data is live in the network. I am not sure. I'm really new to this stuff. Should we then do an array? Yeah. Are there any other?
52:53
Then you have to get the arrays from genius.
53:01
Is that the buffer? Yeah. You get the array. You probably start losing performance if you're writing about first class operations. I mean, if you start doing more complicated things, you might be in one vector with another vector. And you've got tensors and all sorts of weird things like that, your performance is going to take a really bad hit.
53:23
You know, it's not, it couldn't seem to perform. As you write C code, it's not the same as C code. That means making kicks. And why? Who wants to know that tempo? I mean, you don't. But you're right.
53:41
You can only do, like, the array manipulations. When they're ready to do it through a loop, it's just rather a loose thing, right? So if you have Python for loops over arrays, and you go to a million, it gets really, really slow, and then you need to go to something like Cython, which is none of ours.
54:00
It's yellow. This is a certain type of performance. It's very good. A lot of type of performance you might have in your mind. You might have to get the performance. But you always have to measure. Because you can't really tell from the beginning. It's not that easy to say against the robot's throat. You have to choose things and measure them.
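The advice to measure rather than guess can be illustrated with a small timing sketch. It compares a Python-level loop over a NumPy array with the vectorized `sum` method; the array size and repeat count are arbitrary, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

values = np.arange(100_000, dtype=np.float64)


def loop_sum(arr):
    """Add up the elements one by one in a Python-level loop."""
    total = 0.0
    for x in arr:
        total += x
    return total


def vector_sum(arr):
    """Let NumPy do the summation in compiled code."""
    return float(arr.sum())


# Time a few repetitions of each variant.
loop_time = timeit.timeit(lambda: loop_sum(values), number=3)
vector_time = timeit.timeit(lambda: vector_sum(values), number=3)

print(f"loop: {loop_time:.4f}s, vectorized: {vector_time:.4f}s")
```

Both functions return the same result; only the measured time tells which approach is fast enough for a given data size.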
54:21
You can say, this is much slower than mine. I don't think it's that slow. Those people, the guy who wrote it, he's very, very good at performance. He's running all the time. I think every time he's in operation, he's pretty good at performance. But I like this.
54:40
Diversity, you have the right to see things. Now, they work. Sizer. Okay.
55:01
I think there are no more questions. Then thank you again.