
Data Analysis and Visualization with Python


Automated media analysis

Welcome, everybody. Today I will talk about data analysis and visualization with Python. I work at the German Aerospace Center, at the Institute of Simulation and Software Technology, but before that I did my PhD in theoretical physics at the University of Bonn. During that time I heard a lot of talks about Python and scientific computing with Python, but I never really got the chance to work with it. So I decided to do a private project on my own to learn this stuff, because I think it's really beautiful and easy to use, and this is what I'm going to tell you about here.

First of all, I will give a short introduction to NumPy, which is the basic building block of all the scientific Python libraries I know. Afterwards I will show you how to do publication-quality plotting with matplotlib, and then I will proceed to pandas. pandas is a library for data analysis which was written by Wes McKinney in order to analyze financial data, but you can also do other stuff with it. In the end I will show a pandas use case, which was the goal of my project: to analyze my personal expenses, to find out how much money I spend on food and clothes or something like this.

Mainly everything I tell you here I got from this book. It's from the author of pandas, Wes McKinney. It's a really great book, so I can only recommend it if you're interested. The next thing I recommend to you, if you haven't already heard about it, is the IPython notebook. I just want to show you what it is. As you have seen, my talk looks like this, in a web browser. I actually did the slides for this talk in IPython. So IPython is just a really nice interface to Python where you can do nearly everything, for example the slides for a talk. You can type in Python commands: you type in "import numpy as np" (I will show you later what this is), and you get something out. OK, so this was just a brief stop here.
So I will begin with NumPy. It is very good for fast vectorized arithmetic operations, because under the hood it's written in C. It also provides tools for integrating code which is written in C, C++ or Fortran, and as I already said, it's the basic building block of all the scientific libraries in Python. Since I show you a lot of code in my talk: I import the NumPy library here as np, and in all the following slides, whenever you see np, this is NumPy. All the code you see is a sequence of commands which are given to the Python interpreter.

The main object in NumPy is the array. The array is a container for homogeneous data. As you can see here, you create it with np.array; in this example you give it a list of lists, and you give it a data type. The data is contiguous, because under the hood it's a C array, which is the reason it's fast, and you get a NumPy array. What can you do with a NumPy array? For example, you can do vectorized arithmetic operations. If I want to multiply all the numbers in the array by 10, then instead of writing for loops which run over the rows and the columns, I just type data times 10, and I get an array of the same size in which every number was multiplied by 10. You can also do more sophisticated operations, as you can see in the last command on the slide, where I apply the sine function to the data and something else. These are vectorized operations, and they're really fast because, as I said, under the hood this is written in C.

NumPy also provides easy creation and reshaping of arrays. For example, if you want to create an array with two rows and three columns, filled with ones, you can use this command. You can also use arange, which creates a range from 0 to some number, and you can reshape as necessary: you take this range from 0 to 5 and reshape it into a two-dimensional object with two rows and three columns. NumPy also provides random number generation, as you can see here.

The good thing about NumPy is easy slicing and indexing. Here you have an array from 0 to 9, and like in C you can access the elements of this array with this command: if I want the fourth element of this array, I get it like this. Here you can see slicing: if I want the elements 6, 7 and 8, I can do it like this. You can also do more fancy slicing; with the last command I get every second element of the array. You can also do this for multidimensional arrays. Here we have an array with three rows and four columns, and if I want, for example, to get this subset of the array, I can do it like this. The first argument here describes the subset of the rows and the second one that of the columns, so I want rows 1 and 2 and columns 1, 2 and 3, and I get it like this. You can also do more fancy indexing. For example, if I want 0, 10 and 11, I can do something like this: 0 has the index (0, 0), 10 has the index (2, 2), and 11 has the index (2, 3), and I get them. So you can get every subset of the array. OK.
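The slides themselves are not part of the transcript, so the following is only a sketch of the kind of commands described above. The concrete values, the seed, and the use of np.random.default_rng (rather than whatever random call was shown on the slide) are assumptions made for illustration.

```python
import numpy as np

# Homogeneous 2-D array built from a list of lists with an explicit dtype.
data = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=np.float64)

# Vectorized arithmetic: no explicit loops over rows and columns.
scaled = data * 10
waves = np.sin(data) + 1.0

# Creation and reshaping.
ones = np.ones((2, 3))                  # 2 rows, 3 columns, filled with ones
grid = np.arange(6).reshape(2, 3)       # values 0..5 as 2 rows, 3 columns
rng = np.random.default_rng(seed=0)     # seed chosen only for repeatability
noise = rng.normal(size=(2, 3))

# Indexing and slicing on a 1-D array holding 0..9.
a = np.arange(10)
fourth = a[3]            # zero-based, so index 3 is the fourth element
middle = a[6:9]          # elements 6, 7 and 8
every_second = a[::2]    # every second element

# Slicing and fancy indexing on a 2-D array with 3 rows and 4 columns.
m = np.arange(12).reshape(3, 4)
block = m[1:3, 1:4]                   # rows 1-2, columns 1-3
picked = m[[0, 2, 2], [0, 2, 3]]      # elements at (0,0), (2,2), (2,3)
```

Note that slices such as block are views onto the original array, while fancy indexing as in picked returns a copy.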
You can also do boolean indexing. For example, if I want all the data in this array which is greater than 4, I just type data[data > 4], and I get all the numbers which are greater than 4. Note that the shape of the array has changed, because the data I cut out can no longer be represented in the same size and cannot be converted to the matrix shape it had before. I can also do more math-like stuff. For example, if I want to cut out the 0, 3, 6 and 9, I can do it like this: take the data modulo 3, and where that is not 0, give something back. So you get an array where the 0, 3, 6 and 9 are cut out. I can also write data like this; as you can see here, I replace the values 0, 3, 6 and 9 by the value 100.
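A minimal sketch of the boolean indexing just described, assuming the array on the slide held the integers 0 to 11 in three rows and four columns (the transcript does not show it):

```python
import numpy as np

data = np.arange(12).reshape(3, 4)

# A boolean mask selects matching elements; the result is always 1-D,
# because the selection generally does not fit a rectangular shape.
greater_than_four = data[data > 4]

# Keep everything that is not a multiple of 3 (drops 0, 3, 6 and 9).
not_multiple_of_three = data[data % 3 != 0]

# A mask can also be used for assignment. Work on a copy so that
# `data` itself stays unchanged.
replaced = data.copy()
replaced[replaced % 3 == 0] = 100
```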
OK, I will skip this; you can also do linear algebra with NumPy. So now I come to matplotlib. As I said, this is a library for publication-quality plots, and it's highly configurable: because it's aimed at scientific publications, you want to configure every bit of the figures in your publications, and this makes it somewhat difficult to learn. The idea of matplotlib is that you have a Matlab-like interface, and such a Matlab-like interface is provided, so if you know Matlab you can easily switch to matplotlib. Because I don't know Matlab, I will show you the hard way, in some way.

In the following I will import matplotlib.pyplot as plt; other parts of matplotlib that I need I will import as needed. The first thing I do in my IPython notebook is type %matplotlib inline, and then all the plots show up inside my notebook, inside the browser. In matplotlib, all plots live inside a Figure object. You create it like this, with plt.figure, and then you get the Figure object. Then you have to put some plots into the figure with the add_subplot command. You have to give this command a grid of subplots, that is, the number of rows, the number of columns and the position in the grid. In this example I only want one subplot, so I give one row, one column and position 1, and I get an Axes object. This is the object you plot things into; if you have more subplots, you have different Axes, and you can specify where you plot your stuff by using them. So you take the Axes object and call its plot routine. You can see here that I created an array x which runs from 0 to 3π, with equally spaced steps from the linspace command, and I now want to calculate sin(x²) from it. Again, I can apply the NumPy sine function to this array and get a new array y, which contains the values of the function, and if I then take the Axes and call plot, I get such a thing. It looks really nice, I think, but there's something missing, which is the labels.

The next cool thing about matplotlib is that you can put LaTeX-rendered labels in it. Forget about all this stuff above; it is the same as on the last slide. First of all, I label this plot with this string, and if I include these dollar signs, I can put in LaTeX code, and it gets rendered like this, as you can see here. I can also specify the ticks: on the x-axis I want to have 0, π, 2π and 3π, and I do it with set_xticks, giving it a list with 0, π, 2π and 3π. I want to label them with the appropriate labels, and that is done with set_xticklabels. Here I also use LaTeX commands, as you can see inside this list of strings. I set a title, and I put the legend somewhere; I choose the option 'best', so there's an algorithm which finds the best location for the legend. The last thing is that I label the x-axis with a LaTeX-rendered x character. I think you can see that with a few commands you get a really nice-looking plot for your publication.

You can do more. Here I have an example with three subplots. As you can see, you again get the Figure object and create three subplots with one row and three columns, and I plot three different things in these three subplots. The first one is a histogram. I give it this array, which is random: hundreds of numbers from the normal distribution. I give it 20 bins; I can also specify the color and the transparency, and the result is shown here. I can also do a scatter plot: I give it the numbers from 0 to 29, and the numbers from 0 to 29 with a Gaussian distribution on top, so again there is a spread here, and the scatter plot makes the dots here. The last thing, on the right side, is again a random distribution and its cumulative sum, which is also provided by NumPy, and I want to plot it in red with a dashed line; I can do this very conveniently by giving this string here, 'r--'. There you can also see that the plot command in this case only needs the y values; the x values are then just the indices of this array, starting from 0.
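The two figures described above could look roughly like the following sketch. The sample sizes, the title text, the noise scale in the scatter plot and the off-screen Agg backend are assumptions for illustration; in a notebook one would use %matplotlib inline instead of selecting a backend.

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Figure 1: a single subplot with LaTeX-style (mathtext) labels.
x = np.linspace(0, 3 * np.pi, 500)   # 500 points is an arbitrary choice
y = np.sin(x ** 2)

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)        # 1 row, 1 column, position 1
ax.plot(x, y, label=r"$\sin(x^2)$")
ax.set_xticks([0, np.pi, 2 * np.pi, 3 * np.pi])
ax.set_xticklabels(["0", r"$\pi$", r"$2\pi$", r"$3\pi$"])
ax.set_xlabel(r"$x$")
ax.set_title("A labelled plot")      # hypothetical title
ax.legend(loc="best")                # let matplotlib pick the legend location

# Figure 2: three subplots in one row.
rng = np.random.default_rng(seed=0)
fig3 = plt.figure(figsize=(12, 4))
ax1 = fig3.add_subplot(1, 3, 1)
ax2 = fig3.add_subplot(1, 3, 2)
ax3 = fig3.add_subplot(1, 3, 3)

ax1.hist(rng.normal(size=500), bins=20, color="k", alpha=0.3)
ax2.scatter(np.arange(30), np.arange(30) + 3 * rng.normal(size=30))
ax3.plot(rng.normal(size=60).cumsum(), "r--")   # only y given: x is 0..59

# Export for a publication; written to memory here instead of a file.
buffer = io.BytesIO()
fig3.savefig(buffer, format="pdf")
```

Passing a file name such as "figure.pdf" to savefig instead of the buffer would write the figure to disk.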
They run up to 59 here.

[Audience question about export formats.] You can export it in different formats. I'm not sure what all is possible, but I think you can do PDF and all the things you need for scientific publications. [Audience question, unintelligible.] You mean if you write your publication in LaTeX, or what do you mean? I have never done this, so I'm not sure about that. [Audience comment, unintelligible.] It's a good question; I'm not sure about that. [Audience question about math mode and including normal LaTeX text, partly unintelligible.] This is what I'm not sure about. [Unintelligible.] Yeah, but not the labels in the plot. I mean, the question is, for example here, if I want my plot title written in the LaTeX font, whether I can render this. I'm not sure if this is possible. We can try it out later, maybe, if there's time.

OK, so now I come to pandas. As I said, this is a library for data analysis, and the first obvious thing you will see is that the data structures are like NumPy arrays, but there is also an index label and a column label. You have integrated support for time series, which is especially important for my problem, because I have expenses on different dates and I want to analyze the time dependence. It can also handle missing data, and it has functions for sophisticated data transformation. In the following I will import pandas as pd, as shown here.

The main objects in pandas are Series and DataFrames. If I have one-dimensional data, I use a Series. As you can see here, I create a pandas Series by giving it a NumPy array of random numbers, and I also provide an index, which is a list of strings in this case; but I can also use other types, not only strings but integers or floats. If I print it, you can see that there's this index, the data, and the data type. So that's a Series. The other thing is a DataFrame, for two-dimensional data. You can imagine this as something like an Excel spreadsheet: you have two-dimensional data, you have an index, and you have named columns. You create it like this, with pd.DataFrame, and you give it a NumPy array. Again I create six random numbers and reshape them into a two-dimensional object with two rows and three columns; then I give the column names Alice, Bob and Charles, and I give the index names. So this is the DataFrame object, which is the main data object in pandas.

What can I do with it? First of all, I can select columns in this spreadsheet, for example by giving a label inside these brackets. I go back for a moment: I want this column, and I can do it like this, and you can see that this gives me back a Series. If I want to know the type of this object, you can see that it is a pandas Series. You can get all the subsets of DataFrames with the ix command, and the good thing about it is that you can use not only indices but also labels. As you can see here (I go back for a moment), I want to get this row here, row 1, and I give the ix command the label 1 as the first argument, and as the second argument I give it indices; in Python, this colon symbol indicates that you get all the indices in the array. Indeed, this gives you back this row, but now the column names are transferred to the index, so I have Alice, Bob and Charles as index labels. There is also a name attribute in the Series, which is then, of course, the former index label. As I said, I can also do index-based slicing: here I get row 1 and the zeroth and second elements in the columns, so I get a Series of two elements.
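The ix accessor used in the talk was removed in pandas 1.0, so the sketch below uses its documented replacements instead: loc for label-based and iloc for position-based selection. The index labels "one" and "two", the Series index and the random values are assumptions, since the slide contents are not in the transcript.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# One-dimensional data: a Series with a string index.
s = pd.Series(rng.normal(size=3), index=["a", "b", "c"])

# Two-dimensional data: six random numbers reshaped to 2 rows x 3 columns.
df = pd.DataFrame(
    rng.normal(size=6).reshape(2, 3),
    columns=["Alice", "Bob", "Charles"],
    index=["one", "two"],            # hypothetical index labels
)

bob = df["Bob"]              # selecting a column returns a Series
row = df.loc["one", :]       # label-based: one row, all columns
pair = df.iloc[1, [0, 2]]    # position-based: row 1, columns 0 and 2

print(type(bob).__name__)    # Series
print(row.name)              # the former index label, "one"
```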
You can also do element-wise function application. If I want to apply some function to every element in the DataFrame, I can do this with the apply function. I define a function which adds 100 to all elements, I just apply it to the DataFrame, and I get a DataFrame with all the numbers increased by 100. There are also functions included for statistics, like sum or mean. You can see that sum just sums up each column and gives you back a Series with a new index, Alice, Bob and Charles, and the data is now the sum of each column.

The next thing you can do is merge data of different shapes, and this is a really cool thing about pandas. Here you can see two different DataFrames with different dimensions, but they share a key. Here I have data1 and key, and here you have data2, also with key. If you want to merge them somehow, what would be the natural thing to do? They share a key, so if I merge them I get something like this: I now have three columns, data1, data2 and key, which was shared between these two DataFrames. The important thing about the second DataFrame is that there is a one-to-one correspondence between data2 and the key. pandas will recognize this one-to-one correspondence: in the second frame, whenever the key is a, data2 is 3, and this is exactly what happens when you merge. Whenever the key is a, data2 is 3. So it's somehow intelligent merging, and there is a lot of stuff like this; there are really powerful data-merging mechanisms included in pandas.

You can also concatenate data. Here you can see two DataFrames of different shapes; here you have four columns and here three columns. If I want to concatenate them, I glue them together like this, and I get something like this.
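A small sketch of function application, column sums and merging on a shared key. The numbers and the helper name add_100 are made up for illustration; only the column names data1, data2 and key and the pairing of key a with the value 3 come from the talk.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    columns=["Alice", "Bob", "Charles"],
    index=["one", "two"],
)


def add_100(values):
    """Add 100; applied to a whole column, this works element-wise."""
    return values + 100


shifted = df.apply(add_100)   # apply calls the function once per column
totals = df.sum()             # Series indexed by Alice, Bob, Charles

# Two frames of different shape that share the column "key".
left = pd.DataFrame({"data1": [1, 2, 3, 4], "key": ["a", "b", "a", "c"]})
right = pd.DataFrame({"data2": [3, 7, 9], "key": ["a", "b", "c"]})

# Each row of `left` picks up the data2 value that belongs to its key,
# so every row with key "a" gets data2 == 3.
merged = pd.merge(left, right, on="key")
print(merged)
```

By default pd.merge performs an inner join; the how parameter selects left, right or outer joins if keys are missing on one side.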
You can see here that, despite the fact that the two sources had different dimensions, the concatenation works; but because there was no data in the fourth column of the second DataFrame, it puts in NaN, not a number. (25 minutes, right? Looking good.) OK. And here you can see another cool thing about pandas, which is the handling of missing data. It will just figure out what the right thing to do is: I want to concatenate two things which are not compatible at first sight, but you can do it with pandas, and it just fills in the missing values with NaN. And if, say, I now have two rows with the same value and I want to drop one, you can also do that with pandas. You just use the drop_duplicates routine, and you give it as an argument the names of the columns, in this case a. It will search for equal entries in this column and drop one of them, and this is exactly what happens. You can also see that it keeps the one with the number here in the fourth column.

For my problem it was really important that I can analyze time series, and this is also completely included in pandas. Here I take the data from the slide before and rewrite the index. The index entries are datetime objects from the datetime library, so I import datetime and create a new index by giving a list of datetimes, as you can see here, and passing it to the index of the DataFrame object. I also give a name for the index and a name for the columns, and you can see that they now appear here, the name for the index and the name for the columns. Indeed, if I look at the zeroth entry of the index, you can see that this is a timestamp, of the type Timestamp. The next thing is plotting this data. In contrast to NumPy, pandas provides more convenient plotting routines.
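The concatenation, duplicate removal and date index described above can be sketched as follows. All values, the column names a to d, the dates and the axis names "date" and "item" are hypothetical stand-ins for what was on the slides.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Two frames with different numbers of columns.
df1 = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list("abcd"))
df2 = pd.DataFrame([[0, 10, 20], [40, 50, 60]], columns=list("abc"))

# Column "d" does not exist in df2, so its rows get NaN there.
both = pd.concat([df1, df2], ignore_index=True)

# Two rows share a == 0; by default the first occurrence is kept,
# which here is the row that has a number in column "d".
unique = both.drop_duplicates(subset=["a"]).copy()

# Replace the integer index with dates and name both axes.
unique.index = pd.DatetimeIndex(
    [datetime(2014, 1, 1), datetime(2014, 1, 2), datetime(2014, 1, 3)],
    name="date",
)
unique.columns.name = "item"

print(type(unique.index[0]).__name__)   # Timestamp
```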
So on top of matplotlib, pandas provides more convenient plotting routines. If you want to plot this data and you already have an index of dates, a name for this axis and a name for that axis, it would be nice if you could plot it without having to give the axis names and the labels. This is exactly what pandas does. If I take this DataFrame and call plot, it automatically does all the labelling for me. As you can see here, I had four columns, I even had one column with NaNs in it, but it works. You have the plot, you have the dates as labels on the x-axis, you have the x-axis label, and you also have the labels of the columns here, with just one command, plot. This is really nice about pandas: if you're lazy, you can just plot and everything is there.

OK, and now I come to the use case, which is why I did all this stuff: I wanted to analyze my personal expenses. They look like this. If you get an account statement from your bank, you get dates, the amount of money which was transferred, and some description. I analyzed my data here, so this is my data, but [unintelligible]. The first thing you have to do is categorize all this stuff, because I want to know how much money is spent on different things. By the way, I have done this with a tool I have written, called [name unintelligible], which does this. It reads in your account statements from your bank, prints out a description, and asks you which category it is. Then you can create rules, so that every time a certain name appears in the string here, it's food, for example. I will not show you how this works in detail, but the result is this: on top of the data from the bank, which has these three columns, I get a fourth column, which is a category. So all of the items in this data are categorized.
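The speaker's categorization tool is not shown in the talk, so the following is not its implementation, only a minimal sketch of the rule idea: a keyword found in the description maps the entry to a category. The statement rows, the RULES table and the function name categorize are all hypothetical.

```python
import pandas as pd

# Hypothetical account statement: date index, amount and description.
statement = pd.DataFrame(
    {
        "value": [23.50, 60.00, 12.90, 45.00],
        "description": [
            "SUPERMARKET 123",
            "FUEL STATION",
            "Bakery Main Street",
            "SUPERMARKET 456",
        ],
    },
    index=pd.to_datetime(["2013-11-02", "2013-11-05", "2013-12-01", "2013-12-20"]),
)
statement.index.name = "date"

# Hypothetical rules: keyword in the description -> category.
RULES = {"SUPERMARKET": "food", "BAKERY": "food", "FUEL": "car"}


def categorize(description, rules=RULES, default="uncategorized"):
    """Return the category of the first rule whose keyword matches."""
    text = description.upper()
    for keyword, category in rules.items():
        if keyword in text:
            return category
    return default


# Add the category as an extra column next to the bank data.
statement["category"] = statement["description"].map(categorize)
print(statement)
```

An interactive tool like the one described would presumably ask the user whenever categorize falls back to the default, and then store the answer as a new rule.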
The other thing I want to know is how much money is spent on each category, and pandas provides a very useful tool for this, which is the groupby object. I take the DataFrame from the slide before, group it by category and calculate the sum, and I get exactly what I want in one step. I now have the category as the row index, it is named correctly, and I get the sum for each of these categories in the first column. Again I can plot this really easily with pandas. First, I don't need the description anymore, so I have to get the subset which is the first column, and I do this with the ix command. I take the DataFrame of group sums from the slide before and use the ix command: I give it a list of categories as the first argument, so I want the categories car, [unintelligible], kids, media, restaurant, sports and so on, and the second one is 'value', because I only need the value column. So pie_data, on the left-hand side, is just a Series which now contains the data I want, and I can plot it just with plot. I want to make a pie chart, so I give it kind='pie' and a title, and it does everything for me. I have a nice picture with one command, and I can see that I eat a lot. As I said, this is not really my data, and it changes every time I create the slides, because it's random.

OK. This data shows the expenses on food over the whole time span, but of course I also want to analyze how much I paid for food in January, or the change over the months, or something like that. Therefore I can restrict the data to a specific time span, and this is also a very cool feature. I can do it like this: I just give it a string which gets recognized as a date. I want to slice from the first of February to the first of April and do the same groupby-category-and-sum routine, and I get the same kind of data, but now you can see the numbers are smaller, because it's only for two months instead of six; the data before was for six months. I can also look at the expenses over time with the groupby mechanism. I take the DataFrame, group by category, and now get only one group, the food group in this example, and I get a DataFrame which only contains the category food. Again I can plot it in a nice way. Here you can see again this convenient plotting style argument: I want to have dots and lines, and it's just 'o-', and I get nice plots with the right labels. OK, this plot doesn't say that much, but you can do it for other stuff, and then this can be interesting.

As I said, I want to get the monthly expenses for food or the car or something like this, and therefore I have to sum up all the expenses in a given month, so aggregate them, and this can be done in this way. I go back here; I have food in this case.
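The grouping, pie chart and date-string slicing can be sketched on synthetic data as below. The expense values are random, the category list and the cyclic assignment of categories are assumptions, and loc stands in for the ix accessor from the talk, which no longer exists in current pandas.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic expenses for six months, one entry per day.
rng = np.random.default_rng(seed=0)
dates = pd.date_range("2013-11-01", "2014-04-30", freq="D")
names = ["car", "food", "kids", "media", "restaurant"]
expenses = pd.DataFrame(
    {
        "value": rng.uniform(1.0, 50.0, size=len(dates)).round(2),
        "category": [names[i % len(names)] for i in range(len(dates))],
    },
    index=dates,
)

# Total per category in one step; the category becomes the row index.
group_sums = expenses.groupby("category").sum()

# Pick some categories (rows) and the value column, then draw a pie chart.
pie_data = group_sums.loc[["car", "food", "media"], "value"]
fig, ax = plt.subplots()
pie_data.plot(kind="pie", ax=ax, title="Expenses by category")

# Date strings are recognized when slicing a DatetimeIndex.
feb_mar = expenses.loc["2014-02-01":"2014-04-01"]
feb_mar_sums = feb_mar.groupby("category").sum()

# All rows of a single group, plotted over time with dots and lines.
food = expenses.groupby("category").get_group("food")
fig2, ax2 = plt.subplots()
food["value"].plot(ax=ax2, style="o-")
```

Slicing with date strings includes both end points, so the restricted frame here runs up to and including the first of April.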
Now I want, for example, to sum up everything for December here. This can be done with the resample routine. I give resample an 'M', which stands for month, and how='sum'; you can also take the mean or something else, but I want the sum. So I do it like this, and I get a Series which always has the end of the month in the date column, and I have the sum of the monthly expenses on food in the first column. I don't want to have specific dates here but time spans: the date should not correspond to the day, the 30th of November 2013, but to November 2013. In pandas you can easily convert timestamps to periods, and this is done with this single command, to_period. Then you get the same data, but now you have another data type, which is not a timestamp anymore but a period. Now I have what I want: the monthly expenses for food for all six months. Again I can plot it; I just type plot and I get a nice plot.

You can do more fancy stuff, for example calculate the relative change in the monthly food expenses, by taking the monthly expenses and dividing them by the data shifted by one month with this shift command, and then subtracting one to get the relative change, with just one command. You can see that it gives you a Series, and you can do much more with pandas, of course.

The last thing: I don't want to do this only for food, but for a lot of categories, and see how they changed over time. So I want something like this, and you can see that with these commands I get what I want. I create a new DataFrame, and I loop over all the categories I'm interested in. Then I use the groupby mechanism to get the data for a specific group, and then resample it in the same way, with month and sum; and here you do the transformation to periods in one step with this kind argument. Then you append each column to the DataFrame, and in the end the DataFrame shown here has exactly what I want: I have the expenses for the car in November, December, January, February and so on. Again this can be plotted with a single command in a nice way. This time I take plot with kind='barh', a horizontal bar plot; I want it stacked, so on top of each other, and I give it a transparency value. I don't have to label anything; everything is there, and I now see where my money went, in a really nice way.

OK. I hope I have shown you that pandas is a really nice tool for data analysis, and it's freely accessible and easy to use. Questions?
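The talk uses resample with a how argument and a month alias. Current pandas no longer accepts how, and the spelling of the month-end alias has changed between versions, so the sketch below swaps in a different technique that gives the same monthly totals: grouping by the index converted to monthly periods with to_period. The data is synthetic, and computing the relative change against the previous month via shift(1) is an assumption about what the slide showed.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Synthetic expenses for six months, one entry per day.
rng = np.random.default_rng(seed=0)
dates = pd.date_range("2013-11-01", "2014-04-30", freq="D")
names = ["car", "food", "kids", "media", "restaurant"]
expenses = pd.DataFrame(
    {
        "value": rng.uniform(1.0, 50.0, size=len(dates)).round(2),
        "category": [names[i % len(names)] for i in range(len(dates))],
    },
    index=dates,
)


def monthly_sum(values):
    """Sum a date-indexed Series per calendar month (PeriodIndex result)."""
    return values.groupby(values.index.to_period("M")).sum()


# Monthly food expenses, labelled by month instead of by a specific day.
food = expenses.loc[expenses["category"] == "food", "value"]
monthly_food = monthly_sum(food)

# Relative change with respect to the previous month; the first is NaN.
relative_change = monthly_food / monthly_food.shift(1) - 1

# The same aggregation for several categories, one column per category.
wanted = ["car", "food", "media"]
monthly = pd.DataFrame(
    {
        name: monthly_sum(expenses.loc[expenses["category"] == name, "value"])
        for name in wanted
    }
)

# Horizontal, stacked, slightly transparent bars with automatic labels.
ax = monthly.plot(kind="barh", stacked=True, alpha=0.7)
```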
The question was how the performance is for big data. From what I've heard, it's quite good, but I'm no expert in this field, because, as I said, this is a private project with small data. But I know that this is a topic at the PyData conferences or something like this, so I think it's also useful for big data. [Audience comment, unintelligible.] OK, so the performance is good enough.

[Audience question, unintelligible.] I don't know R, so I cannot answer the question; I just know Python. It's just a matter of taste. I mean, the good thing about Python and the IPython notebook is that it aims to be a tool for the whole scientific workflow, so that you can do everything: you can do the programming, you can do scientific publications, you can write a paper in it. This is the goal of the IPython notebook, I think, or one of the goals. It looks really good, but it is not there yet, in my experience. Maybe this is a reason to use Python for that.

[Further audience questions and comments, largely unintelligible.] I mean, pandas doesn't cover everything, of course, but I think then you can use SciPy or something like this. [Unintelligible.] I am not sure; I'm really new to this stuff. [Unintelligible.]

OK, I think there are no more questions. Thank you.


Formal metadata

Title Data Analysis and Visualization with Python
Series title FrOSCon 2014
Part 07
Number of parts 59
Author Stollenwerk, Tobias
License CC Attribution - NonCommercial 2.0 Germany:
You may use and modify the work or its content for any legal, non-commercial purpose and copy, distribute and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify.
DOI 10.5446/19628
Publisher Free and Open Source Software Conference (FrOSCon) e.V.
Release year 2014
Language English

Content metadata

Subject area Computer science
Abstract Data Analysis and Visualization with Python: usage of NumPy, pandas and matplotlib for a personal bookkeeping software. We demonstrate the usage of Python's scientific tools NumPy, pandas and matplotlib for data analysis and visualization. As a use case, we present a Python tool for personal bookkeeping. The talk will include: ······························ Speaker: Tobias Stollenwerk. Event: FrOSCon 2014 by the Free and Open Source Software Conference (FrOSCon) e.V.
Keywords Free and Open Source Software Conference
