Using Python pandas for scientific Research

Video in TIB AV-Portal: Using Python pandas for scientific Research

Formal Metadata

Using Python pandas for scientific Research
Title of Series
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
With the pandas library there is a powerful alternative to scientific programming languages such as R, Octave or Matlab. Originally designed for the analysis of financial data, it has become a standard for data handling and manipulation and is widely used not just in science but also in the financial industry. In this article we describe how pandas can be applied in everyday analyses where efficient data handling is required and how it can be integrated with other Python libraries such as numpy or matplotlib.
So my first question, and the most important question, is: who does not speak German? OK, I see a few hands raised, so that means I will continue in English. The slides are in English anyway, and the paper is in English anyway. Welcome to the presentation "Using Python for Scientific Research". Originally I handed in this talk as "Using pandas for Scientific Research", but when I prepared the talk I noticed that there is so much more; you can write pages and pages just about pandas. So I extended it a little, but I hope you will like it anyway.

What I want to do is introduce you to SciPy, the scientific Python package. I want to discuss data handling with pandas a little, since that is a small part of my day job, and then I want to do a brief data analysis with the Swiss banknote data set. You will find the slides online, so it is not really necessary to write everything down. I can give you a link afterwards, and you will also find the slides on my blog; they will probably be up there this evening or tomorrow evening.

OK, if there are any questions, just raise them, also during the talk. If it gets too complicated, then please come to me later; I am sitting at our booth in the Mensa.

OK, just a few words about me; I was already introduced. I am by training a Diplom-Kaufmann, so I studied business administration a long time ago at university, where I also did my PhD in computational statistics. After that I worked for a bank in Cologne, in the private equity division, so you could say something like Wall Street, but it is not that bad. Since October 2015 I have been an analyst in the credit and treasury department of another Cologne-based bank. So what do we do? Well, we have a big securities system, and I am responsible for making sure that all the components of the system are working properly. It is a really big Java-based system, so it will take some time to really get into the details.

Besides that I am, of course, a LaTeX enthusiast, so the slides were of course made in LaTeX. If you want to know more about LaTeX, just come and visit us. Besides that I am treasurer of a makerspace in Cologne, so if you happen to have a big room in Cologne, please talk to me; we are looking for new rooms. OK, so far so good. Who of you does not know Python?
Any Python novices? OK, I see a few. Let me just say it was started in the late 1980s by Guido van Rossum in the Netherlands, so it is a programming language that does not originally come from the US. It is pretty readable; you can understand Python code when you read it, and there is a rich standard library, so whatever you want to do, there is a high chance that you can start with Python.

My introduction to Python came when I wanted to use a download script for Save.TV, some kind of online TV recorder based in Germany. Whatever you have recorded you have to download, and after 20 or 30 videos that you had to download manually it gets pretty tiresome. Then I looked for ways to get it done automatically and found this little script that was written in Python. I had a look at it and it was readable; I understood it from the first moment. Then I said, well, Python might not be so bad. OK, this indentation thing, with no braces and brackets, is a little bit strange, but you get used to it. Since then I have stuck with Python, and now I use it for everyday work like system administration or sending out e-mails, whatever comes across.

Here is a basic Hello World example. Since you already know Python you won't see anything new on this, so let's briefly skip that. If you didn't know Python and had a look at this, you would probably understand it anyway; it is pretty readable and understandable.

My introduction to scientific Python and pandas came through a colleague of mine who left our employer and had written a pretty complicated system with the help of Python and pandas. It was merging data sets against each other, and everything was stored in a big database. So from one minute to the other I was the person responsible for continuing and maintaining the software package, and I had not known about pandas before. That was pretty interesting; I came across it, found it pretty easy to use, and I liked it.

OK, what is pandas? Well, Wikipedia, or rather the pandas web page, says: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It was initially developed by Wes McKinney at AQR Capital; that is a big private equity and hedge fund manager in the US, and they did high-performance quantitative analysis. So they had data, they needed to merge data, they had to aggregate it and so on, and he wrote this library for Python. He was able to convince his bosses to make it open source, which is pretty awesome. Important parts are implemented in C or Cython, so it is quite fast. The current version is 0.18.1, and I can definitely recommend having a look at it if you have to do some data management or data analysis and would like not to use R.

pandas is part of a larger package, the so-called SciPy framework. Besides pandas there are a few more tools. First NumPy: that is the basic library that does all the matrix handling and array handling, together with things like vector manipulation and transformation. Then there is IPython, which is pretty awesome. I cannot show it to you today, so I will not do a live presentation, but if you have ever worked with Mathematica or Maple: IPython offers a similar way of working with input cells. It is pretty cool. Instead of that I use Spyder, which is a Python-based development environment that allows you to conveniently work with the data, the files and the source code. Maybe in the end I can show that briefly. matplotlib is for scientific plotting; it is also the basis for the plotting library that I use today. If you have learned something like MATLAB, it will be easy to work with. And then there is SymPy; that is for symbolic mathematics. If you know Mathematica, that is something similar.
There are many, many more packages that are somehow part of the SciPy ecosystem, so there is a good chance that, whatever you are doing in daily research, there is a corresponding Python package or SciKit. So just have a look.

If you want to use SciPy, there are several options for how you can work with it. First of all, if you work with Linux or Mac OS, Python is already part of your system installation, and you can install the necessary packages manually, which might be a bit tricky since there are lots of dependencies. So what I do recommend is to use a dedicated Python distribution for scientific Python. There is WinPython, just for Windows, which I used until recently and which worked pretty well. Then I got stuck with Anaconda, because that is also available for Linux and for Mac, so I have the same look and feel on all the platforms, which is from time to time pretty helpful. When you have downloaded Anaconda, it presents you with this interface where you can select what you want to start, like the IPython interface, the IPython console, Spyder and some other tools which I have not used so far because I did not need them. So it is definitely worth a look.
OK, what I still want to show you after this introduction is data handling with pandas, like loading data, transforming and filtering, and analysing a real data set, the so-called Swiss banknote data.

OK, let's have a look at what pandas actually does for Python: it provides data structures. Who of you has ever worked with R? OK, I see a couple of hands. R also has the concept of a data frame, some kind of two-dimensional array with different data types, and pandas provides the same architecture for Python. There is another data type which is pretty important, the so-called Series. So what is a Series? Do we have a pointer here, or something like that? OK, let's do it without. A Series is some kind of vector that has one data type and has an index, so the different elements in the vector are addressable via the index. The index can be numeric, from 0 to whatever; it can also be string-based, and so on.

The DataFrame extends the Series by one dimension, so you do not have just one vector column, but you have an array of columns, and all those columns share the same index. So you can have mixed data types: this one is float, this one is string, and we can have more strings, integers, whatever, or time series and date objects. If you have any questions, just ask.
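The two structures described above can be sketched in a few lines. This is a minimal illustration, not code from the slides; all column names and values are invented for the example.

```python
import pandas as pd

# A Series: one column of values plus an index used to address them.
prices = pd.Series([1.5, 2.25, 3.0])                       # default numeric index 0, 1, 2
amounts = pd.Series([10, 20, 30], index=["a", "b", "c"])   # string-based index

first_price = prices[0]      # lookup by numeric label
amount_b = amounts["b"]      # lookup by string label

# A DataFrame: several columns of different types sharing one index.
frame = pd.DataFrame(
    {
        "value": [1.5, 2.25, 3.0],        # float column
        "name": ["x", "y", "z"],          # string column
        "count": [1, 2, 3],               # integer column
        "day": pd.to_datetime(["2016-08-20", "2016-08-21", "2016-08-22"]),
    },
    index=["a", "b", "c"],
)
row_b = frame.loc["b"]       # one row, addressed through the shared index
```

Every column of `frame` is itself a Series, and `frame.loc["b"]` picks the matching element from each of them because they all share the index `a`, `b`, `c`.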
OK, how can I create such a data frame? Well, I have different options. For me the most important thing is that I usually load data frames from the hard disk, but you can also create one manually. Here, for example, I create one Series, one pandas Series with just a few elements, then another Series with just some strings, and I simply concatenate them, and then I get a data frame. Alternatively I could say I want to create a data frame directly from some array-like structures: the content is a vector for each column, and these are the column names that come next. Well, here we had a few problems with the beamer; I hope that will not happen again. OK, I can also say: let's take the Series, make it a data frame, and then join it with the second Series, which I also converted to a data frame. And finally I could use a dictionary of those two vectors and create a new data frame as well. Sometimes it is a little confusing what the best way to do it is, but there are different options. In general, also when you work with data filtering, you will find different ways of doing things in pandas.

OK, as mentioned, what I usually do is read data objects from the hard disk, and there are different options you can use, like pickle. Who does not know pickle? OK. Pickle is a way of serializing things. Python is also an object-oriented programming language, and you can have objects in memory; if you want to store those objects from memory to a database or to some file, you need to flatten them so that they fit into a file, and pickle is one way of doing this. read_table is a command for general table-like formats, a generalization of read_csv for comma-separated values. Then we have read_fwf for fixed-width formats. I have never come across such a fixed-width format myself.
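The construction routes mentioned above (concatenating Series, building from array-like data with column names, joining two single-column frames, and using a dictionary) might look as follows. The names `numbers` and `words` and their values are assumptions made for this sketch; the slides may use different data.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3], name="A")
words = pd.Series(["x", "y", "z"], name="B")

# 1) Concatenate the two Series side by side (axis=1 means "as columns").
df_concat = pd.concat([numbers, words], axis=1)

# 2) Build the frame directly from row-like data plus column names.
rows = list(zip([1, 2, 3], ["x", "y", "z"]))
df_direct = pd.DataFrame(rows, columns=["A", "B"])

# 3) Turn each Series into a one-column frame and join them on the index.
df_join = numbers.to_frame().join(words.to_frame())

# 4) Pass a dictionary that maps column names to the Series.
df_dict = pd.DataFrame({"A": numbers, "B": words})
```

All four variants produce the same two-column table, which is why it can feel confusing at first which one to pick; in practice the dictionary form is often the shortest to read.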
But I know that, especially in finance, there are some formats where you really have these fixed widths: the first 50 characters are the name, the next 50 characters are the amount or whatever, and it can be really funny to work with. pandas supports that. You can also read directly from the clipboard: you copy with Ctrl+C in Excel and then read it in Python; pandas has read_clipboard for that. The thing that I usually use in my daily work is read_excel, because we have lots of Excel files, and also big Excel files, that we need to work with. The maximum for me was, I guess, about 200 or 300 megabytes that I had to read and work with. Reading 300 megabytes in Python is a bit slow, but it works. There are still other commands for HTML, JSON and HDF5, which I have not used. What I should mention is that these are the read commands; for writing data there are corresponding write commands as well.

Here is one example where I really like pandas. We have a proprietary software that uses a weird date format like 14 MAR 1983. It gives me a CSV file, and I want to open that in Excel. If you open it in Excel, Excel understands the dates sometimes, but not all the time, because something like June or July does not get interpreted as a date. And if you have hundreds or thousands of lines, you cannot fix that manually. What I simply need to do is take this CSV file, transform the dates, and save the data in Excel format.

So here is the necessary code. I simply load the pandas library and read the CSV file in, under the assumption that there are some things that need to be adjusted, like the column separator or the decimal separator, et cetera. I have the date column, that is the column with the dates inside, and I say that this column should be the same column, but converted to a datetime object, and then I say: put everything to Excel. And here we go. So with just four lines I can save a lot of time, because if you have to do that manually, or with some Visual Basic for Applications code in Excel, you will know that it is not a nice job. This is a pretty cool way.

OK, learning from the example, what have we just seen? We loaded the pandas library, we loaded the data in CSV format, we converted the dates to Python datetime objects, which you will also see on the next slides, and we saved the data in Excel format again. That's it. And if you have a look at the code, even without knowing Python it is pretty likely that you understand what it actually means. That also makes it easier to take some code from somewhere else and adjust it, because you can say: OK, that is the line where it converts the date column; let's adjust that and see what comes out.

What is really, really cool in my daily job is the way pandas allows me to select and filter data. Just a few weeks ago my task was to generate a report about some trades: as a bank we sell securities and options and so on, and I simply wanted to get a list of how much was sold. I used pandas for that, because it makes it easier; it always takes a lot of time when you have to do that manually. I used a lot of this pandas selection and filtering. For example, I have a really big data frame with, let's say, 200 columns, and I simply need five or ten of them.
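The four-step conversion described above (load pandas, read the CSV, convert the date column, write Excel) could be sketched like this. The file content, the column names, the semicolon separator, the decimal comma and the exact date pattern `%d %b %Y` are all assumptions for illustration; the real export is not shown in the talk. Month abbreviations are parsed according to the current locale, so this sketch assumes an English or C locale.

```python
import io
import pandas as pd

# Hypothetical export: semicolon-separated, decimal comma, dates like "14 Mar 1983".
raw = io.StringIO(
    "name;booking_date;amount\n"
    "Alice;14 Mar 1983;1234,50\n"
    "Bob;01 Jun 2016;99,25\n"
)

# Read the file, telling pandas about the separator and the decimal mark.
df = pd.read_csv(raw, sep=";", decimal=",")

# Replace the text column by real datetime values.
df["booking_date"] = pd.to_datetime(df["booking_date"], format="%d %b %Y")

# Writing the result needs an Excel engine such as openpyxl to be installed:
# df.to_excel("bookings.xlsx", index=False)
```

With a real file, `raw` would simply be the path to the CSV file. Once the column holds datetime values, Excel receives proper dates instead of text it has to guess at.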
I can tell pandas which columns I am interested in; in this case it is just column A and column B. So I assign to my data frame a new data frame, but with the selection of only those two columns. OK, then I could say I only want the first two rows. Then we have the slicing way of selecting rows by simply saying: everything up to a given row. Just keep in mind that the very first row in a file is addressed by 0. I can select only rows where some column value is greater than some other value; here, for example, I want to have in my data frame only those rows where the value of column A exceeds 50. I could also mix that with an AND or an OR operator: I only want those where column A is greater than 500 or column B is smaller than 50. Again, I can say I only want those rows where a column value is not some specific value, and for that we use the tilde, which simply negates what comes after it, so everything that is not "Hello World". There is a really good page where you can read more about pandas filtering, with a lot on indexing and selecting data. It is pretty good, and I highly recommend visiting it.

[Question from the audience.] The question was whether I can apply this command to a stream, or whether I have to load the file before. My understanding is that I have to load the file before, so that might be a good reason to buy a bigger machine. I do not know whether it would work with streams like that; I think everything is loaded into memory, but that is something I am not really sure about.

OK. What is also really handy is that you can match data. You have different data sets: one comes from the data warehouse system, one comes from the securities trading system.
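The selection and filtering patterns listed above can be tried on a small frame. The data and the column names `A`, `B` and `C` are invented for this sketch.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [10, 60, 700, 40],
        "B": [5, 80, 90, 45],
        "C": ["Hello World", "foo", "bar", "Hello World"],
    }
)

two_columns = df[["A", "B"]]       # keep only columns A and B
first_two_rows = df[:2]            # rows 0 and 1; counting starts at 0
a_over_50 = df[df["A"] > 50]       # rows where A exceeds 50

# Combine conditions with | (or) and & (and); each condition needs parentheses.
either = df[(df["A"] > 500) | (df["B"] < 50)]

# The tilde negates the condition that follows it.
not_hello = df[~(df["C"] == "Hello World")]
```

Each expression returns a new data frame and leaves `df` untouched, so the filters can be chained or assigned to new names as needed.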
You have to make sure that everything that is in the trading system was actually received in the data warehouse. OK, how do you do that? You have some reports, and you ask: how do I match those, 200 megabytes each? Here, and I guess it is a little hard to read, pandas supports merging, something that you would normally do in a database; it can be done with pandas just on the command line. Very nice. It supports left outer joins, right outer joins, full outer joins and inner joins, and it is real fun to work with, because it makes life so much easier. The worst way would be to load everything into MySQL, PostgreSQL or whatever you use, do the merging there and write everything out. That is not necessary. It is just one line: I have a data frame here, that is the left data frame, I have the right data frame, I have some key column, and I say the new data frame is pd.merge of the left data set and the right data set, on which key. That's it. The resulting data frame then has all those columns. It saves so much time.

[Question from the audience.] The question was whether the key variable has to have the same name in both data sets. It does not: you have the option to set left_on and right_on to define the keys. That is also pretty awesome.

Another example, one that I had to do a few weeks or months ago: I had a data set where in one column the actual column name was given and in the other column the value. So column A actually meant "column name" and column B "column value", and that was the thing that I needed to somehow merge or transform. Something is not working with this slide. OK, what I came up with was a few lines of Python code. I simply read the Excel file, I create a new data frame for the result, so that is an empty data frame with just the columns A, B and C, then I iterate through the data set that I loaded and use some integer division here, and then I set the row to the corresponding value.
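The matching task described above, checking that every trade also arrived in the data warehouse, might be sketched with `pd.merge` as follows. The frames, the column names and the use of `indicator=True` to list unmatched rows are assumptions added for this illustration; the talk itself only shows a plain merge and mentions `left_on` and `right_on`.

```python
import pandas as pd

# Hypothetical extracts from the two systems.
trading = pd.DataFrame({"trade_id": [1, 2, 3], "amount": [100.0, 250.0, 75.0]})
warehouse = pd.DataFrame({"id": [1, 3, 4], "booked": [100.0, 75.0, 20.0]})

# The key columns have different names, so use left_on / right_on.
inner = pd.merge(trading, warehouse, left_on="trade_id", right_on="id", how="inner")

# A left join keeps every trade; indicator=True adds a "_merge" column
# telling whether a row was found in both frames or only in the left one.
left = pd.merge(
    trading, warehouse, left_on="trade_id", right_on="id", how="left", indicator=True
)
missing = left.loc[left["_merge"] == "left_only", "trade_id"].tolist()
```

Here `missing` lists the trades that never reached the warehouse. Changing `how` to `"right"` or `"outer"` gives the other join types mentioned in the talk.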
I guess we need to change the resolution, because there will be more slides with more content, and it depends on the amount of content on the slide.

Another example, not from my banking job but from my duties as treasurer: we are a registered association, and in Germany donations to us are tax-deductible. So at the end of the year every member wants to have the sheet which says that he has donated 200 EUR to us, so that he is able to put this on his tax declaration. If you do it manually it is really error-prone and takes a lot of time, because we have about 100 members. Well, you can do that manually, but it is a horrible job. What I did last year was to use a complicated mix of Python, MySQL and LaTeX, of course. I had loaded all the data into MySQL, then I used Python commands to query that for each member, et cetera. Well, it took a lot of time, but it worked. This year I knew pandas, and so I said: OK, let's do everything in pandas. It was way easier: I simply loaded the data into memory, merged, filtered, selected and so on. What I came up with might look a bit complicated, but that is the code which does everything. I read the addresses, I define some functions to give out cardinal numbers, so written-out numbers, and I use a lot of pandas code. I take the master data for each member, because I have to print the address there, I look at the bookings, filter everything out that does not fit, and at the end what comes out is a TeX file which I can simply compile with LaTeX.

Let me just show you. If we make this a bit bigger: yes, so this is what comes out of the script. Of course, if you need two minutes per sheet and have to do that a hundred times manually, it is a horrible thing; here it takes, I guess, about five minutes of compiling everything to a PDF, and then you are done. The initial effort was of course a few days. Yeah, just another
example from this treasurer job: I have to check the payments, so whether member XYZ has paid his dues. If you do it manually you have to do it in Excel, and you have to scroll up and down until you get what you want. So I said: why not use pandas? I merged the payment data with the member master data. I also created a blog entry about that, if you want to have a look at it. That makes things much easier, because in pandas you can also pivot data, like in Excel, where you have pivot tables. I can say: for this member, get all the payments in one row, and I know when he has paid, I know what he has paid, and so on.

OK, questions so far, before we start with SciPy? Yes?

[Question from the audience.] Sorry, the question was whether pandas has built-in plotting capabilities. I would say no, it does not, because pandas just takes care of the data handling. It is best to use matplotlib; that is the library for plotting things. If you have a Python distribution with everything included, you simply say "I want to plot this", and matplotlib does the trick. I will show some examples in a moment.

[Question from the audience.] The question was whether there is a module in Python which is somehow similar to ggplot. That is where I can say yes, I think so. If you do not know ggplot, that is the way of doing graphics in R.

[Comment from the audience.] That is interesting. The comment was that there is a project for Python which also tries to implement the grammar of graphics.
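The payment check described above, one row per member with all payments side by side, can be sketched with a pivot table. The member names, months and amounts are invented; the real bookings and master data are not part of the talk.

```python
import pandas as pd

# Hypothetical bookings: one line per payment.
payments = pd.DataFrame(
    {
        "member": ["Anna", "Anna", "Ben", "Ben", "Ben"],
        "month": ["2016-01", "2016-02", "2016-01", "2016-02", "2016-03"],
        "amount": [20.0, 20.0, 15.0, 15.0, 15.0],
    }
)

# One row per member, one column per month, summed amounts in the cells.
overview = payments.pivot_table(
    index="member", columns="month", values="amount", aggfunc="sum"
)

# Months without a payment show up as NaN, which makes gaps easy to spot.
unpaid = overview.isna()
```

In this sketch `overview.loc["Anna", "2016-03"]` is empty, so a missing payment is visible at a glance instead of after scrolling through the raw bookings.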
The Grammar of Graphics is a book, by the way, which is just theoretical stuff about graphics. I know it from a conference, and I had a look at The Grammar of Graphics, but it was way beyond my intellect. I still think that this could be pretty interesting.

What I want to show you in the second part is how a real data analysis can be done in Python. I have selected one data set because I know it from my lectures on multivariate statistics. It is from a book by Flury and Riedwyl from 1988, and it is also used in the book by my PhD supervisor, "Applied Multivariate Statistical Analysis". It is a data set of counterfeit and genuine old Swiss banknotes. The data set has seven columns: the length of the note, the width of the left edge, the width of the right edge, the bottom margin, the top margin, the length of the diagonal, and the status: is it a genuine banknote or a fake one. Let's have a look at some graphics. Here we
have it. Sometimes we have the graphic and sometimes not. That is the visualization of the variables, and I guess that is what the beamer does not like. But OK, I guess there is no fixed frequency, there is some stochastic process behind it.

What we need to do is, sorry for that: we import pandas, we import NumPy, and we import seaborn as sns; that is the graphics library which is built on top of matplotlib. Just give me a second now, I will try to reduce the resolution. [Pause while the display settings are changed.] OK, sorry for that.

When I have the data set loaded, and it is just 200 rows, so it is not that much, what I want to do first when I get new data is to get an overview of what the data is like. I want to get a summary, and pandas provides a command for this: describe. It gives me a summary along the lines of the five-number summary: the count, the mean, the standard deviation, the minimum, the maximum, and the 25 percent, 50 percent and 75 percent quantiles. So for these four variables that we selected, that is the summary. When you attend a lecture on statistics, that is what you will likely encounter.

I can also create a graphical representation of this five-number summary, the so-called boxplot. For that I use the seaborn library, which has a boxplot command, and I simply say: on the x axis I want the status, genuine or counterfeit, and on the y axis I want the diagonal. If we have a look at this, and also at the scatterplot matrix which we will see next, we will see that this variable holds some insight about what is genuine and what is counterfeit, so that is a useful variable.

And that is the scatterplot matrix: I simply plot the points of one variable against some other variable and get some point clouds. What I also do here is use the colour, the hue, to distinguish between genuine and counterfeit. Let me just check: blue is genuine and green is counterfeit. So these are the variables plotted against each other, with the distributions on the diagonal, and we can also see some point clouds that hold some insight.
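The first look at the data described above can be reproduced on a tiny stand-in table. The six rows below are made-up values in the style of the banknote data, not the real 200 observations, and the column names `Length`, `Diagonal` and `Status` are assumptions.

```python
import pandas as pd

# Made-up rows in the style of the Swiss banknote data.
notes = pd.DataFrame(
    {
        "Length": [214.8, 215.0, 215.1, 214.7, 214.9, 215.0],
        "Diagonal": [141.0, 141.7, 141.5, 139.4, 139.8, 139.6],
        "Status": ["genuine"] * 3 + ["counterfeit"] * 3,
    }
)

# count, mean, std, min, quartiles and max for each numeric column
summary = notes[["Length", "Diagonal"]].describe()

# The same summary of the diagonal, separately for each status
by_status = notes.groupby("Status")["Diagonal"].describe()

# With seaborn installed, the boxplot from the talk would be along the lines of:
# import seaborn as sns
# sns.boxplot(x="Status", y="Diagonal", data=notes)
```

Splitting the summary by status already hints at what the boxplot shows graphically: in this sketch the two groups do not overlap in the diagonal at all.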
OK, after that let's come to the final example: let's do some cluster analysis with the data, to see whether, if we did not know the status, we could find some groups in the data. There are hundreds of algorithms for cluster analysis; you can spend a lot of time and a lot of books just learning about this. What I want to do here is some simple k-means clustering. It is rather simple to explain. k-means means: I define a number k, the number of clusters which I finally want to have. We are talking here about banknotes, genuine and counterfeit, so I would expect to get two groups, one with counterfeit banknotes and one with genuine banknotes.

The algorithm works as follows; this example here is from Wikipedia, for three clusters. Given we want three groups, we would simply select three of the observations randomly as cluster centres. Then we would calculate the distance from each point to these cluster centres and assign each point to the cluster whose centre is nearest. From the points assigned to a cluster we calculate a new cluster centre, and for all observations I do the same thing again. After some iterations of checking the distances again, we get to a stable situation where the clusters are not changing any more.

What I can do is the same with Python, because Python has its own library for this in SciPy. I simply import the kmeans and vq functions. So again I load the data. What I do is: I restrict the data frame to two columns, just length and diagonal, and I convert everything into a numeric array, because kmeans wants a numeric array.
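The talk uses the `kmeans` and `vq` functions from SciPy for this step. To show what happens inside, the sketch below instead spells out the algorithm as described above with NumPy only; it is a simplified illustration, not SciPy's implementation. The function name `kmeans_sketch`, its parameters and the ten data points are hypothetical.

```python
import numpy as np

def assign(points, centroids):
    """Return, for every point, the index of its nearest centroid."""
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans_sketch(points, k, n_iter=50, seed=0):
    """Plain k-means on a 2-D float array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations at random as initial centres.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centre.
        labels = assign(points, centroids)
        # Step 3: move each centre to the mean of its assigned points
        # (an empty cluster keeps its old centre).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # stable situation: the centres no longer move
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Made-up (length, diagonal) pairs: five "genuine", then five "counterfeit".
data = np.array([
    [214.8, 141.2], [215.0, 141.7], [215.1, 141.5], [214.9, 141.9], [215.2, 141.3],
    [214.7, 139.4], [214.9, 139.8], [215.0, 139.6], [214.6, 139.2], [214.8, 139.9],
])
centroids, labels = kmeans_sketch(data, k=2)
```

The cluster numbers themselves are arbitrary; what matters is which observations end up together. In the SciPy version, `kmeans` plays the role of the loop that finds the centres and `vq` the role of `assign`.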
Then I compute the centroids, that is the centres of the clusters, with the k-means algorithm, and with the help of the vq function I assign each data point to one of the centroids. Finally I create another scatterplot here, which I then also save as a PDF, so what you see in my slides is directly the PDF from the code. What I get is a scatterplot matrix again. I was not able to take out the assignment variable yesterday; that is basically something which I do not want in the plot. What I am interested in here are the two clusters. These are not the assignments based on the actual status, genuine or counterfeit, but the result of the cluster analysis that we have performed. We could dig deeper into that, and we would find this one observation that we had; let me just show you this one. If you see here the small dot: that is actually a genuine banknote, but when we compare the distributions it falls into the counterfeit data. If we look at the results from the cluster analysis, this would be one of the observations which would be classified wrongly by k-means. We could simply go on, but let me just come to the conclusion.

OK: Python with pandas inside has proved to be a really valuable tool, not just for my scientific work but also for the daily work that I need to perform in my department. It greatly simplifies my life every day. I have written hundreds of lines of code just programming with pandas, and it is fun every time. I can only recommend checking it out. If you want to know more, or have some questions about what I have shown today, just visit me; we sit in the Mensa, we have a lot about LaTeX and can also show you some things, so just come by.

OK, any questions so far? No. Some literature recommendations: there is the Data Science StackExchange. What I normally do is simply google it and find something on StackExchange, but there are also some good books, like "Learning pandas", "Mastering pandas for Finance", "Python for Data Analysis" and so on.