Using Python pandas for Scientific Research
Formal Metadata
Title: Using Python pandas for Scientific Research
Series: FrOSCon 2016 (talk 60 of 84)
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/32424
Transcript: English (auto-generated)
00:07
So my first question, and the most important question is, who does not speak German? So is there, okay, I see a few hands raised. So that means I will continue in English. The slides are in English anyway.
00:20
The paper is in English anyway, so it won't matter. Okay, welcome to the presentation using Python for scientific research. Originally I named this talk using pandas for scientific research. But when I prepared this talk, I quickly noticed, well, there isn't so much where I can write pages and pages about, just about pandas.
00:43
So I extended that a little, but I hope you will like it anyway. So what I want to do is I want to introduce you to Python and SciPy, the scientific Python package. I want to discuss a little bit the data handling with pandas, since that is more or less part of my daily job.
01:01
And then I want to do a brief data analysis with a Swiss banknote data set. You will find the slides online, so it's not really necessary to write everything down. I can give you the link afterwards, and you will also find the slides on my blog, www.uweziegenhagen.de.
01:21
That will probably be the case this evening or tomorrow evening. Okay, if there are any questions, just raise them during the talk as well. If it's too complicated, then please come to me later. I'm sitting at the DANTE e.V. booth in the Mensa.
01:40
Okay, just a few words about me. I was already introduced. By training I'm a Diplom-Kaufmann, so I studied business administration a long time ago at Humboldt University. I also did my PhD there, in computational statistics. After that, I worked for the bank Sal. Oppenheim in Köln, and then in the private equity division.
02:01
So you could say something like "Heuschrecken" (locusts), but it's not that bad. And since October 2015, I have been an analyst in the credit and treasury department of a Düsseldorf-based bank. So what do we do? Well, we have a big securities sales system.
02:21
So I'm responsible for making sure that all the components of the system are working properly. And it's a really big Java-based system, so it will take some time to really get into the details. Besides that, of course, I'm a LaTeX enthusiast, so these slides were, of course, made in LaTeX.
02:42
If you want to know more about LaTeX, just come and visit us. And besides that, I'm the treasurer of Dingfabrik Köln e.V. That's a fab lab and maker space. So if you happen to have a big room in Cologne, please talk to me. We are looking for new rooms. Okay, so far.
03:00
So who of you does not know Python? Are there any Python novices here? Okay, I guess you have heard of it. Okay, let me just say it was started in the late 1980s by Guido van Rossum in the Netherlands. So that's a programming language that does not originally come from the US. And it's pretty readable.
03:23
It's understandable when you read Python code. And there's a rich standard library, so whatever you want to do, there's a high chance that you can start with basic Python. And my introduction to Python came when I had to use a download script for Save.TV.
03:40
Save.TV is some kind of online TV recorder based in Germany. And they have a web interface through which you have to download whatever you have recorded. And after 20 or 30 videos that you had to download manually, it gets pretty tiresome. And I looked for a way to get it done automatically.
04:01
And I found this little script that was written in Python. And I had a look at it, and it was readable. I understood it from the first moment. And then I said, wow, Python might not be so bad. Okay, this weird indentation, no braces and brackets. That's a little bit strange, but let's have a look.
04:20
And yeah, since then I have stuck with Python. And now I use it for everyday work like system administration or sending out emails, whatever comes along. Here's a basic hello world example. Yeah, since you already know Python, you won't see anything new in this.
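(The hello-world code on the slide is not captured in the transcript; a minimal sketch in the same spirit, with a made-up name variable, could look like this:)

# Classic hello world, plus a small variation using a variable
name = "FrOSCon"
print("Hello, world!")
print("Hello, " + name + "!")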
04:45
Let's briefly skip that. If you didn't know Python and you had a look at this, you would probably understand it anyway. It's pretty readable and understandable. Okay. My introduction to scientific Python and pandas came through a colleague of mine who left our employer.
05:06
He had written a pretty complicated system which, with the help of Python and pandas, was merging data sets with each other, and everything was stored in a big Access database. So, more or less from one minute to the next,
05:22
I was the person mainly responsible for continuing and maintaining the software package. I hadn't known about pandas before, so that was pretty interesting. And I came across it and came to like it pretty quickly.
05:41
What is pandas? Well, Wikipedia, or rather the pandas web page, says: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It was initially developed by Wes McKinney at AQR Capital.
06:03
That's a big quantitative hedge fund manager in the US. And they did high-performance quantitative analysis with it. So they had data, they needed to merge data, they had to aggregate and so on and so on. And he wrote this library for Python and he was able to convince his bosses to make it open source.
06:24
And that was pretty awesome. The important parts are implemented in C or Cython, so it's quite fast. The current version is 0.18.1, and I can definitely recommend you have a look at it if you have to do some data management or data analysis and you would rather not use R or SPSS.
06:47
Pandas is part of a larger package, the so-called SciPy framework. Besides pandas, there are a few more tools in SciPy. The first is NumPy, the basic library that does all the matrix handling
07:02
or vector handling together with the algorithms like vector manipulation, transformation, etc. IPython, that's pretty awesome. I can't show it here today since I do not do a live presentation. But if you have ever worked with Mathematica or Matlab,
07:22
IPython presents a similar way of working with the input files. It's pretty cool. Besides that, I use Spyder, which is a Python-based editor, and it allows you to conveniently work with the data and the files and the source code.
07:41
Maybe in the end I can show that briefly. Matplotlib is for scientific plotting. That's also the basis for the plotting library that I use today. If you know something like Matlab, then it will be easy to work with. Then there's SymPy, that's for symbolic mathematics.
08:03
If you know Mathematica, that's something similar. There are many more packages that are somehow part of the SciPy framework. So there's a good chance that whatever you do in your daily research, there is a corresponding Python package or a SciPy package for it.
08:20
So just have a look at it. If you want to use SciPy, there are several options for how you can work with it. First of all, if you work with Linux or Mac OS, and Python is already part of your system installation,
08:43
you can install the necessary packages manually, which might be a bit tricky since there are lots of dependencies. So what I do recommend is to use a dedicated Python distribution for scientific Python. That's WinPython, just for Windows, which I have used until recently.
09:03
It was working pretty well. And then I ended up with Anaconda, because that's also available for Linux and for Mac OS, and I have the same look and feel on all the platforms, which is, from time to time, pretty helpful. What Anaconda Navigator presents you with is
09:22
this interface where you can select what you want to start, like Jupyter, that's this IPython notebook interface, or here the IPython console, Spyder, and some other tools which I haven't used so far because I didn't need them. So it's definitely worth having a look at.
09:42
Okay, what I still want to show you is, after this introduction that we just had, data handling with Pandas, like loading data, transforming and filtering, and analyzing a real dataset, the so-called Swiss banknote data. Okay, let's have a look. What Pandas actually does for Python is to provide data structures.
10:07
So who of you has ever worked with R? Okay, I see a couple of hands. R also has the concept of a data frame, some kind of mixed structure, like a two-dimensional array with different data types.
10:22
And what Pandas does is it provides the same architecture for Python. There's another data type which is pretty important, that's the so-called series. So what is a series? Let me just check. Do we have a laser pointer or a pen or something like that?
10:41
Laser pointer? Okay, then let's do it without. A series is some kind of a vector. It has one data type and it has an index. So the different elements in the vector are addressable via the index. And the index can be, here it is numeric from zero to whatever.
11:03
It can also be string-based and so on and so on. So the data frame extends the series by one dimension. So you do not have just one vector column, but you have an array here. And all those columns share the same index. So you can have mixed data types, like this one here is float, this one is string.
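(A small sketch of these two data structures; the values and column names are made up, not those from the slide:)

import pandas as pd

# A Series: a one-dimensional vector with a single dtype and an index
s = pd.Series([1.5, 2.0, 3.25])            # default numeric index 0, 1, 2
t = pd.Series([10, 20], index=["x", "y"])  # the index can also be string-based

# A DataFrame: several columns sharing one index, possibly with mixed dtypes
df = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],   # float column
    "label": ["a", "b", "c"],    # string column
})
print(df.dtypes)                 # price: float64, label: object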
11:25
We can have more strings, more numbers, integers, whatever, or time series, or even date objects. If you have any questions, just ask. Okay, how can I create such a data frame or a series?
11:43
Well, I have different options. For me, the most important thing is that I usually load the data frames from the hard disk. But you can also create them manually. Like here, for example, I create one pandas series with just a few elements, then another series with just some strings.
12:03
And then I simply concatenate them, and I get a data frame. Alternatively, I could say I want to create a data frame directly here with a dictionary-like structure in braces. That's the A, that's the B, and the content of each.
12:20
So this is the vector, and this is the column index. I can also say... okay, we had a few problems with the beamer, I hope that will not happen again. I can also say, oh, let's take the series, make it into a data frame, and then join it with the other series, which I also created as a data frame.
12:44
And finally, I could build a dictionary from those two vectors and create a data frame from that as well. Sometimes it's a little bit confusing what the best way to do it is, but, well, there are different options. In general, also when you work with data filtering, you will find different ways of doing things in Python.
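(A sketch of the four ways just described; the column names A and B are assumptions:)

import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series(["a", "b", "c"])

# 1) Concatenate the two Series column-wise
df1 = pd.concat([s1, s2], axis=1)
df1.columns = ["A", "B"]

# 2) Create the DataFrame directly from a dict-like structure
df2 = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})

# 3) Turn one Series into a DataFrame and join the other one
df3 = s1.to_frame("A").join(s2.to_frame("B"))

# 4) Build a dictionary from the two vectors and construct the DataFrame from it
df4 = pd.DataFrame({"A": s1, "B": s2})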
13:12
Okay. As mentioned, what I usually do is I read data objects from the hard disk. And there are different options that you can use here, like pickle.
13:23
I don't know if you ever heard about pickle. Who does not know pickle? Okay. I just raised my hand a little. Pickle is a way of serializing things. Python is also an object-oriented programming language, and you can have objects in memory. And if you want to store those objects in memory to a database or to some file, you need to flatten them.
13:48
They need to fit into a file, and pickle is one way of doing that. Then read_table, that's a command for general table-like formats, more or less a generalization of read_csv for comma-separated values.
14:04
Then we have read_fwf for fixed-width formats. I've never come across such a fixed-width format myself, but I know that, especially in finance, there are some formats where you really have these fixed widths.
14:20
So, the first 50 characters are the name, the next 50 characters are somehow the amount, or whatever. And it can be really funny to work with this. And we have read_clipboard, which reads data directly from the clipboard. So, you can copy with Ctrl-C in Excel,
14:40
then you say read_clipboard in Python, and it reads the clipboard. And the thing that I usually use in my daily work is read_excel, because we have lots of Excel files, and also big Excel files, that we need to work with. The maximum for me, I guess, was about 200 or 300 megabytes that I had to read and work with.
15:05
And, well, if you have to read 300 megabytes, even Python is a bit slow. But, well, that's the natural way. There are also other commands for HTML, JSON, HDF5, and, I guess, some more formats that I've never used. Yeah, worth having a look at.
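(A rough sketch of the readers just listed; the file names are invented:)

import pandas as pd

df = pd.read_pickle("data.pkl")    # pickled (serialized) Python objects
df = pd.read_table("data.txt")     # general delimited, table-like text
df = pd.read_csv("data.csv")       # comma-separated values
df = pd.read_fwf("data.txt")       # fixed-width format
df = pd.read_clipboard()           # whatever was copied with Ctrl-C, e.g. from Excel
df = pd.read_excel("data.xlsx")    # Excel workbooks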
15:20
I've never used it. Yeah, worth having a look at. And what I should mention is that these are the read commands for write commands. For writing data, there are corresponding write commands as well. Okay. So, that is one Pandas example where I really like Pandas and Python.
15:44
We have a proprietary piece of software that uses a weird date format, like this "14 Mar 1983", and it gives me a CSV file. And I want to open that. And if you open that in Excel, Excel understands it sometimes.
16:01
Not all the time, because something like "Jun" or "June" does not get interpreted, and December is not working either. Yeah, it doesn't work. And if you have hundreds of lines or thousands of lines, you cannot fix that manually. What I simply need to do is: I take the CSV file,
16:20
I transform the evil dates. And I save the data in Excel format. So, here's the necessary code. I simply load the Pandas library. I read the CSV file under the assumption that there are no things that I need to adjust. Like the column separator or the decimal separator, et cetera, et cetera.
16:43
I have this date column. That's the column with the dates inside. And I simply say this date column should be the same column but converted to a datetime object. And then I simply say, put everything to Excel, and here we go.
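(The four lines could look roughly like this; the file and column names are assumptions, not the original code:)

import pandas as pd

df = pd.read_csv("export.csv")             # CSV with dates such as "14 Mar 1983"
df["date"] = pd.to_datetime(df["date"])    # convert the date column to datetime objects
df.to_excel("export.xlsx", index=False)    # write everything out as an Excel file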
17:03
So, with just four lines, I can save a lot of time. Because if you have to do that manually or with some Visual Basic for Applications code, well, it's not a nice job. This is a pretty cool way. Okay, learning from the example, what have we just seen?
17:22
We load the pandas library. We load the data in CSV format. We convert it to the Python datetime object; you will also see this on the next slides. And we save the data in Excel format again. That's it. And if you have a look at the code, I mean, even without knowing Python, it's pretty likely that you understand what it actually means.
17:48
That also makes it easier to adjust code that you get from somewhere else, because you can simply say: okay, that's the line where it converts the object to datetime; let's adjust this and see what comes out.
18:03
Okay. What is really, really cool in my daily job is the way that pandas allows me to select and filter data. Just a few weeks ago, my task was to generate a report about some jobs in our bank.
18:24
We sell securities and options and so on, and my bosses simply wanted to get a list of how much was sold. And then I thought, yeah, let's use pandas for that, because it makes it easier. It always takes a lot of time when you have to do that manually.
18:43
And I used a lot of this Python selection and filtering. Because if, for example, I have a really big data frame, like maybe say 200 columns, and I simply need five or ten, I can tell Pandas what are the columns that I'm interested in.
19:00
Yeah, in this case, it's just column A and column B. So I say my new data frame is simply the old data frame, but with a selection of only those two columns. Okay. Then I could say, oh, okay, I only want the first two rows. There we have the Pythonic way of slicing, because I'm simply saying everything up to, but not including, row two.
19:25
Just keep in mind that the very first row in the file is addressed by zero. I can select only rows where some column value is greater than some other value. Like here, for example, I want to have in my data frame only those elements where column A exceeds 50.
19:48
Okay. I could also combine that with an or operator: I only want those rows where column A is greater than 500 or column A is smaller than 50.
20:05
Okay. I can say I only want those rows where the column value is not some specific value. For that we have this tilde, which simply negates what comes after it.
20:22
So everything that is not "hello world" will be in the data frame. There's a very good site where you can see more about this kind of filtering in pandas: that's the page by Chris Albon. He does a lot of indexing and selecting there. It's pretty good to see. Highly recommended to visit that.
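(A small sketch of the selection and filtering idioms just described; the data frame and values are made up:)

import pandas as pd

df = pd.DataFrame({"A": [10, 60, 600], "B": ["hello world", "foo", "bar"]})

subset = df[["A", "B"]]                        # keep only the columns of interest
first  = df[:2]                                # the first two rows (row 0 and row 1)
big    = df[df["A"] > 50]                      # rows where column A exceeds 50
either = df[(df["A"] > 500) | (df["A"] < 50)]  # combine conditions with | (or)
not_hw = df[~(df["B"] == "hello world")]       # ~ negates the condition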
20:47
Okay. Yes?
21:03
I think I need to load the whole thing. I'm not really sure. The question was whether I can apply this command to a stream or whether I have to load the whole file first. My understanding is that I have to load the file first.
21:20
So that might be a good reason to tell your boss you need a bigger machine. I don't know whether it would work with streams that you pipe into it. I'm not sure. I think everything is loaded. But that's something I'm not really sure of. Okay. What is also very handy is that you can merge data.
21:46
You have different data sets. Maybe one comes from the data warehouse system. One comes from the securities trading system. And you have to make sure that everything that is in the trading system was actually received in the data warehouse. Okay. How do you do that?
22:01
You have some reports and you say, okay, how do I match those 200 megabytes each? What pandas supports here, and I guess it's a little bit hard to read, is merging. Something that you would normally do in a database can be done with pandas just on the command line.
22:21
Very nice. It supports left joins, left outer joins, right outer joins, full outer joins, and inner joins. And it's real fun to work with that, because it makes life so much easier. The worst way would be that you have to load everything into MySQL, into SQL Server, or whatever you use.
22:42
And you do your merging there and write everything out. That's not necessary. It's just one line. It's simply this: I have a data frame here, that's the left data frame. I have the right data frame. I have some key columns. And I say: data frame, here it is, pd.merge, left data set, right data set, on which key?
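(A sketch of such a merge; the column names and values are invented:)

import pandas as pd

left  = pd.DataFrame({"key": [1, 2, 3], "traded":   [100, 200, 300]})
right = pd.DataFrame({"key": [2, 3, 4], "received": [200, 300, 400]})

# Join on the shared key column; how= can be "left", "right", "outer" or "inner"
merged = pd.merge(left, right, on="key", how="inner")

# If the key columns have different names, use left_on / right_on instead:
# merged = pd.merge(left, right, left_on="trade_id", right_on="dwh_id")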
23:03
That's it. The resulting data frame then has all those columns inside. Awesome. Saved so much time. Yes? No. The question was whether the key variable has to have the same name in both data sets.
23:21
No, it does not. You have the option to set left_on and right_on to define the keys. So it's pretty awesome. Okay. Here's an example that I had to do a few weeks or months ago.
23:40
I had a data set where one column gave the name of the actual target column and another column held the value, so a cell saying "column A" actually meant column A, and the same for column B. The thing is, I needed to somehow merge or transform that. Then I came up with some... sorry, something is not working with the slide.
24:06
Okay, let's see. Then I came up with a few lines of Python code. I simply read the Excel. I create a new data frame for the result. So that's an empty data frame with just the columns A, B, and C.
24:21
Then I iterate through the data set that I have loaded and use some integer division here, and then I set the corresponding row to the value. I guess we need to change the resolution here. Okay. Yeah, because there will be more slides with more content, and I guess it depends on the amount of content on the slide.
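(The reshaping loop might have looked roughly like this; the file name, the column names, and the group size of three are pure assumptions:)

import pandas as pd

src = pd.read_excel("input.xlsx")               # assumed columns: "colname" and "value"
result = pd.DataFrame(columns=["A", "B", "C"])  # empty target frame with the wanted columns

for i, row in src.iterrows():
    # i // 3 (integer division) maps every three source rows onto one target row
    result.loc[i // 3, row["colname"]] = row["value"]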
24:49
Another example, not from my business job, but from my duties as treasurer of Dingfabrik. We are in Germany; donations to Dingfabrik are tax deductible.
25:01
So at the end of the year, every one of our members wants to have a sheet which says, hey, Uwe has donated 200 euros to Dingfabrik, and he's able to put this on his tax declaration. If you do that manually, it's really error prone, and it takes a lot of time, because we have about 100 members and you cannot do that by hand.
25:24
It's a horrible job. And what I did last year was I used a complicated mix of Python, MySQL, and LaTeX, of course. I wouldn't use Word for that. I had loaded all the data into MySQL. Then I used Python commands to query that for each member, et cetera, et cetera.
25:43
Well, it took a lot of time, but it worked. And this year, I knew Pandas. And so I said, okay, let's do everything in Pandas. It was way easier. I simply loaded the data into memory. I merged around, I filtered, I selected, and so on and so on.
26:00
And what I came up with... let's see if the wireless is working here. It might look a bit complicated, but well, yeah, that's the code which does everything. I read the addresses. I define some functions to output cardinal numbers, that is, numbers written out as words.
26:23
I use a lot of pandas code. I take the master data for each member, because I have to print the address there, and I load the bookings. I filter out everything that does not fit. And at the end, what comes out is a TeX file which I can simply compile.
26:43
Let me just see. I have some here. Dropbox.
27:03
Okay, let's use this one here. We make this a bit bigger. So that's what comes out of the script. Of course, if you have to do that a hundred times manually, it's a horrible thing.
27:24
And here it takes, I guess, about five minutes to compile everything to a PDF, and then you're done. The initial effort was, of course, a few days, but well, who cares? Okay. Yeah, just another example from the treasurer job.
27:43
I have to check the payments. So, did member XYZ pay his dues? If you do that manually, you have to do it in Excel, and you have to move the mouse a lot until you get what you want. And I said, oh, why not process the payment data with pandas?
28:01
I can merge them with the master data. I also wrote a blog entry about that, if you want to have a look at it. It makes things much easier, because in pandas you can also pivot data, like in Excel. If you have used pivot tables before, it makes things much smoother.
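(A small, made-up example of such a pivot; the real member data is of course not shown:)

import pandas as pd

payments = pd.DataFrame({
    "member": ["Uwe", "Uwe", "Anna"],
    "year":   [2015, 2016, 2016],
    "amount": [200.0, 150.0, 100.0],
})

# One aggregated number per member, like an Excel pivot table
per_member = pd.pivot_table(payments, index="member", values="amount", aggfunc="sum")
print(per_member)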
28:20
So I can say, oh, for this member, I get all the payments in one number. I know when he has paid, I know what he has paid, and so on and so on. And I can really dig through the data. That's pretty cool. Okay. Questions so far before we start with the second part?
28:41
Yes? Pandas does not... yes, sorry. The question was whether pandas has built-in plotting capabilities. And that's where I said no, it does not, because pandas just takes care of the data handling.
29:07
It does use matplotlib. That's the library for plotting things. And if you have a Python distribution with everything included, then you simply say, okay, I want to plot this, and matplotlib then does the trick.
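(For example, something along these lines; the data frame is made up, not from the talk:)

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": range(10), "y": [v ** 2 for v in range(10)]})
df.plot(x="x", y="y")   # pandas hands the actual drawing over to matplotlib
plt.show()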
29:21
I'll show some examples later. The next question was whether there's a module in Python which is somehow similar to ggplot in R. And there I can say, yes, I think so, because I guess matplotlib is also fundamentally inspired by ggplot.
29:47
If you do not know ggplot, it's the way of doing graphics in R. I know the guy who wrote it. And I also know the book it is based on, which I have no clue about, because that's really, really weird stuff.
30:03
Okay. Yes. Oh, that's interesting.
30:24
So the comment was that there was a project for Python which also tries to implement the grammar of graphics. The grammar of graphics is the book by Leland Wilkinson, I guess, which does this theoretical stuff about graphics. So I knew this guy from a conference, and I had a look at the grammar of graphics, but that was way beyond my intellect.
30:46
That is really... Altair? Altair. Thank you. That could be pretty interesting. Okay, what I want to show you in the second part
31:02
is how real data analysis can be done in Python. I have selected one data set that I know from my lectures on multivariate statistics. It's from a book by Flury and Riedwyl from 1988, which is also used in a book by my PhD advisor.
31:25
That's multivariate statistical analysis. And it's a data set about counterfeit, faked, and about genuine, real, Swiss banknotes. The data set has seven columns: the length of the bill, the width of the left edge, the width of the right edge, the bottom margin, the top margin, the length of the diagonal, and the status.
31:44
Is it a genuine banknote, or is it a fake one? Let's have a look at a graphic. Here we have to... we sometimes lose the graphics. Maybe we should blink at the same frequency, then the graphic is always there.
32:03
Okay, that's just a visualization of the variables, and I guess that's what the beamer doesn't like. Okay, I guess there's no real frequency; there's some stochastic process behind it.
32:21
Okay, what we need to do is... sorry for that. We import pandas, we import numpy. Okay, I guess for this I actually don't need it. And I import seaborn as sns. That's this graphics library which is built on top of...
32:40
matplotlib. Just give us a second. I will try to reduce the resolution; that might give us a technical advantage. So, Windows 10, I love you.
33:05
Okay. Let's try 10 something. If that is not working, I don't know.
33:31
Okay, just... Sorry for that. Okay, what I have now is I have the data set loaded. Well, it's just 200 rows, it's not so much.
33:40
And what I usually want to do when I first get some data set is get an overview: what is the data like, and can I make sure that I loaded it correctly? This is really annoying. I want to get a summary, and pandas provides a command for that.
34:03
That is describe. It gives me a five-number summary: the count, the mean, the standard deviation, the minimum and maximum, and the 25%, 50%, and 75% quartiles. So, that's for these four variables that I have selected.
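(In code this is a single call; the file and column names for the banknote data are assumptions:)

import pandas as pd

df = pd.read_csv("banknote.csv")   # Swiss banknote data, assumed file name
print(df[["length", "left", "bottom", "diagonal"]].describe())
# count, mean, std, min, 25%, 50%, 75% and max for each selected column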
34:27
That's the five number summary. When you have a lecture on statistics, that's what you will likely encounter. Okay. I can also create a graphical representation of this five number summary.
34:41
That's a so-called box plot. So, that's where I apply the Seaborn library, which has a box plot command. And I simply say, well, on the x-axis, I want to have the status. It's genuine or counterfeit. And on the y-axis, I want to have the diagonal.
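(A sketch of that boxplot call, again with assumed file and column names:)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("banknote.csv")                # assumed file name
sns.boxplot(x="status", y="diagonal", data=df)  # genuine vs. counterfeit against the diagonal
plt.show()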
35:03
So, if we have a look at this diagonal, also in the scatter plot matrix, which we will see next, we will see that this variable holds some insight about what is genuine, what is counterfeit. So, that's a useful variable. That's the scatter plot matrix. I simply plot those points or this variable against some other variable, and I get some point clouds.
35:26
And what I have here is that I also use color to indicate whether it's genuine or counterfeit. So, let me just check: blue means genuine and green means counterfeit.
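(The colored scatter plot matrix could be produced roughly like this; the column names are assumed:)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("banknote.csv")  # assumed file name
sns.pairplot(df, hue="status")    # every variable against every other, colored by status
plt.show()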
35:42
So, these are the counterfeit observations, plotted to the left of the diagonal. Here we can also see some point clouds. So, a good insight. Okay. After that, let's come to the final example. Let's do some cluster analysis with the data to see whether, if we didn't know the status,
36:06
we could find some groups in the data. There are hundreds of algorithms for cluster analysis; you can spend a lot of time and read a lot of books just learning about this. What I want to do here is some simple k-means clustering. It's rather simple to explain.
36:23
K-means means: I define a number k, the number of clusters which I finally want to have. And we're talking here about banknotes, genuine and counterfeit, so I would expect to get two groups, one with counterfeit banknotes, one with genuine banknotes.
36:42
Okay. Then the algorithm works as follows. This is from Wikipedia, for three clusters. Let me just explain it with this example. Given that we had three groups, we would simply select three observations randomly.
37:01
Then we would calculate the distances from each point to these cluster centers. And where the distance is minimal, we say, okay, that's the cluster the point should belong to. Like, for this cluster center, we take this point here, calculate the distance, and agree in the end, oh, this is the cluster it belongs to.
37:26
And for the other observations, I do the same. And after some iterations of recomputing the centers and checking the distances again, we reach a stable situation where the cluster centers are not changing anymore.
37:41
It will work. And I can do the same with Python, because SciPy has its own library for that, scipy.cluster. I simply import the kmeans and the vq functions. This is really horrible. You need to fix this.
38:02
So I again load the data. It doesn't really make sense. Sorry for that, folks. Download the slides from my home page or visit me in my booth.
38:21
What I simply do is restrict the data frame to just the columns length and diagonal. I convert everything into a numeric array, because k-means wants a numeric array. And then I compute the centroids,
38:41
that is, the centers of the clusters, with the k-means algorithm. And with the help of the vq function, I assign each data point to one of these centroids. And what finally comes out is another scatter plot, which I then also save as a PDF file.
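(A hedged sketch of that clustering step with scipy.cluster.vq; the file and column names are assumptions:)

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq

df = pd.read_csv("banknote.csv")                         # assumed file name
data = df[["length", "diagonal"]].values.astype(float)   # two columns as a numeric array

centroids, _ = kmeans(data, 2)   # compute two cluster centers
labels, _ = vq(data, centroids)  # assign each observation to its nearest centroid

plt.scatter(data[:, 0], data[:, 1], c=labels)                      # points colored by cluster
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=100)   # the two centroids
plt.savefig("clusters.pdf")                                        # save the figure as a PDF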
39:03
So what you see in my slides is directly the PDF produced by the code. And what I get is, so far so good, a scatter plot matrix. Again, I wasn't able yesterday to remove the assignment, because that's basically something which I do not want here.
39:25
What I am interested in here are the two clusters. So these are not the assignments based on the actual status column, which says genuine or fake; rather, it's the result of the cluster analysis that we have performed. And we could dig deeper into that, and we would find this one observation that we had.
39:45
Let me just show you. This one, if you see here the small dot: that is actually a genuine banknote, but when we compare the distributions, it falls into the range of the counterfeit data.
40:01
And if we look at the results from the cluster analysis then this would be an observation which would be classified wrongly. Okay. This is where we could simply go on. Let me just come to a conclusion. Maybe that works. I guess it's really dependent on the amount of data which is on the slide.
40:23
Okay. Python with pandas and SciPy has proved to be a really valuable tool, not just for my scientific work but also for the daily work that I need to perform in my department. It greatly simplifies my life in everyday analysis. I have written hundreds of lines just programming with pandas.
40:44
And it's fun every time. I can only recommend that you check it out. And if you want to know more or have some questions about what I've shown today, just visit me at the DANTE booth. We sit in the Mensa. We have a lot about LaTeX.
41:00
I can also show you some LaTeX. And just come by. Okay. Any questions so far? Some literature recommendations. There's a life besides Stack Exchange.
41:20
What I normally do is simply google things and find something on Stack Exchange. But there are some good books, like Learning pandas, Mastering pandas for Finance, Python for Data Analysis, and so on. Have a look at those. Thank you.