We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Introduction to Pandas

00:00

Formal Metadata

Title
Introduction to Pandas
Title of Series
Number of Parts
43
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production PlaceErlangen, Germany

Content Metadata

Subject Area
Genre
Abstract
Pandas is nowadays the library of choice for manipulating and analysing structured data, providing high-performance, easy-to-use data structures and data analysis tools. This hands-on tutorial will give a basic introduction to pandas, guide you through the different data structures and its manipulation, explaining the the key concepts and defining features. No prior knowledge about pandas is required.
5
Thumbnail
1:34:10
33
Thumbnail
1:31:57
34
Thumbnail
1:28:12
35
Thumbnail
1:27:32
41
Thumbnail
1:31:21
43
Demo (music)WindowMenu (computing)Image resolutionNormal (geometry)RotationComputer configurationPrice indexInformation systemsElectronic visual displayFirefox <Programm>LaptopEmailDisk read-and-write headLine (geometry)Cellular automatonSurvival analysisCASE <Informatik>Distribution (mathematics)Bit rateCASE <Informatik>Physical systemGoodness of fitPersonal area networkLaptopDistribution (mathematics)Natural numberExecution unitComputer animation
Inclusion mapMilitary operationData structureLatent heatAttribute grammaroutputDatabaseMaxima and minimaCASE <Informatik>Survival analysisBoolean algebraSubject indexingData analysisDistribution (mathematics)Bit rateComputer configurationElectronic visual displaySeries (mathematics)Disk read-and-write headCellular automatonEmailSummierbarkeitLine (geometry)CodeGUI widgetDigital object identifierKernel (computing)BitImage resolutionLevel (video gaming)Social classService (economics)Electric generatorHand fanAreaComputer fileComputer animation
Distribution (mathematics)Bit rateSurvival analysisCodeKernel (computing)SummierbarkeitLaptopCellular automatonSample (statistics)CASE <Informatik>MeasurementDatabaseFunction (mathematics)Workstation <Musikinstrument>Plot (narrative)Arithmetic meanPrice indexUser profileData analysisSpreadsheetArray data structureRadio-frequency identificationType theoryTable (information)Data structureComputer-generated imageryAttribute grammarInformationObject (grammar)Read-only memoryCore dumpFrame problemTotal S.A.Source codeElectronic visual displayAreaData dictionarySeries (mathematics)Distribution (mathematics)Table (information)Different (Kate Ryan album)Social classInsertion lossRootDatabaseWorkstation <Musikinstrument>Data structureData dictionarySpreadsheetFrame problemComputer fileTime seriesSubject indexingMeasurementSeries (mathematics)Data typeCASE <Informatik>Revision controlLengthElement (mathematics)InformationRepresentation (politics)Poisson-KlammerKey (cryptography)Row (database)Default (computer science)Attribute grammarCombinational logicBitObject (grammar)Functional (mathematics)Set (mathematics)ConcentricPresentation of a groupAverageTwitterLatent heatMachine visionExecution unitAreaSequelMoment (mathematics)1 (number)Cone penetration testLaptopView (database)Multiplication signMedical imagingOrder (biology)Office suiteVisualization (computer graphics)Network topologyNumberSuite (music)Condition numberMortality rateResultantReliefPower (physics)Address spaceInheritance (object-oriented programming)Physical lawComputer animation
Arithmetic meanArray data structureOperations researchCellular automatonKernel (computing)Subject indexingBoolean algebraElement (mathematics)Maxima and minimaLatent heatInterior (topology)AverageCalculationSurvival analysisFingerprintCuboidRule of inferenceSuite (music)Condition numberMultiplication signCASE <Informatik>Series (mathematics)Process (computing)ResultantAdditionState of matterWeightPosition operatorTheory of relativityRootDifferent (Kate Ryan album)Frame problemOperator (mathematics)Pairwise comparisonArithmetic meanMaxima and minimaRow (database)Subject indexingFunctional (mathematics)Group actionBoolean algebraElement (mathematics)CountingFunction (mathematics)1 (number)Object (grammar)Computer animation
Cellular automatonMaxima and minimaSurvival analysisAverageCalculationData structureSubject indexingOperations researchMedianLaptopKernel (computing)File formatSummierbarkeitoutputFunction (mathematics)DatabaseTimestampAnalog-to-digital converter1 (number)BitRange (statistics)Group actionVideo gameSequelClosed setHypermediaForcing (mathematics)Core dumpSeries (mathematics)CASE <Informatik>System callSpecial unitary groupShape (magazine)WebsiteSurvival analysisReal numberDecision theorySoftware bugGame theoryComputer configurationTotal S.A.SummierbarkeitArithmetic meanInternetworkingReading (process)Table (information)Frame problemComplete metric spaceRow (database)NumberLengthMaxima and minimaComputer fileCodeFile formatDistribution (mathematics)MedianDefault (computer science)Functional (mathematics)HistogramPoisson-KlammerOcean currentComputer animation
Disk read-and-write headLaptopKernel (computing)ComputerComputer networkServer (computing)VideoconferencingPrice indexComputer fontData typeTimestampComputer configurationComputer fileNormal (geometry)BitRoutingElectronic mailing listOpticsAdditionExecution unitCASE <Informatik>EmailRepresentation (politics)Row (database)Computer animation
Structural loadInformationParsingTimestampPositional notationCellular automatonKernel (computing)Analog-to-digital converterPrice indexEmailDigital filterParsingDecimalDisk read-and-write headString (computer science)Uniform resource locatorTable (information)IntegerMultiplicationParameter (computer programming)Line (geometry)Computer fileBoolean algebraInternet service providerRead-only memoryFile formatLimit (category theory)Flow separationShift operatorComputer fileBitE-learningLaptopFrame problemComputer virusMathematics1 (number)Computer animation
Line (geometry)ForceSubsetElement (mathematics)Internet service providerParameter (computer programming)Computer configurationPrice indexIntegerString (computer science)EmailRead-only memoryBoolean algebraParsingTimestampLaptopCellular automatonKernel (computing)Greatest elementPositional notationLatent heatNetwork-attached storageIdentical particlesComputer fileDisk read-and-write headInformationEmpennageCore dumpFrame problemTotal S.A.CountingPlot (narrative)Subject indexingParsingComputer fileRow (database)String (computer science)Table (information)Flow separationParameter (computer programming)CASE <Informatik>HistogramBitPlotterFunctional (mathematics)Line (geometry)Semiconductor memoryDefault (computer science)Limit (category theory)Computer configurationNumberElectronic visual displayMultiplication signDisk read-and-write headWorkstation <Musikinstrument>InformationRepresentation (politics)MeasurementPrice indexAdditionElectronic mailing listFile formatTimestampLageparameterOnline helpFerry CorstenOffice suiteClosed setRule of inferenceConsistencyVideo gameRight angleDifferent (Kate Ryan album)Insertion lossData storage deviceAverageMusical ensemble1 (number)WordWebsiteMereologyUniverse (mathematics)RootState of matterSystem callSet (mathematics)Machine visionResultantComputer animation
Plot (narrative)TimestampComputer fileKernel (computing)Cellular automatonBoolean algebraPopulation densityAreaDistribution (mathematics)Line (geometry)Patch (Unix)Point (geometry)MereologyLaptopPlotterCASE <Informatik>MeasurementDistribution (mathematics)Graph coloringLine (geometry)HistogramType theorySheaf (mathematics)CuboidDefault (computer science)Different (Kate Ryan album)ScatteringPiDecision theoryLevel (video gaming)Closed setSemiconductor memoryMultiplication signRoutingInheritance (object-oriented programming)Computer animation
Digital object identifierPlot (narrative)Distribution (mathematics)Kernel (computing)Cellular automatonLaptopSubject indexingPosition operatorComputer fileInternet forumKeyboard shortcutAttribute grammarCartesian coordinate systemSimilarity (geometry)Array data structurePatch (Unix)Boolean algebraRow (database)Similarity (geometry)SubsetSheaf (mathematics)Default (computer science)Keyboard shortcutElement (mathematics)Error messageSelectivity (electronic)Auditory maskingFlow separationSubject indexingElectronic mailing listProgram slicingPosition operatorFrame problemScalar fieldPattern languageDifferent (Kate Ryan album)Boolean algebraAttribute grammarMereologyLine (geometry)PlotterPoisson-KlammerComplex (psychology)BitSquare numberDoubling the cubeBoss CorporationNumberRule of inferenceCondition numberCASE <Informatik>Workstation <Musikinstrument>Grass (card game)Volume (thermodynamics)Series (mathematics)MultiplicationVideo gameRight angleGroup actionPlastikkarteSystem callState of matterGoogolCommunications protocolMusical ensembleCollaborationismDemosceneCausalityCodecMultiplication signRoyal NavyPhysical lawAreaSpecial unitary groupComputer animation
Kernel (computing)Cellular automatonLaptopArithmetic meanMilitary operationTheoryGroup actionCalculationSummierbarkeitKey (cryptography)Musical ensembleLetterpress printingStatisticsFunction (mathematics)NumberGame theoryCondition numberCASE <Informatik>ExpressionKey (cryptography)Different (Kate Ryan album)Degree (graph theory)Product (business)Group actionParameter (computer programming)Row (database)Closed setOperator (mathematics)Rule of inferenceGraph coloringRight angleComputer virusMachine visionContrast (vision)Hill differential equationAdditionEqualiser (mathematics)Service (economics)TheoryPairwise comparisonWritingSingle-precision floating-point formatFunctional (mathematics)Type theory2 (number)Computer configurationBitSelectivity (electronic)Arithmetic meanSubsetBlock (periodic table)Frame problemSystem callBoolean algebraFigurate numberSummierbarkeitLengthLoop (music)Computer animation
Group actionIndependence (probability theory)Function (mathematics)Data structureLaptopCellular automatonKernel (computing)System callComputer fileSimilarity (geometry)SummierbarkeitMaxima and minimaDisk read-and-write headSurvival analysisAverageArithmetic meanCalculationSubject indexingPlot (narrative)Internet service providerCASE <Informatik>Musical ensembleMeasurementBoss CorporationLie groupOperator (mathematics)AverageOffice suiteSystem callHypermediaTime zoneGroup actionSet (mathematics)Digital photographyComputer configurationFunctional (mathematics)Combinational logicReduction of orderMaxima and minimaObject (grammar)Default (computer science)Summierbarkeit1 (number)Frame problemSubsetDifferent (Kate Ryan album)Single-precision floating-point formatComputer animation
AverageCalculationKernel (computing)Cellular automatonLaptopSurvival analysisArithmetic meanStructural loadSubject indexingPlot (narrative)Internet service providerFunction (mathematics)IntegerSeries (mathematics)Morley's categoricity theoremDedekind cutAbelian categoryCategory of beingSocial classObject (grammar)Different (Kate Ryan album)BitArithmetic meanSelectivity (electronic)SubsetFrame problemMathematicsGroup actionSurvival analysisRevision controlOperator (mathematics)Subject indexingDemosceneAverageRow (database)Default (computer science)PlotterSocial classFilm editingFunctional (mathematics)Number2 (number)Electric generatorCAN busSelf-organizationLevel (video gaming)Office suiteWeb pageMereology1 (number)Parameter (computer programming)Rule of inferenceFingerprintResultantNormal (geometry)Stress (mechanics)Right angleHecke operatorDecision theorySheaf (mathematics)AuthorizationGraphical user interfaceComputer animation
LaptopCellular automatonKernel (computing)Arithmetic meanAbelian categoryCategory of beingObject (grammar)Dedekind cutSocial classAveragePlot (narrative)Function (mathematics)Internet service provideroutputParameter (computer programming)GUI widgetStructural loadString (computer science)Price indexDisk read-and-write headSubject indexingPartial derivativePartial derivativeFrequencyString (computer science)TimestampSubsetSubject indexingFrame problemCASE <Informatik>Functional (mathematics)BitRow (database)CodeLine (geometry)Social classComputer fileObject (grammar)Multiplication signTouchscreenBinary fileTime seriesInformationMessage passingResultantGroup actionPower (physics)Local ringLie groupAverageSource codeDifferent (Kate Ryan album)Mobile appHypermediaProcess (computing)ArmGoodness of fitRoutingComputer animation
Disk read-and-write headMaxima and minimaString (computer science)FrequencyWorkstation <Musikinstrument>Time evolutionAveragePlot (narrative)Kernel (computing)Cellular automatonCartesian coordinate systemObject-oriented analysis and designDrop (liquid)Price indexWorkstation <Musikinstrument>Profil (magazine)Cartesian coordinate systemMultiplicationAttribute grammarAverageCASE <Informatik>Drop (liquid)Functional (mathematics)Line (geometry)DatabaseEvolutePlotterFrame problemArithmetic meanDefault (computer science)Time seriesString (computer science)Different (Kate Ryan album)FrequencyPointer (computer programming)ConcentricBitReading (process)Row (database)1 (number)CalculationSingle-precision floating-point formatRepresentation (politics)NumberMultiplication signUniform resource locatorCombinational logicGroup actionElectronic mailing listLink (knot theory)Direction (geometry)Resampling (statistics)Bounded variationPattern languageGraph coloringShape (magazine)Condition numberTable (information)Subject indexingLogicSeries (mathematics)SubsetWeightInsertion lossCollisionDecision theoryProduct (business)CAN busExecution unitPower (physics)WebsiteVisualization (computer graphics)Game theoryFrustrationBootingGoodness of fitPhysical lawLogic gateMechanism designStructural loadHypermediaRoyal NavyArrow of timeLevel (video gaming)State of matterRule of inferenceAreaForm (programming)Moment (mathematics)DemosceneSystem callConnected spaceLocal ringCopyright infringementMessage passingCausalityDialectMachine visionParsingAdditionMereologyWhiteboardResultantSelf-organizationForcing (mathematics)Mountain passComputer-assisted translationArmComputer animation
Transcript: English(auto-generated)
Okay. Good morning. So, my name is Joris Anapos, and I'm going to give you an introduction to Pannes tutorial. Just to start, who has already used Pannes before?
Some of you. There are so many who don't use Pannes. So, I will do some explanations. There are also a lot of exercises.
So, we are going to do, as I showed before, the Pannes introduction notebook. So, the first one. There is a lot more material. We are only going to do the first tutorial.
If you want to do more notebooks in signals target exercises, and the solutions are always provided as well, you have a lot more material there.
So, the first notebook. So, just to show you that I don't have to explain it later.
So, there are exercises in Greek that are exercises that I would like you to try yourself. My advice is to first try yourself. I will later on also go over them. If you already want to look at the solution after you try it, you can recommend this one and run yourself.
Then you will see one possible solution. So, for example, if I'm only about to go a bit faster, you can do the exercise and look at what I provided as a possible solution. Okay, so to start, for those who are not familiar with Pandas, I'm going to give a quick introduction.
By this, I would use it by starting with a short piece. So, for example, I have some data about the Titanic passengers and whether they survived. So, I can read a CSV file with Pandas, and I can easily look at the data.
There are several columns, and what I can do with Pandas is very easily, in search, kind of table the data, look at my data and ask some questions. For example, what is the age distribution of passengers? Or how did the percentage of people that survived, how did it differ between, for example,
males and females, and you see there is quite a difference. Or how did it differ for the different classes of the passengers. So, you can very quickly look at your data, analyze your data, manipulate your data. No worries about what I did here exactly.
We are going through the notebook, and you are going to be able to do that in the end. There is one data set. Another data set is about some air quality measurements. So, time series of how the measurements of the air quality in a few Paris monitoring stations.
And also, with time series Pandas, it is very easy to use the time series. So, you can look, for example, at how the trend looks like with the monthly averages over the different years. And you see a small decrease, slight decrease, or yearly averages like this.
Or we can look at how they are not patterned. Typical concentrations over a day, you see in a week, day, you see the typical concentration peaks during the morning and evening rush. So, but we will come back to those.
So, what is Pandas? I suppose that most of you followed yesterday also the Python and NumPy introduction tutorials. So, NumPy arrays, they can only have one data type.
Often, when you have real-world data, you often have tabular-like data, like a SQL database or a spreadsheet in Excel. And often your columns have different data types. So, that is something that Pandas brings to. So, based on NumPy, it provides a table-like structure, very similar to in R, the data set in R,
where you have different columns that can have different data types, and also where your columns have names that you can use. If you have such kind of data, then Pandas provides many functionalities to
import those data, to clean up the data, to process, explore, analyze, etc. So, it is specifically targeted to tabular 1D or 2D data. Of course, just to be aware, not all data fits in such a structure. For example, array data, if you have images, or if you want, if you work with, so it gets
the numerical algorithm, say, just stick to NumPy, and you should be able to work with Pandas for specific data. To start, so, the basic data structure that Pandas provides is the data frame in series.
So, the data frame is the table, the 2D table, where you have, your data is inside, you have some column names, and we also have an index, which are the row labels. So, if you look at the data frame we had before, the presentation is not that nice,
but you see here, we have our column names, different column names, and these are our row labels. If you want to look at those attributes, so we can access what the column names are, or what the index is, through those attributes.
The index is just the default range, but you will later see that we can also set some other data as the row labels. And you can also access the column names, where you can see different columns.
As I explained, so the data frame can hold different data types, so the Dtypes attribute can actually see that each column has its own data type. We have some integers, some column data, the objects that you see there, that's often if
you have strings, a column with strings, then you will see that it has an object Dtype. Another method that gives a bit of an overview of how this inside the data frame is a .info method, where you see the combination of the info before, the different data types, how many values there are inside each column that are not missing,
and some other information. You also have the .values attribute, which gives you the underlying array, but in case of a data frame, it's not always that useful, because, as I said, so NumPy can only have one data type,
and this will give you data with different data types inside the array, so it's not that useful in this case to get the actual array. Before I read the data frame, I created the data frame by reading it from a CSV file. There are many other ways you can create a data frame.
Another way is, for example, by passing a dictionary to the data frame object, where your keys of the dictionary will be the column names and the values of the different columns.
For example, in this way I can make a small data frame. The data frame has a 2D object, the series is a 1D object, or, for example, if you access a single column of your data frame, then you get back a 1D series.
How can you access such a column? Using just square brackets in the column name. For example, if I want to access the h column, this is now the series. You see a small difference in the representation in the notebook? In that way you can very quickly see whether it's a series or a data frame.
The data frame has HTML representation, and the series is just a text representation. I'm going to assign that series to a variable h. Just like the data frame, the series has an index, in this case it's the same, and has some underlying values which are stored in the non-parameter.
As we can do with non-parameters, we can access certain elements of the series by indexing, for example, the first element.
But, as I already mentioned, now the index is just the default range, from 0 to 2, and the length minus 1. But you can also use row labels that actually have a meaning.
If you want to set a certain column of your data frame as the index, you can use the setIndex method. So now you will see, here in bold, it's not anymore 0, 1, up to the length, but the length of the people that are of different passengers.
If you then get back the series for the h, you see now the index of different names, and you can access a certain version of a dictionary.
You can access the age of a person based on bypassing that name. That's something you could do with a dictionary in Python. What you can't do with a dictionary is very easily do elementized operations as you would do with a non-parameter.
For example, I can multiply each element in the series of thousands, or there are also a bunch of minutes available in the data frame in the series to do all kinds of operations. For example, I can take the average age of the passengers, so the average age of our Titanic passengers was almost 30 years.
What you can also do is, for example, for indexing, Boolean indexing. So if I do something like this, h is bigger than 70, so it's a comparison
which will also work element-wise, and it will return a series with false and true. We only see here false and true values. If you have such a condition which generates a series of false and true values, we can also use that to filter our original data frame.
Like this, if you pass this between the square brackets, we can select all people who are older than 70, all passengers of the Titanic with such an age.
And there are many methods available on the Panas data frame and in the series. Too many to go into detail, but if you tap completion, you will see that there are quite a lot of typical mathematical functions. For example, to take the maximum age or the minimum age, and all those typical things that you would expect are all available.
Most of the methods, for example, you know, you've maybe learned in NumPy, and those minimum action mean functions are also available on Panas objects.
But there are also some additional ones, some typical ones for data frames. For example, we have the value counts method, which is just a count of the values which are inside the column. So it will give you the count, so in this embarked column, which stands for the place where people embarked on the Titanic.
And it will count, there are 644 rows that have the value S, etc. So, there's a lot of talk, so let's try to actually do some use on this yourself.
So there are two small exercises to just start using it a little bit. As we say, if there are any questions when I'm speaking, just interrupt, raise your hand, or if you need an exercise, please ask questions.
Panas introduction. So the first, bit number one.
Is it up there? Maybe you need to download, we can only update it yesterday, even further late. So, it's maybe the reason. Well, just a little bit after 12.
Do you have internet or otherwise a stick can come around? Ok, so for the people who did it before, today, you have to, let's just scan this to download again.
There are people who still need to update, no, here's a stick.
Ok, I will go to the exercise.
So the first question was, what is the maximum fare that a passenger takes? So, we have a data frame, we want to know something about the fare code. So how do we access the fare code? We use square brackets and we use like this.
So this gives us the code, all the values of the fares that we did. And then we can just use a method on the series to know the maximum. Or to know median, like this.
So you see that the maximum is quite a bit larger than the median. Something we could also have done to explore the fare is to make a histogram of it. To see the actual distribution of the fare. In this case, we would maybe make some custom bidding and just the default, how the default bidding is.
Second question was, how do we, what is the current survival ratio of all passengers?
So, if you look at the survived column, you can see there are zero survival values. So I didn't really explain it explicitly, but zero means the person is dead, one means he survived. So what we actually want to do is, we want to count the number of ones in this column.
So, there are some different ways to count something. In principle, in general, if you want to count something, you could do a value count, as I showed before. But if you want to count something very specific, you could do like this.
Check which values are equal to one, which gives you a series of true and false. And then you want to take a sum of these, to know how many true values there are.
You could do like this, and then take a sum. This gives us all the number of passengers who survived. As in general, but of course we already had zeros and ones. So we can actually just take the sum of our column, which will give you the same.
The other method to look how many are equal to something is more general, you can use that for any value. If you want to know the percentage, you have to divide by the total number. For the total number of rows, one way to get this is the length of the data frame.
This length of the data frame will give you the number of rows. If we execute this, and we see that 38% of the passengers survived the Titanic. Yes, so you can indeed see, and you take the sum and divide the length.
In this case, it's of course, we can, to calculate the same, you can take the mean of this column. Like that. Any questions so far about those exercises?
Then we'll go a bit more in detail, give some explanation about data import and export. So, you already read the CSV file, but there is a wide range of data formats that pandas can read or also write.
In general, what you need to know, all the import methods are called pd.reads something. So if you do tab completion on this, you will see there are quite some. So we can read from SQL, from SaaS, we can read HTML tables from the
internet, we can read Excel files, CSV files, Stata, if you want to have Stata data. In the latest release, there is also the parquet format which can be written.
So, all read methods, pd.reads. All export methods are your data frame dots 2. And then, similar to the certain formats. So 2 CSV, 2 Excel, 2 HTML, if you want to have the Latin table, you can do that as well.
Or 2 SQL, 2 Stata, etc. So there are the export methods. The read CSV function is, I think, one of the most used functions of pandas. And it is also a very powerful or complex, you could say, function.
If you look at the help, you will see that you have quite some options. You could say that it is a very bad design, but unfortunately many people have the weirdest kind of CSV files. And we can say that the pandas read CSV can read most of those weird things.
But luckily, so there are many options that you can look into if you have some kind of special CSV file. But luckily, if you have just a normal CSV file, comma separated, header, things like that, you don't need those options. We already read the CSV file. If you look at the actual data, this is just a nice normal CSV file.
And then it reads fine without providing any options. I have another data file. It's this one.
This is a bit more specific. You can see if I can show it. So here, I can't make it larger.
But you can see it has some special things which are default. It has two header files because there are some kind of units on the second row. So we will have to skip this. You see it's also with same columns and not with commas.
And there are also some dates that we might want to parse. And the missing value representation is n slightly d. Not fully sure what it stands for. Maybe no data or something. So there are all things that you can specify in a CSV file.
So you can try to read this file into a data frame using GWT CSV.
To remember, if you want to look at the different parameters, you can in the notebook either use shift tab to see the health.
You can look at, for example, how it needs to specify the separation. The limitations we can use is set keywords. What we also need is the formatting is not very good.
So, shift tab. If you are inside the function, you do shift tab and then you get this. Another way to do it, to get it in a bit easier to read format,
or you can of course look online in the documentation page, is to put a question mark after then execute yourself. Then you get a little below something that you can also read. So what you for example also need, here is a skip rows argument where you can parse a list of rows to skip.
So what are the things that we need? We needed to specify that the separator is a comma separated and we want to skip the second row, which is row number 1, because the help set started to come to the table.
So if I execute this one, I see that I already have some nice D format data frame. The only thing, I still here see M slash D, which is a string in an American, so I don't want that.
There is, so the exercise notes to check the N8 values keywords. So if we look at that one, somewhere here, so a scanner or a list like additional strings to recognize as a missing value.
And there are a bunch of strings, an empty string, or all those strings are recognized as missing values, but the M slash D is not in there, so we can also provide this N8 values.
And now you will see here it's an actual missing value. We use the number, this is NAN stands for, so there is a NAN by NAN value which we use as an indicator for missing values.
So, very short, just to show that you can tweak quite a lot for when you are reading CSV files with PaaS. And apart from that, we have many other import and export functions.
There is one last thing, this is a time series, and I like to have my index as the, my time stamps as the index. For that I can also specify that my index column should be the first column,
so column number zero, and then you will see time stamps are now inside the index. And the last thing, but that's something you really need to know, if I want to actually parse those dates, I can also say parse dates is true,
because now it will still be strings, it's just a CSV file, it's text data, but with parse dates is true, I say to PaaS, let's try to parse those values I set as the index to actual dates. Now the representation looks the same, but now if you look at the index,
you will see the actual date time. If you didn't fully write this, you can always look here at the solution, which will do normally the same. So the NO2 file that we, the data, so it's our measurements of air quality,
nitrogen dioxide, therefore I used NO2, of some stations in practice. So it has quite some data, if you look at the length, it has more than 100,000 rows in the data frame.
So we have some methods in PaaS to quickly explore a bit of data. So something that you can use is the head function, which by default will give you the first five rows of the data frame.
You can also specify how many you want, you can say you have the first step. I set my display to only show eight rows, because normally the default is to show 60 rows, and I put in the first cell, if you excluded that, I changed the option to PaaS to only show a limited amount,
because it's easier to work with in the tutorial. By default it shows 60 rows, and then a head function to show the first five can be very useful. The same if you want to see the last ones. So you see it starts at the January of 2020, and the data goes up to the end of 2016.
Info already showed above, you can see how many values there are, you can also see there are some missing values, how much memory takes the data frame. We also have a describe limit, which gives you a bit the summary statistics of the different columns,
so you can see how many values, the average, standard deviation, some of the different quartiles in the dataset. So that's a way to get the summary statistics of all of this. You can also quickly visualize some of the data.
I already showed the histogram function, but the general plot function is just the plot method. So by default, if you use plot, we get a line plot of all the columns.
In this case there are many rows it takes to plot this with Matplotlib, and you also don't see that much, because there are more than 100 data plots for each line of the plot. So the default is just to plot the different columns as a line plot.
But with the kinds, keywords, you can specify some different plot types. For example, you can see here, the line plot is default, but you can make a bar plot, a history map, a box plot, a pie plot, in case you would want a plot or a record, recommend that, a scatter plot, etc.
So in this case, we take of each column a box plot, which gives you an idea of the distribution of each of the different sections. Or of a single one, you can take for example the histogram of the distribution of the measurements.
So, small exercise to go back to the Titanic dataset. So we call this, normally it's still in your memory, it's called DF.
Let's now try to plot the age distribution of the passengers of the Titanic.
So the age and the age color can act like this. We want to plot this, a default line plot won't make any sense, so we can't say we want another kind of plot,
a histogram for example, to get distribution. For the histogram, we also have the hist method, which will do the same, which is a shortcut for a histogram, but for the other plot types, you need to pass it as a kind to the plot method.
These are already mentioned, the default is a line plot, so I will go to the next section. I already showed some data frames, I have to explore them, but very often you want to do some subsetting of your data frame,
you want to select certain rows of your data frame. And because Pandas uses labels for the rows and columns, we introduced some complexity, because we can select both by those labels,
by our column names or by our row labels, but you can still, just like with NumPy, you can also select a certain part of the data frame based on the position, whether it's the first, the second, the third row, etc. So we have two ways to select data, by label or by position.
And because there are two ways, and sometimes you can mix them a bit, that gives a bit of complexity to indexing in Pandas, but of course being able to select by labels is also very powerful. So, up to now we always selected some data using just the square brackets,
so that provides some convenience shortcuts to select a single column, and that's what we already did up to now, like this gives you a single column. We can also use it to select multiple columns by passing a list.
So you will see now it looks like you have the double square brackets, you have just a list that you provide, and this is the list of columns inside the square brackets to index. So yeah, double square brackets. So this gives you a subset of columns of your data frame.
What we also can do is slice the rows. On one hand you can select some columns, one or multiple. Also, if we provide a slice or a boolean mask, it will not select from the columns, but from the rows.
So for example, about rows 10 up to, in this case, position, so 10 up to 14, so the five rows there. This is rather limited because either you select a column or you can slice rows, but of course you want to do some combinations,
and therefore you have the loc and iloc attributes. So just as we have two different ways to select, by label or by position, so therefore we have two different attributes,
loc to select by label, iloc to select by position. So I will set the name column as my index, so I have not set just the default integers, not anymore as my index, but actual names.
And if I now want to do some sub-selection, it's always inside loc. You first pass how you want to subset the rows,
and then how you want to subset the columns. So like this. This is always the pattern, the row indexer, column indexer. If you only want to select the rows, you can leave out the columns, but if you want only to select the columns using loc,
you have to provide just the column for a full slice, give me all the rows, and then I can select something from the columns. So this is the pattern how you use .loc. It's the same for iloc. What can this be? Those rows and columns, similar as we already seen,
can be a single element and a single scalar to access one element. It can be a list to subset, select several rows or columns, or it can also be a slice, or it can be a Boolean mask. So a small example.
For example, now here, this is my row labeled, this is my column name labeled, so I just passed two single elements, so this will just access a single element from the data frame. But in the second example, for example, what I do here, I pass the slice from Mrs. Elizabeth to Mr. Anders,
and then I say give me all columns for this slice. This will give you a subset of the rows, that's all columns.
Very similar with iloc, row indexer, column indexer, but now we just use integers, and it's just like with an umpire array, you can, in this case, you can do some slicing, but also here you could pass a list, for example, the second, and some other columns, for example, like this,
and that will work as well. So let's select some columns. I've now showed how to get some data. The same syntax you can use to also assign data to the data frame. For example, if you want to update a certain value,
if you look at the data frame, so it's actually the first one, here are low values, so if I now set this to 100 to survive column,
and you will see now this value has been updated. Similarly, with slices or masks of lists, you can do a selection and assign a value to that selection. This is something that I already showed and mentioned before.
If you do a certain kind of Boolean condition, for example, where the error is bigger than 50, this gives you true false values, and that's a Boolean mask, which we can use to index our data frame. You can do it like this by passing it to square brackets,
or you can also do it inside the block, and it will give you the same. Then you can, for example, combine it with I want to access the survived column, like this. Okay, let's do some exercises on this index.
Let's try the first exercise. We have a data frame, and the question is, select all rows for male passengers
and calculate the mean, age of those passengers. So how do we select all the rows for the male passengers? We have a column called sex, and we round those that are equal to male. So we can do this with a comparison,
where they are all equal to male, and this condition will give us true and false values. It gives us a Boolean mask, with which we can subset our data frame. Let's call this, for example, the males.
So this will give us all the rows of the data frame, only where the sex is equal to male. Next, it asks the mean age of those passengers, the males, so we can get the age column
and we calculate the mean. I now did it in two different steps. This is a selection of the rows, this is a selection of the columns. I could have also done this in one step, using the block. So always rows, columns.
So what is my row selection? That is this Boolean mask, where the sex is equal to male. What is my column selection? That is the age column. This will give us all the ages of the males,
and we can calculate the mean of this. And then doing the same for the females.
So you see the females are a little bit younger than the males. The second question was,
to count how many passengers are older than 70. So we want to know something about the age, we get the age column, but we want to know whether this age is larger than 70. So therefore we can use a comparison,
and this gives us the false and true values. What we could do is subset our data frame, and look at those, and it gives us a few rows. What we could do is count the number of rows that are inside this data frame.
How do you know the number of rows? Of course we can use the length of the data frame, which gives us 5. Of course, I showed this before, if we just want to count the number of values inside,
the number of true values, whether some expression is true, some condition is true, we can take this level, which in this case is a bit easier, we don't have to actually do the subset of our data frame
to know the number of rows we would get. So that's very short, and some introduction to getting a subset of your data frame. If you look at the solution that we have for the first exercise,
so we calculated the mean for one condition, calculated the mean for a second condition, and for the different values in that column of male and female. So that's a very typical thing you want to do. You want to group your records, your rows,
by a certain column, and for the different groups do a kind of operation. In this case you want to play the mean. Because it's a very common operation, there is some functionality to do that, much easier than to write this fully yourself, and certainly if you would have more options than only male and female,
writing in that type would be very tedious. So, what I said was, grouping by a certain column, in this case the groups are male and female, and then applying a certain operation, this is called the group by operation.
So, let's as an example look at a small data frame. So, this data frame, with a certain key ABC,
so that our groups add some values. So, we can take the sum of a certain column, but as I just said you typically want to take the sum for the different groups. To not have to write it on multiple lines, we could do also a for loop like this. So, here we print the sums of the different groups.
Of course we don't have to do that in a different way. Therefore, we want to do the group by operation. Here is just a figure which illustrates a bit what actually happens under the hood. If you have a data frame and you split first into different data frames
based on a certain column, in this case the key column, so you have a subset data frame with all the A's, all the B's, all the C's, then you apply the function that you want, in this case a sum on each group. This gives us 15 for the first one,
13 for the second one, 45 for the C's, and then you combine them again in a resulting data frame. So, therefore this kind of operation is also sometimes called split, apply, combine. Also, in SQL you have such a group by operation. So, how does this look like?
This operation, how does it look like in Pandas? We have a group by that's on our data frame. You pass to group by. You pass the name of the column. It's one of the options that you can pass the name of the column on which you want to base your different groups.
So, it was the key column. Next, as a method on this result, you can specify how you want to aggregate, which function you want to apply. So, in this case you can take the sum by this.
So, there are quite some methods available on a group by object. So, you can also do median here, or you can take the minimum. So, many of the typical reduction methods that you can do just on a data frame, you can also do on a group by operation.
You can also, by using the aggregate methods, pass the custom function. In this case I reused again the sum, or for the default ones you could also specify it as a string.
So, the default ones, sum, median, maximum, are recognized as well. Here you can pass any kind of function. If it's passed the data frame, that does some kind of aggregation.
There are many more methods available. You may want to select it on a single column, because by default this function will be applied on each column. In the small example there is only one column, if you apply it on the Titanic, there are many columns.
If you want to apply our function only on a single column, you can, as we would select a single column from a data frame, you can also do it here. You could of course put it at the end as well, but then by putting it before applying the function, you need to select first a column on the group by operation.
In this way you avoid that, and the actual calculations happen for the other columns. Okay. That's the group by operation, how it works. So let's now try this out on the Titanic data set.
So what we did before was to calculate the average age for each, sex for the female and the males. So if you remember we did it here on those two lines. So let's now get the same results, those 30 and 27 or 28, but using group by.
Okay, we'll go through the different exercises. So the first was, and to do the same as we did before, the average age for each sex. So we use group by.
We want the group by males and females, so we group by the sex column. And we want to take the average. By default this will take the average of each column. So if you only want the average age, you can select the age like this,
and then it will only give us the average age. So you see, and we have the 28, 13.7, as we had before, but now much easier to calculate. Very similar as this one, calculates average survival ratio of all passengers.
So this one is just to remind you that if you want the percentage of people who survived, you can just take the mean. So an average to us is 38%.
So if you want to know this for the different passengers, you can calculate the survival ratio for different subsets. This one just asks for the survival ratio of all passengers. So for this we don't need the group by,
because we just want to sub-select based on the age, all those that are on the 25, and take the mean of the age column here. So if I do now mean again to take the all column here,
the mean of all columns, I only want the survived ratio. So I can select both the rows and the columns by using lock like this. Now I get only the survival ratio. So you see, the younger people have a slightly higher
chance to survive the tatank. Since you want to do that without the lock, and just in a scene and then in a scene again, why would you prefer one or the other? So what I did first was this.
I made a subset of my data frame. So this actually, if you leave out the mean, this gives you again a data frame. And based on this subset of the data frame, what I want to do is select a single column and take the mean of that.
So if you just do it sequentially, you can't do it like that. It will just work as well. Only I need to take the survived to the same value. Like that. So this works perfectly as well to get some data and populate something. So for this exercise, it's just as good.
The only thing that you have to be aware of is called change indexing, because we change different indexing operations. The thing that you have to be aware of is that if you do this, and you want to assign a certain value, you want to set values,
not only get them to calculate something, but to assign values that you can't get into trouble. Well, if you do it like this, this will also work to assign something. For example, if I would have put all the values for the survival of the people
younger than 25 to another value, I can do that a little bit longer. With this change indexing, it might work or it might not work, depending on what kind of sub-selection you do. Very short for those who want to know. The reason for this is because the first operation,
depending on what you sub-select, can be actually a copy of the data frame, and then the second selection that you do, you're not making a selection on the original data frame, but actually on a copy of it, and if you then assign a value,
you're assigning a copy and not in the original one. That's probably a bit more advanced topic, but by using those change indexing, you can get into trouble a little bit with that. If you're using the look-like here, that should be no problem. I also just saw the second version
with a lot of action to pass it on. Maybe people can do the whole thing in one step rather than having to index. Yes, that's possible. Next class, what is the difference in survival ratio between people?
Sexes, so again, we provide by the sex code, but now we want to calculate the mean of the divided column values. You can see, actually, females have a much higher chance to survive than humans.
Next question. Can we do the same, but not for the sex, but for the different classes? You have the first class, second and third class on the Titanic, and we can also calculate the survival ratio for this.
What we do is just replace the sex with the class, which was copy problem, I think. Like this, you see the different values for the classes, and also ask, can you make a barcode for a little bit more visual, so what we do on the results, we do a barcode like this,
so now we get visually, you see that the people who pay more, which are in the higher, the first class has a higher chance to survive than the people in the lower or higher classes.
There is the last exercise that we have something provided, because the question was, can you make a barcode to visualize the average fare that people pay depending on their age? What I provided here is using the cut function.
The cut function is a function in pandas, if you have a column with numerical values and you want to read it somehow, you can use the cut function. So if I execute this and I look at this function,
you will see now every value, which has an actual age, now has a class assigned to it. So whether it's between 20 and 30, between 30 and 40, etc. So now we made our numerical column, our continuous column, we made it a big column.
And this way we can, for example, group by the scope for each class. We wanted to know the question of the average fare,
so we take the fare column, take the mean, like this. So now we have the value for each class, and again we can make a plot of it, like this. So here you see now for the different classes,
each class is the price that they get. Could of course add to the same line as the previous examples, we could look at the survived code. So you will see for the different ages, how many people survived.
So what I did here, is providing the actual bin ages.
So if I remove this, it's a missing argument, but what I could also do is say I just want 10 bins, for example, but then it will, I think in equal bins, just divide from bin to maximum equal bins,
which gives you a bit like this, which are not very nice bins. So what you can do is here provide the actual, the different ages of the bins. The way that it was, I said I went from 0 up to 90 steps of 10. So 0, 10, 20, it's 5.
Then it will give us a bit nicer bits. Okay. Are there any more questions related to goodbye? Or all the previous things we saw? Then I will go to the last topic I want to show you,
a little bit of time series data. It's also something that VANES provides some functionality for,
to easily work with that kind of data. So we previously already practiced how to read the CSV file with the air quality data. So you can't see it on the screen here,
but I set the first code on the index and I parse the dates. So if I look at the first data, you will see that now I have here my time stamps in my index. So if I look at the index, it will say dates and time index.
So if you have actual, if you have a time series and you want to make use of the functionality that VANES provides for time series data, you have to make sure that your index is an actual daytime index and not just, for example, the streets that are representing the different daytimes.
So here you see the daytime index. And one of the things that it makes possible is that you can very easily select a subset of the data. So as we did before with the slides, but instead of providing actual timestamp objects or daytime objects,
because VANES always is a daytime index, so what you parse there, I will try to parse it as a date. So you can specify it as a string and it will be interpreted by VANES as actual timestamp. So if I, for example, in this case I want to slice my full data frame
and I ask for the hours from 9 up to and including 12 o'clock of January 1st up to the 10th, so I can do it like this. So that makes it a bit easier to index your time series.
You will see that you can parse the strings here. A bit more magical feature is that you don't have to provide the full string. So here I specified up to the hour because my data has an hourly frequency.
But you can also, if you make a short string, which is called a partial daytime string, VANES will interpret this as, okay, I asked from January up to March, so it goes from the beginning of January up to the end of March.
This is a bit of a convenience feature that you can use to quickly subset certain data. I could also do the same with, for example, the years, you want to have the all data of 2012 and 2013.
I can do it like this, so from the 1st of January of 2012 up to the end of 2013. Something else that is very useful is that we have on our index, so our index is a daytime index, and typically you want to know something about your index.
For example, you want to get the hour of each time stamp to, for example, select all the records of a specific hour. So you can access those as attributes. And all the typical things you want to know about your time stamps are available.
For example, the hour, and also get, as I showed, the year. For example, there are things with week completion, it's slow.
For example, the weekday, the week of the year, weekday will give you the market value, whether it was a Monday, Tuesday, or an hour to Sunday, etc. Also, the minutes, in this case it will be hour zero, of course,
we had minutes, etc. So that's very convenient to access some information about your time stamps. If you plot all those series, so you see, we had actually,
I already showed this plot before, because we have an hourly frequency of our time series, and this plot is not that useful. Often you want to convert the frequency of your time series to something else. For example, instead we have one hour frequency of our time series,
and maybe we want to calculate daily average concentrations. Instead of hourly concentrations, I want daily concentrations. For that, there is a specific method which is called resample. So to resample your time series to another frequency.
So if we look at the data, we have hourly data, so for each hourly we have a concentration. The resample method works very similar as a goodbye. So what we do, we say .resample, we specify the new frequency that you want as a result,
and then you specify the methods to calculate the new value, because if you go from hourly frequency to daily, we have to somehow aggregate our 24 values in this case. So then again, you have a lot of methods that you can specify,
like mean, median, median, maximum, etc. So if I execute this, you will see the results. It's now average concentrations for each day for different colors. And then again, you can here put different ones, for example, minimum, or other methods that are available as well.
So this string that you have to provide here, you can specify any frequency. There is a link here for the different strings that you can use for that,
but the main ones that you can typically use, for example, the hourly is just a capital H, but we already have hourly. You can go to daily. You can also go to monthly. Now I have an average value of each month. For example, I have a plot now with monthly data. If you want to have yearly data,
it's not capital Y, but it's the A of annual. So now I have averages for each year. What you can also do is, if you would like to multiply the frequency, for example, I can say every two months,
that will also work. Those are really short. So to end this tutorial, let's do some exercises based on these data sets, or the air quality data sets, using the functionality that we have seen.
For the first one, the yearly averages, you can just use resample. The exercise also asks to add the overall average, by which I mean the average of the four different stations. To calculate this, let me first show something. If you take the mean, by default it takes the mean of each column, so in this case you get just four values. If you want to take the mean not of each column but of each row, you can pass the axis keyword to specify the direction along which to take the average. Like this, you will see one value for each row instead of one value for each column, which is the average of the four different stations.
So by default, if you don't pass anything, it takes the average of each column. Actually, if you look at the documentation, you will see that the default is not shown there; if somebody wants to do a PR, a pull request to pandas, to document the default, that would be a nice contribution. The default is axis=0, which means: take the average along my index, so one value for each column. That is the default for most of these methods; mean, median, max, std, they all aggregate per column. By passing axis=1, I say: take the average along my columns, so one value for each row. If we do this, instead of a single value for each column we now have a single value for each row, which is the mean of the four different columns.
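In code, the difference between the two directions:

```python
data.mean()         # default, axis=0: average along the index, one value per column
data.mean(axis=1)   # average along the columns, one value per row (over the 4 stations)
```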
We can, for example, add this to our data frame as a new column, say 'average', like this. Then you will see that we now have a new column in our data frame with the average value of the four other columns.
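A sketch of that assignment; the column name 'average' is just a choice:

```python
data['average'] = data.mean(axis=1)
data.head()   # the new 'average' column sits next to the four station columns
```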
Because it's almost time, we will only go through some of the exercises. So the question here was: can we plot the evolution of the yearly averages? How do we do this? We use resample: we resample with 'A' for annual and then take the mean. This gives us yearly averages, and then we can make a plot of this. In the solution you will see that I did it separately in two lines, without assigning a new column, to give a bit of styling to the overall mean compared to the other lines; you can see it here as a slightly thicker black line.
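One possible two-line version of the solution just described (a sketch, assuming data holds just the four station columns; the styling values are illustrative):

```python
ax = data.resample('A').mean().plot()   # one line per station
# overall mean across the stations, resampled to yearly and styled as a thick black line
data.mean(axis=1).resample('A').mean().plot(ax=ax, color='black', linewidth=3)
```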
For the other exercises, I'm going to drop this average column again; it's also nice to show how to do that. You have a drop method which takes a single column or multiple columns. The only tricky thing is that if you want to drop a column, you have to specify axis=1, because by default it will try to drop a row of your data frame. Like this; if you now look at our data frame, it has the four original columns again. Of course, instead of all this you could also just redo the read_csv to get the original data back.
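As a sketch:

```python
data = data.drop('average', axis=1)

# or equivalently, in more recent pandas versions:
data = data.drop(columns='average')
```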
So the next exercise asks for the typical monthly profile for the different stations. How do we do this? In this case not with resample: you actually want to take the average concentration of each month over the different years, because we want the typical average concentration of January, so the average of January 2014, but also of 2015, 2016, and all the other years we have. How can we do that? Remember that we have the month available as an attribute of our index. This gives us those numbers, and we actually want to group by those numbers. We can do that by assigning a new column in our data frame, and then we can use groupby on that column and take the average.
So you see that we now have one value for each of the 12 months, and you see the typical pattern throughout the year: the air quality is a little bit better during the summer and a bit worse in the winter. That is because NO2 is also emitted by heating, and in the winter the atmospheric conditions are such that the pollution stays trapped more locally than in the summer.
In this case we assigned this to a new column and then grouped by that column; so up to now we always specified a column name. But I could also have done it in one step, without assigning a new column, by passing the month attribute directly to groupby. If you pass a series or an array with the same length, groupby will just directly use those values. So when you group by some calculation or an attribute, you don't always have to set it first as a column and then point to that column. This does exactly the same.
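Both variants as a sketch:

```python
# Two-step version: assign a helper column, then group by its name
data['month'] = data.index.month
monthly_profile = data.groupby('month').mean()

# One-step version: pass the index attribute directly to groupby
monthly_profile = data.groupby(data.index.month).mean()
```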
Very similarly, if you want to look at the typical diurnal pattern, so the daily pattern of how the concentrations vary over the day, we can do the same, but instead of looking at the average monthly values, we look at the average hourly values. So I do the same as here, but I use the hour to group by, and then we see this. Note that because groupby by default uses all columns, you also see the month column we added earlier; you can drop that again, or select only the station columns you want.
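Analogously for the hourly profile (a sketch, dropping the helper 'month' column added above so it doesn't get averaged along):

```python
hourly_profile = data.drop(columns='month').groupby(data.index.hour).mean()
hourly_profile.plot()
```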
And here you see how the typical concentrations evolve during the day: you see the morning and evening peaks, because NO2, nitrogen dioxide, is mainly traffic related, so you typically see the morning and evening rush hours in those concentrations. Yes, at around 4 o'clock it is about 10. You can also see that there are four different stations; I'm not fully familiar with the locations, but those are traffic locations, and they are all in Paris. You can also see that there are quite some differences depending on whether you are on a street with a lot of traffic or somewhere else in Paris with probably less traffic, so the variation between locations can be much higher than the variation at one location during the day.
This last exercise I'm not going to show fully anymore, because it's almost time, and most of it you have already seen; there is only one small extra thing that I will show. In this case we use not the month or the hour but the weekday, because I want to subset my data based on whether it is a weekday or a weekend day. So how do I do this? I create a new column with True or False values, using the isin method. What does the isin method do? It gives you True or False depending on whether the value, here the weekday number I just calculated, is in the list [5, 6]; 5 and 6 stand for Saturday and Sunday.
So this gives True and False values. What I can now do is group by two different things. Up to now we always grouped by a single key, a single column, but you can group by multiple columns, and the groups will be based on all possible combinations of the two columns. So in this case, for both weekdays and weekend days, I take the average for all the different hours of the day. The weekend column is not the nicest representation, because False means week and True means weekend; we could replace those with nicer labels. But you will see that we now have all the different hours for False, and if you look further down the table, the last rows are the ones for True.
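A sketch of this step:

```python
# True for Saturday (5) and Sunday (6), False for weekdays
data['weekend'] = data.index.weekday.isin([5, 6])

# Group by all combinations of weekend/week and hour of the day
hourly_by_weekend = data.groupby(['weekend', data.index.hour]).mean()
```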
If you now would like to make a plot of this, what I do here is something that I didn't explain, but I just want to show the possibility: you can reshape your data frame, and in this case I use unstack. If I only look at one station (BASCH), I have the False and True values and the different hours together as my index; it is an index with two levels. But I want to rearrange it so that I have two columns with True and False, so the week and the weekend, and as my index just the different hours. To reshape it, I can use unstack: the data is currently stacked under each other, and I am going to unstack it, like this. Now we have, going up to hour 23, one column for the week and one column for the weekend, and we can then make a nice plot of this. This is a bit more advanced, just to end with.
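The reshaping step as a sketch; the station name 'BASCH' is my reading of the column mentioned in the talk and is an assumption:

```python
profile = data.groupby(['weekend', data.index.hour])['BASCH'].mean()

# profile is a Series with a two-level index (weekend, hour);
# unstack moves the weekend level into the columns
profile.unstack(level=0).plot()
```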
You can nicely see that in the week you have the typical rush hours, morning and evening, but in the weekend people in the morning are still in their bed and not in traffic. OK, let's leave it there. Of course, this was only a short introduction, so there are many things that I didn't talk about, typical things that you want to do with data: for example database-style joins and merges, many other methods to reshape your data, working with missing data; there is a lot more functionality. I provided some pointers here if you are interested in more, and there are many good tutorials online. Thank you.