Neat Analytics with Pandas Indexes - TIB AV-Portal

Neat Analytics with Pandas Indexes

Hendorf, Alexander

Formal Metadata

Title

Neat Analytics with Pandas Indexes

Title of Series

EuroPython 2017

Number of Parts

160

Author

Hendorf, Alexander

License

CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this

Identifiers

10.5446/33665 (DOI)

Publisher

Release Date

Language

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

Neat Analytics with Pandas Indexes [EuroPython 2017 - Talk - 2017-07-12 - Arengo] [Rimini, Italy] Pandas is the Swiss multipurpose knife for data analysis in Python. In this talk we will look deeper into how to gain productivity by utilising Pandas' powerful indexing and make advanced analytics a piece of cake. We will cover: Pandas indexing recap; index types; time-series index and resampling; Pandas Multi-Indexin

Transcript: English (auto-generated)

00:04

OK, thank you. Thanks for the introduction, Peter. So you already know about this, so let's get straight to the point. Today, I'm going to talk about pandas, but we're really only talking about pandas indexing

00:23

in particular. The pandas index is a very powerful tool, and I think beginners' tutorials very often skip even mentioning it. So this is a closer look at the index. We're going to do a little catch-up on indexing,

00:41

how we can access data with the index, index types, multi-indexes, and a closer look at the datetime index. The very beginning is just a little repetition to get everybody on the same page. Pandas is basically built on series. So this is a simple example of a series.

01:04

So we just take some random integers and create a pandas series. And basically, it's like a list or an array with numbers. But one thing you see here, we already have this.

01:21

And this is called the index, and it works like a labeling. So basically, you can say it's not just a list of data as in Python. It's already labeled. It's actually a NumPy array here, which we can see from the data type. It's a NumPy array with labels, but I think most of you

01:40

should know that. So how can we access data in a series? Basically, it's a very Pythonic concept. It's called slicing, so we can just access it by the positional index, just as we would with a list. We can do slicing, just as we would with a list in Python.

02:03

And we can also use a method for this. The method is called iloc. But note, it takes square brackets, not parentheses, to slice stuff. And we see that here. So this is just a little warm-up.
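The warm-up described above can be sketched as follows. This is only an illustration: the slide code is not part of the transcript, so the variable names, the seed, and the value range are assumptions.

```python
import numpy as np
import pandas as pd

# Ten random integers; the seed only makes the sketch repeatable (an assumption).
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 100, size=10))

print(s.index)   # the automatically created labels
print(s.dtype)   # a NumPy integer dtype

first = s[0]            # with the default index, label 0 is also position 0
same_head = s.iloc[:3]  # iloc is used with square brackets, not parentheses
```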

02:23

But as I mentioned, we already have labels in our index, even in our series. So here, we can also just go and relabel it. What we're doing here is actually setting the index, which is just series.index

02:42

as an attribute. And we just take the alphabet and relabel it. So now we have exchanged our numerical, zero-based index for letters. And we can still access by position, which

03:03

is the Pythonic way with a list. But now we can even slice it by labels, say from label d to f. So pandas will just look here, from d to f, and give the result back. And with that, the part that is a little confusing for beginners already ends.
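A minimal sketch of relabeling with the alphabet and slicing by label. The series length and values are assumptions chosen for illustration.

```python
import string
import pandas as pd

s = pd.Series(range(10))
# Relabel: replace the default 0..9 index with the letters a..j.
s.index = list(string.ascii_lowercase[:len(s)])

by_position = s.iloc[3:6]   # positions 3, 4, 5 (end position excluded)
by_label = s.loc["d":"f"]   # labels d, e, f (end label included)
```

Note that the label slice includes its end label, while the positional slice excludes its end position.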

03:21

But I guess you have probably already run into it. Can we slice by multiple ranges? No, we can't. It's invalid syntax. And how could we solve it? Basically, there's another method called concat. So we can just take two slices of the series and concatenate them again.
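A small sketch of the concat workaround, assuming the letter-labeled series from before. The chosen ranges are hypothetical.

```python
import string
import pandas as pd

s = pd.Series(range(10), index=list(string.ascii_lowercase[:10]))

# Two label ranges cannot be requested in one slice,
# so slice twice and concatenate the pieces.
parts = pd.concat([s.loc["a":"c"], s.loc["h":"j"]])
```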

03:42

And we have our new series. One more thing about pandas indexes: you would probably expect an index to be unique. But pandas indexes do not have to be unique, as we can see here.

04:01

So here we are relabeling our series again. I just took the letters of "gattaca" plus "xyz" here to relabel our series. So we have gattacaxyz. And can we use the loc method to ask for what's at g? Yes, we can. Can we still slice it?

04:21

Say we ask, hey, give us g to a. And no, we can't. Because pandas is only able to slice if we have a subsequent run of unique values, as we can see here. If you use the loc method for x to z,

04:40

it works because x, y, z are unique in the index. So it's really nice and powerful. But that's something you must be really aware of when working with a pandas series. So now we know how to access data with a simple pandas index.
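The behaviour with a non-unique index can be reproduced with a sketch like this. The values are assumptions; only the labels follow the talk.

```python
import pandas as pd

s = pd.Series(range(10), index=list("gattacaxyz"))
print(s.index.is_unique)   # False: "a" and "t" occur more than once

g = s.loc["g"]         # a unique label returns a single value
a = s.loc["a"]         # a repeated label returns a series with all matches
xyz = s.loc["x":"z"]   # works: both slice bounds are unique labels

try:
    s.loc["g":"a"]     # the right bound "a" is ambiguous
    slice_failed = False
except KeyError:
    slice_failed = True
```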

05:00

So everybody's on the same page. What about two-dimensional or three-dimensional data? Let's have a deeper look at the index structure. As we learned, the labeling of a series is called the index. It's automatically created, if not given, from the data set you're importing

05:21

or however you create your data or add your data to pandas. It can be reset or replaced. As I already demonstrated, it's fairly simple to replace and reset your index. There's also a reset_index method, which will do all the lifting for you so you don't have to give an explicit index. It may only contain hashable objects,

05:43

so quite obviously you cannot put a set or a dict there to label stuff. And yeah, the index itself can already have one or more dimensions. And beware, it's not necessarily unique.
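A short sketch of replacing and resetting an index and of checking uniqueness. The labels and values are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["x", "y", "x"])
had_duplicates = not s.index.is_unique   # "x" appears twice

# Replace the index by assigning new labels ...
s.index = ["a", "b", "c"]

# ... or reset it to the default 0..n-1 labels.
# drop=True discards the old labels instead of keeping them as a column.
fresh = s.reset_index(drop=True)
```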

06:00

I usually work with unique indexes, but you can also do some fancy stuff with non-unique indexes, which we're not going to cover today. So we have multiple index types. We have the plain index, which we just saw and which is basically just the labels of a series. We have a multi-index, which I'm going to demonstrate later. We have a datetime index, which is actually my favorite.

06:23

And we're going to talk about it a lot. We also have a timedelta index and an interval index. And most recently, in the latest pandas version, the categorical index has been added, which can also be very useful. So what's the structure behind all this? Basically, the whole idea of the data series and data frame

06:43

is actually borrowed from the R language, which is the language of statisticians. So the structure is: we have some data. And just as a reminder, the data is, except for strings, NumPy under the hood.

07:01

So we have NumPy data types. And the series are actually typed as well. So it's not like in Python, where we can have a list of multiple types. It's strictly typed. That's also where the performance is coming from. And the series can be called a NumPy array with labels.

07:25

And what's the data frame? It's basically multiple series glued together by having the same index. So note we have multiple series, but we also have these labels there.
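The idea of typed series glued together by a shared index can be sketched like this. The column names and values are hypothetical.

```python
import pandas as pd

prices = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])
counts = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Each series carries exactly one NumPy dtype.
print(prices.dtype, counts.dtype)

# A data frame holds several series under one shared index.
df = pd.DataFrame({"price": prices, "count": counts})
```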

07:42

And there's also a three-dimensional structure. It's called Panel. But I only mention it to tell you that you can actually forget about it, because it has just been deprecated. Basically, you can achieve the same with multi-indexes.

08:01

So it was removed for simplification. So, the data frame. Basically, it's two-dimensional data, which is actually fairly simple to imagine. Let's create a new data set here. It's just a set of random integers. We see our automatically created index is back again.

08:22

The same applies to the column names. Each and every series is also referred to as a column. And this is referred to as a row. So how can we access data in a data frame?

08:40

Now, if we ask for a positional index, we no longer get the row values. We get the column now, because the data frame is indexed by columns first. So we get a series out. Of course, we can do the same for slicing.

09:01

But this is, I think, a logical break in the whole pandas API. It's very confusing, because if we slice, we get rows, which I think is a break. But once you get used to it, it's still handy. And we can also use the iloc method, for example, to slice just a part out of our data frame.

09:23

So this is axis 0 and this is axis 1. So we can just use the iloc method to slice a segment out of our data frame, which is really handy if we have bigger data frames.
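A sketch of the three access patterns just described. The talk uses random integers; here the values are consecutive numbers (an assumption) so the results are easy to follow.

```python
import numpy as np
import pandas as pd

# 6 rows and 5 columns with automatically created labels.
df = pd.DataFrame(np.arange(30).reshape(6, 5))

column = df[0]               # a single key selects a column and returns a series
rows = df[0:2]               # a slice selects rows instead
segment = df.iloc[1:3, 2:4]  # rows 1-2 on axis 0, columns 2-3 on axis 1
```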

09:42

So let's continue our adventure here. What if you want to slice two columns? It's very simple. We just pass in a colon to say, OK, as in Python, take the whole array

10:01

and then we can just ask for the columns. And sometimes it's a little bit confusing, all this axis stuff. I really had a hard time remembering it when I was new to pandas. And actually, I stumbled across a nice mnemonic, what we call an Eselsbrücke in German, which is just something to help you remember stuff.

10:20

So axis 0 is horizontal. And axis 1 is vertical. And it's fairly easy to remember, because a 1 is itself a vertical stroke. So this is how I basically remembered it, because I'm also one of these guys who always go, left, right, right, left.
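A sketch of selecting two columns with a colon for the rows, and of how the axis argument picks rows or columns. The exact call on the slide is not in the transcript, so the `iloc` form and the `drop` calls are assumptions used for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(6, 5))

# ":" keeps every row, the list picks the columns at positions 1 and 3.
two_columns = df.iloc[:, [1, 3]]

# The axis argument decides whether a label refers to a row or a column.
without_row = df.drop(0, axis=0)      # drops the row labelled 0
without_column = df.drop(0, axis=1)   # drops the column labelled 0
```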

10:40

I'm sorry. So OK, let's go further. I'm really having trouble reading here. Let me reconfigure my setup a bit. Sorry.

11:19

OK, let's relabel our index and the columns.

11:23

And it's fairly simple, as demonstrated before. Here, I'm just passing in a function to rename our rows and columns, with an R or a C followed by the number with a leading 0. So it's a little bit more memorable than just working

11:41

with numbers. And of course, we can still access the columns just as before. If we ask for column C05, we get the fifth column. Well, basically, the sixth one. The same, of course, applies to accessing the rows and to accessing segments.
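The renaming with a function might look like the following sketch. The exact label format on the slide is not in the transcript; `R00`/`C00` with two digits is an assumption based on the `C05` and `C10` names mentioned in the talk.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(64).reshape(8, 8))

# rename accepts a function per axis; here it builds zero-padded labels.
df = df.rename(index=lambda i: f"R{i:02d}", columns=lambda c: f"C{c:02d}")

col = df["C05"]                             # the sixth column
row = df.loc["R02"]                         # the third row
segment = df.loc["R01":"R03", "C02":"C04"]  # label slices include both ends
```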

12:02

So this is the same logic as I applied before with positional values. And how can we now add data in Pandas? Because we sometimes have data,

12:20

or often have data, from multiple sources. And how can Pandas actually help us glue together data from multiple sources? Here, the index becomes really handy. For example, here we're just adding a new series.

12:42

It's called C10. And Peter, I lost my timer. Sorry. So what do we have here? I create a new series. And you see the labeling is a little bit off.

13:01

And we just add it to our data frame, which is already in place. So we just say, OK, data frame, please add a new column called C10. And we pass the data frame the series we created here. And you see, we end up with NaN values here. And we also miss labels.

13:23

Because the index just does not match. So this can be really handy for joining multiple data frames, for example. Because we can also be more explicit about how we want to join the data. Because here, we just do the same.

13:43

And we just say how to join. This is the same logic as in SQL databases. So we just ask for an inner join. And then we only get the subset where both indexes match. And the rest is just dropped out

14:02

of our data frame. And of course, if you apply something like that, Pandas always returns a copy of the data. So if you want to keep this structure, you have to store it in a new variable or just overwrite the variable you're working with, which is sometimes forgotten.
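Index alignment on assignment and the inner join can be sketched as follows. The transcript does not show which call was used on the slide; `DataFrame.join` with `how="inner"` is an assumption, and all labels and values are made up.

```python
import pandas as pd

df = pd.DataFrame({"C00": [1, 2, 3, 4]}, index=["R00", "R01", "R02", "R03"])
# The new series is labelled "a little bit off": it starts at R02 and adds R04.
c10 = pd.Series([10, 20, 30], index=["R02", "R03", "R04"], name="C10")

# Assignment aligns on the index: unmatched rows get NaN, R04 is not added.
assigned = df.copy()
assigned["C10"] = c10

# An inner join keeps only labels present in both indexes.
# The result is a new data frame; df itself is unchanged.
inner = df.join(c10, how="inner")
```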

14:21

What else can we do? Of course, we can also do an outer join. And here, I'm using another really handy option, which is called inplace=True. Because inplace just instructs pandas to apply the changes to the data frame we're currently working on. So for an outer join, basically, we just say,

14:42

hey, join everything. We receive everything. And wherever we have no values, Pandas automatically adds NaN values.
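A sketch of the outer join on the same hypothetical data. Which call the speaker combined with inplace cannot be recovered from the transcript, so this sketch uses `DataFrame.join` and simply keeps the returned data frame.

```python
import pandas as pd

df = pd.DataFrame({"C00": [1, 2, 3, 4]}, index=["R00", "R01", "R02", "R03"])
c10 = pd.Series([10, 20, 30], index=["R02", "R03", "R04"], name="C10")

# An outer join keeps every label from both indexes
# and fills the gaps with NaN.
outer = df.join(c10, how="outer")
```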

15:00

And there's another really handy thing. We can also just instruct pandas to ignore errors if we want to join and something throws an exception. So this is a nice example here. So how do we get rid of data?

15:24

So we can use the drop method. Basically, I want to get rid of this column. And of course, we could just slice a column. But what if you just want the third column, the fifth, the 10th, the 20th column? You could just ask for them explicitly and join the data later.
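The drop method, together with the option to ignore missing labels, can be sketched like this. The column names are hypothetical; `errors="ignore"` is the pandas spelling of ignoring such errors.

```python
import pandas as pd

df = pd.DataFrame({"C00": [1, 2], "C01": [3, 4], "C10": [5, 6]})

smaller = df.drop(columns=["C10"])   # returns a copy without C10

# Dropping again would fail, because C10 is no longer there ...
try:
    smaller.drop(columns=["C10"])    # the default is errors="raise"
    raised = False
except KeyError:
    raised = True

# ... unless missing labels are explicitly ignored.
unchanged = smaller.drop(columns=["C10"], errors="ignore")
```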

15:41

But the drop method is much handier. So I just want to get rid of the newly created column C10. So we ask Pandas to just drop it. But if we don't put ignore there, it will throw an error if the column is not present. So ignoring errors can be really handy

16:02

if you are not sure whether a column is present in your data frame. So let's go to the multi-index. The multi-index is basically also a fairly simple data structure I want to introduce you to,

16:23

or index structure. So now we have a slightly different data set. We could imagine it's hotel prices. So we have a city, there's a price,

16:41

there's a certain rating, and the city is located in some country. So this is a fairly easy data set. We have some major cities here and my hometown, Mannheim, as well. I took the liberty of adding that. And now let's see what we can do with that.

17:03

Well, we group, because many people are not aware that if you do a groupby in Pandas, you actually get back a multi-index. For example, this is the groupby, and we ask for the mean, and Pandas will just go, as you probably know already,

17:21

and take all the data types you can actually take a mean of. And so we see these are columns like the rating. And we already see a hierarchical structure here. So we ask to group by country, city, and category, and we pass this in as a list, and Pandas will just create

17:41

this hierarchical data structure, in the same order as we do the grouping. So we have the country, the city, and then the category, and here are the mean values we were looking for. And this looks really nice, but it might be a little bit confusing.
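A minimal sketch of a groupby that yields a multi-index. The hotel data set from the talk is not in the transcript, so every value and the `category` labels below are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the hotel price data set.
hotels = pd.DataFrame({
    "country": ["Germany", "Germany", "Germany", "Italy", "Italy"],
    "city": ["Mannheim", "Mannheim", "Berlin", "Rimini", "Rome"],
    "category": ["budget", "luxury", "budget", "budget", "luxury"],
    "price": [60.0, 150.0, 70.0, 55.0, 200.0],
    "rating": [3.5, 4.5, 3.0, 4.0, 4.8],
})

# Grouping by a list of columns produces a hierarchical index,
# with the levels in the same order as the list.
means = hotels.groupby(["country", "city", "category"]).mean()
print(means)
```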

18:00

How can I access it, for example, if I'm interested in getting the data for the cities? Of course you could ask for these values and work down the path of the hierarchical index. But we can access it a little bit better. So let's have a closer look at the index.

18:24

It's really easy to look into the index in Pandas just by asking for .index. So what do we have here? We see we have a multi-index, and it also indicates we have levels. So we have three lists which are all basically,

18:41

thank you, kind of the new levels. And we can also ask for the index levels. So it's really easy to look into the data by level. We can also ask for the names back again. So Pandas is very explicit about what's stored. And we can also ask for the index values. And here you see how the multi-index actually works,

19:02

because actually it's just tuples. It's just tuples of the city, sorry, country, city, and the category we were looking for, and this is fairly simple. It's a very simple structure we can work through in our minds to get to the data.

19:20

And we can also directly access the data by just asking for the values by level. So here we are just asking to get back all the values we have on level two. And the same applies to level one. So I think this is fairly simple.
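Inspecting a multi-index can be sketched as follows, again on an invented hotel data set (all values are assumptions).

```python
import pandas as pd

hotels = pd.DataFrame({
    "country": ["Germany", "Germany", "Germany", "Italy", "Italy"],
    "city": ["Mannheim", "Mannheim", "Berlin", "Rimini", "Rome"],
    "category": ["budget", "luxury", "budget", "budget", "luxury"],
    "price": [60.0, 150.0, 70.0, 55.0, 200.0],
})
idx = hotels.groupby(["country", "city", "category"]).mean().index

print(idx.levels)   # one list of distinct labels per level
print(idx.names)    # the level names
print(idx.values)   # the index entries, stored as tuples

categories = idx.get_level_values(2)   # by level number
cities = idx.get_level_values("city")  # or by level name
```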

19:42

These are just two more examples. We can also use the loc method to ask for all the data which is stored under the first level here, country. So here I'm just asking, hey, please give me back all the data for the country Germany.

20:02

And we can even access the hierarchy by passing in a list, and pandas will just go and match the list we passed in against the tuples it has stored in the multi-index. So it's fairly simple to access the data you want.
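Selecting from a multi-index with loc might look like this sketch. The data is invented, and the second selection uses a tuple, which is how current pandas walks down the levels; whether the slide used a list or a tuple is not clear from the transcript.

```python
import pandas as pd

hotels = pd.DataFrame({
    "country": ["Germany", "Germany", "Germany", "Italy", "Italy"],
    "city": ["Mannheim", "Mannheim", "Berlin", "Rimini", "Rome"],
    "category": ["budget", "luxury", "budget", "budget", "luxury"],
    "price": [60.0, 150.0, 70.0, 55.0, 200.0],
})
means = hotels.groupby(["country", "city", "category"]).mean()

# Everything stored under the first level value "Germany".
germany = means.loc["Germany"]

# A tuple is matched level by level against the stored index tuples.
mannheim = means.loc[("Germany", "Mannheim"), :]
```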

20:23

So I really want to spend some time talking about my favorite index, which is the datetime index, because a lot of data, basically almost all data, has some timestamp on it. Let me introduce you to our data set for this little exercise.

20:40

It's a fairly simple data set. It's just a timestamp and a temperature value, taken from an open data set from the city of Aarhus. And this is what the data looks like. This is when we just plot the data as it comes in. And let's create the datetime index for our data frame, which is fairly simple.

21:03

We can just use the to_datetime function, which is built into pandas and is actually there to parse datetime values. It does most of the heavy lifting for you, but you can also be very explicit about how your datetime string is structured.
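Creating a datetime index can be sketched like this. The Aarhus data is not part of the transcript, so the column names, timestamps, and temperatures are assumptions.

```python
import pandas as pd

# Hypothetical rows that arrive out of order.
raw = pd.DataFrame({
    "timestamp": ["2014-08-01 12:00:00", "2014-08-01 00:00:00", "2014-08-02 06:00:00"],
    "temperature": [21.0, 15.0, 17.5],
})

# Parse the strings; the format argument is optional but makes the layout explicit.
raw["timestamp"] = pd.to_datetime(raw["timestamp"], format="%Y-%m-%d %H:%M:%S")

# Use the parsed column as the index and put the rows in time order.
df = raw.set_index("timestamp").sort_index()
```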

21:23

So we just rely on this format here. This is the default format. And yep, now we have an index. And what are our discoveries once we have created the index? We just do the same plot again. Before, we said, oh wow, this is really

21:40

going up and down fairly randomly. And now we see, okay, this looks more like a time series of temperature values across multiple days. And you see, that's one of the great things about pandas, that everything works really well together. So we don't need to instruct anything in matplotlib here

22:02

about how and in which order we want to present our data. Once we have a datetime index, pandas does the heavy lifting for us in matplotlib. So what else can we do? Yeah, this is just a closer look at the datetime index. So you actually see timestamps here.

22:22

You can also notice we have a datetime type. "timestamp" is the name of the index. This is the value count, the length. And we see the datetime index also supports frequencies, which we're not going to cover today. But it's also fairly neat to work with frequencies in pandas.

22:43

So let's take the data we have in the index, which is timestamps, and use the index for grouping the data. And let's just count. And as you already see here, I'm not asking for the index as such,

23:01

which has one-second granularity. We are just asking for index.date. And it's already built in. So we can easily use the datetime index to group data by days without doing anything. And we can also do something else.
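Grouping by the date part of a datetime index can be sketched as follows. The hourly sample data is an assumption standing in for the temperature data set.

```python
import numpy as np
import pandas as pd

# Three days of hourly readings (hypothetical values).
index = pd.date_range("2014-08-01", periods=72, freq="60min")
df = pd.DataFrame({"temperature": np.arange(72, dtype=float)}, index=index)

# index.date strips the time of day, so grouping by it groups by calendar day.
per_day = df.groupby(df.index.date).count()
```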

23:22

There's also the week. And we can just chain the methods here and say mean and plot it. Thank you. And we can also use the index to ask which days are weekdays and which are weekends.
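The weekend selection can be sketched like this, assuming hourly sample data. In pandas, `weekday` counts from Monday as 0, so Saturday and Sunday are 5 and 6.

```python
import numpy as np
import pandas as pd

# Two weeks of hourly readings starting on a Monday (hypothetical values).
index = pd.date_range("2014-08-04", periods=14 * 24, freq="60min")
df = pd.DataFrame({"temperature": np.arange(len(index), dtype=float)}, index=index)

# Boolean index: keep only rows whose weekday is Saturday (5) or Sunday (6).
weekend = df[df.index.weekday >= 5]

# Mean temperature per hour of the day, weekends only.
by_hour = weekend.groupby(weekend.index.hour).mean()
```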

23:40

This is also a little logical break in pandas, in my opinion. For example, if we parse datetime objects, it's very friendly to US dates. And as you of course know, the US is the only country with month, day, year, which can be really troublesome. But here, for example, pandas is zero-indexed,

24:04

but zero is Monday. In the US it should be Sunday. So this is the more European way to count weekdays. So what are we doing here? We are just getting data from the index

24:21

and then we just use a Boolean index and ask for the days five and six. And so we get the weekends back, and then we just put everything together and ask for the hour of the data we have combined here. And so we can actually find in our data set

24:40

that the temperature, at least at weekends, is higher, which I think is a good sign if you live in Denmark. So these are probably sunny weekends, but it's a small data set, so it has no significance. What else can we do? We can also just ask for a date, and we can pass in a whole month here

25:02

as a string. So this is just a year and a month, and we get the temperature plotted, so it's a very, very powerful index. It really saves you a lot of time making up your mind about, okay, what lambda functions or anything else do I need to write. Once you have a datetime index,

25:20

basically datetime is at your fingertips. So what else? We can also ask for ranges, just a little slicing by dates, which I think is also pretty neat and very useful. And this is probably not as useful, but just to show you a little: we can also ask for the hour of the index

25:41

and combine conditions; this is like an and statement. So we ask for all the data in our data frame where the hour was greater than 12 o'clock but only until 1600 hours. And then we can just plot it.
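Selecting by month string, by date range, and by hour of day can be sketched as follows. The sample data and the chosen dates are assumptions.

```python
import numpy as np
import pandas as pd

# Five days of hourly readings spanning a month boundary (hypothetical values).
index = pd.date_range("2014-07-30", periods=5 * 24, freq="60min")
df = pd.DataFrame({"temperature": np.arange(len(index), dtype=float)}, index=index)

august = df.loc["2014-08"]                     # a year-month string selects the whole month
some_days = df.loc["2014-07-31":"2014-08-01"]  # date slices include both end dates

# Two conditions on the hour, combined with "&" (a logical and).
afternoon = df[(df.index.hour > 12) & (df.index.hour <= 16)]
```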

26:01

I don't know what it's going to be used for, but I think it's good. And of course, once you have the data with the datetime index, you can also do resampling, which is super cool. So let's resample a little bit. Here's our data set. We just call the resample method and pass in D.

26:21

D basically means resample by day. And then we can aggregate data or ask, okay, what's the maximum? And we immediately get the maximum values back for each and every day, in a resampled fashion. And we can

26:40

do the same and just ask by month, which is M, which is quite accessible. We can also resample our data frame by day and ask for an aggregate, which is also a really,

27:00

the agg function is also really handy, because we can ask for the minimum and the maximum and just plot it. So these are the minimum and maximum values for each and every day. And the last and most useful thing for resampling I want to show you is that we can even resample by three days. So basically, we're very flexible on the intervals we can resample by.
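The resampling steps can be sketched like this, assuming hourly sample data. The monthly case is left as a comment: the talk uses the alias `M`, while newer pandas releases spell month-end as `ME`.

```python
import numpy as np
import pandas as pd

# Six days of hourly readings (hypothetical values).
index = pd.date_range("2014-08-01", periods=6 * 24, freq="60min")
df = pd.DataFrame({"temperature": np.arange(len(index), dtype=float)}, index=index)

daily_max = df.resample("D").max()                    # one row per day
daily_min_max = df.resample("D").agg(["min", "max"])  # several aggregates at once
three_days = df.resample("3D").mean()                 # bins of three days

# Monthly resampling works the same way with the month alias,
# for example df.resample("M") in the pandas version used in the talk.
```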

27:21

So if you want to have three days, one day, 12 hours, 11 hours, anything, this is super flexible. I thought it was actually a little bit hard to find out what is what. So let me present you this slide with all the ways you can resample. So I've taken the liberty, with the ones I've found most useful,

27:40

to put them on the left side. But Pandas was actually developed at a hedge fund; Wes McKinney was working at a hedge fund when he started Pandas. So you have a lot of business time frames there as well. And so basically, you can resample by anything you can probably think of.

28:01

And that's the end of my little presentation. Thank you very much for your attention. So thank you, Alexander. I think we have one quick question. Otherwise, okay.

28:22

Just a question: this timestamp is limited, because the NumPy datetime type is based on nanoseconds. Do you have any idea if you can use a different resolution? Because I would like to span a bigger time range, and I don't care about nanoseconds. Seconds would be totally fine. Do you have any idea if this is possible? No, actually, I was really happy with the datetime so far.

28:42

No, I have never stumbled across it. But let me know if you find something. Okay, thanks again.
