Neat Analytics with Pandas Indexes

Video in TIB AV-Portal: Neat Analytics with Pandas Indexes

Formal Metadata

Neat Analytics with Pandas Indexes
Title of Series
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Neat Analytics with Pandas Indexes [EuroPython 2017 - Talk - 2017-07-12 - Arengo] [Rimini, Italy]
Pandas is the Swiss multipurpose knife for data analysis in Python. In this talk we will look deeper into how to gain productivity utilising Pandas' powerful indexing and make advanced analytics a piece of cake. We will cover: Pandas indexing recap; index types; time-series index and resampling; Pandas Multi-Indexin
OK, thank you for the introduction. You already know about me, so let's get to the point. Today I'm going to talk about pandas, and about pandas indexing in particular. The index is a very powerful tool, and I think it is very often skipped in beginners' tutorials; even when they mention the index, they rarely look at it closely. So this is a closer look at the index. We're going to do a little catch-up on indexing and how we can access data with the index, then look at index types and multi-indexes, and take a closer look at the time index. The very beginning is just a repetition to get everybody on the same page. Pandas is basically built on the Series. This is a simple example of a Series: we take some random integers, make up some data, and create a Series. Basically it's like a list or an array of numbers, but one thing you see here is that we already have this extra column. It is called the index, and it is the labeling. So you can say it's like a list, but not a plain Python list: it's already labeled, and underneath it is actually a NumPy array, which we can see from the dtype. So it is an array with labels, which I think most of you already know. So how can we access the data in a Series?
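A minimal sketch of the Series described here. The values, the seed and the length are assumptions; the talk's actual numbers are not in the transcript.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten random integers (the values used in the talk are unknown).
rng = np.random.default_rng(seed=0)
s = pd.Series(rng.integers(0, 100, size=10))

print(s)                    # left column: the automatically created index 0..9
print(s.index)              # RangeIndex(start=0, stop=10, step=1)
print(s.dtype)              # one NumPy dtype for the whole Series
print(type(s.to_numpy()))   # the values are backed by a NumPy array
```

The point of the sketch is only that the labels exist even when none were supplied.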
Basically it's a very familiar concept called slicing. We can access the data by positional index, as we would with a list, and we can slice just as we would with NumPy arrays. We can also use a method for that, called iloc; note that it takes square brackets, not round brackets, to slice. As I mentioned, we already have labels in our index, even in a Series, so we can also relabel it. What we are doing here is setting the index, which is an attribute of the Series: we just take letters and relabel it. Now we have exchanged our numerical zero-based index for letters. We can still access values by position, exactly as before, but now we can also slice by labels, and we get back the entries labeled e to f.
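The access patterns above can be sketched as follows. The values and the letters are placeholders, and `.loc` is used for the label slice to keep the two styles visibly apart.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

first = s.iloc[0]        # position 0; iloc is indexed with square brackets, not called
middle = s.iloc[2:5]     # positions 2, 3, 4 (the end position is excluded)

# Replace the numerical index with letters.
s.index = list("abcdefghij")

by_position = s.iloc[4:6]    # positions still work after relabeling
by_label = s.loc["e":"f"]    # label slice: both "e" and "f" are included
```

Note the asymmetry: positional slices exclude the end, label slices include it.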
So pandas will just look at the labels and give the result back. This is already somewhat confusing for beginners. Can a Series be sliced by multiple slices at once? No, it can't. What we can do is use another method called concat: we slice the Series and concatenate the pieces again, and we have our subset. One more thing about pandas indexes: you would probably expect an index to be unique, but pandas indexes are not necessarily unique, as we can see here. Here we are relabeling our Series again, with some letters that repeat and with x, y, z. Can we use the loc method to ask for everything labeled g? Yes, we can. Can we slice starting from g? No, we can't, because here label slicing only works if the labels we slice on are unique. As we can see, if we use the loc method for x to z, it works, because x, y, z are all unique. That's really nice and powerful, but it's something you must be really aware of when working with a pandas Series. So now we know how to access data in a simple pandas index and everybody is on the same page. What about two-dimensional or three-dimensional data? Let's have a deeper look at the index structure.
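A small sketch of both points, concatenating slices and slicing on a non-unique index. The labels are invented for illustration; the duplicated label is deliberately placed so that its occurrences are not adjacent.

```python
import pandas as pd

s = pd.Series(range(8), index=["a", "g", "b", "g", "c", "x", "y", "z"])

# Two separate slices cannot be requested at once, but they can be glued together.
pieces = pd.concat([s.iloc[0:2], s.iloc[5:8]])

# A duplicated label returns every matching entry.
all_g = s.loc["g"]

# Slicing between unique labels works ...
xyz = s.loc["x":"z"]

# ... but slicing from a duplicated label in an unsorted index raises KeyError.
try:
    s.loc["g":"x"]
    slice_failed = False
except KeyError:
    slice_failed = True
```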
As we learned, the labeling of a Series is called the index. It is automatically created if it is not given by the dataset you are importing or by however you create your data. It can be reset or replaced, as already demonstrated; it's fairly simple to replace and reset your index. There is also a reset method that does the lifting for you, so you don't have to give an explicit index. An index can only contain hashable objects, so quite obviously you cannot put a set or a dict in there to label things. It can have more than one dimension, and it does not have to be unique. I usually work with unique indexes, but you can also do some fancy stuff with non-unique indexes, which we're not going to cover today. We have multiple index types. There is Index, which we just saw and which is basically the labeling of a Series. We have MultiIndex, which I'm going to demonstrate later. We have DatetimeIndex, which is actually my favorite and which we are going to talk about a lot. We also have TimedeltaIndex and IntervalIndex, and most recently, in the latest pandas version, CategoricalIndex has been added, which can also be very useful. So what's the structure behind all this? The basic
ideas of the Series and the DataFrame are actually borrowed from the R language, which is the language of statisticians. As for the structure: the data in there, except for strings, is NumPy data, so we have NumPy data types. A Series is also bound to one type. It's not like in Python, where a list can hold multiple types; it is strictly typed, and that is where the performance comes from. So a Series is a NumPy array with labels. A DataFrame is basically multiple Series glued together by having the same index. So now we have multiple Series, but we still have these labels. There's also a three-dimensional structure called Panel, but I only mention it to tell you that you don't have to feel bad about not knowing it, because it has just been deprecated: you can achieve the same with multi-indexes, so it was dropped for simplification. A DataFrame is two-dimensional data, which is still fairly simple to imagine. So let's create a new set here, just a set of random integers. We see our index was automatically created again, and the same applies to the column names. Each Series is referred to as a column, and each entry along the index is referred to as a row. So how can we access data in a DataFrame? If we ask with a single positional key, we no longer get the row values; we get the column, because a DataFrame is indexed by columns first, so we get a Series out. Of course we can do the same with slicing, but this is where I think the logic of the whole pandas API breaks, and it's very confusing: if you slice, you address rows, which I think is a break.
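The column-versus-row behaviour can be sketched like this. The shape and the seed are assumptions; only the access pattern matters.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
df = pd.DataFrame(rng.integers(0, 100, size=(6, 4)))   # hypothetical 6x4 frame

col = df[0]       # a single key selects a *column* and returns a Series
rows = df[0:2]    # a slice selects *rows* and returns a DataFrame

print(df.dtypes)  # every column carries exactly one dtype
```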
But once you get used to it, it is fine. We can also use the iloc method, for example to slice just part of our data frame. This is axis 0 and this is axis 1, so we can use iloc to slice, say, the second row out of our data frame, which is really handy when working with data frames. So let's continue our adventure. What if you want to slice columns? That is very simple too: we just pass a colon, the two dots, to say, as in Python, take the whole array along that axis, and then ask for the column. Sometimes all this axis stuff is a little bit confusing, and I really had a hard time remembering it when I was new to pandas. Then I stumbled across a mnemonic (we have a word for this in German), which is just something to help you remember things.
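A short sketch of iloc on both axes, using a small frame with predictable values so the results are easy to check by eye.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4))

second_row = df.iloc[1]       # axis 0: one row, returned as a Series
second_col = df.iloc[:, 1]    # ":" takes everything along axis 0, then column 1
two_cols = df.iloc[:, 0:2]    # all rows, first two columns
block = df.iloc[1:3, 2:4]     # rows 1-2 and columns 2-3
```

The first position inside the brackets always addresses axis 0 and the second addresses axis 1.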
Axis 0 is horizontal, the rows, and axis 1 is vertical, the columns. It's fairly easy to remember, because a 1 looks just like a column. I remember it this way because I am also one of those people who always mix up left and right, sorry. So let's
go further. I have trouble reading this on my screen; let me reconfigure my setup, sorry. OK. Let's relabel our index and the columns. It's fairly simple: as demonstrated before, we could assign new labels, but here I'm passing a function to the rename method to rename all rows and columns, giving each a letter followed by a number with leading zeros. Now it's a little more memorable than just looking at numbers. Of course we can still access rows as before: asking for the row labeled 05 gives us what is basically the sixth one. The same applies to accessing columns and to accessing segments; it is the same logic as before with positional values. Now, how can we add data to this? Often we have data from multiple sources, so how can pandas help us put data from multiple sources together? This is where the index becomes really handy. For example, here we are adding a new Series called C10. We create a new Series, and you see its labeling is a little bit off compared to the data frame that is already in place. We just say: add a new column called C10, and pass in the Series we created. You see we end up with NaN values, and we also lose labels, because the index just does not match. This can be really handy for joining multiple data frames, because we can also be more explicit about how we want to join the data. Here we do the same, but we say how to join; it's the same logic as in SQL databases. We ask for an inner join, and then we only get the subset where both indexes match.
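The relabeling and the index alignment can be sketched as below. The label formats (`R00`, `C00`) and all values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))

# rename accepts one function per axis (the label formats are made up here).
df = df.rename(index=lambda i: f"R{i:02d}", columns=lambda c: f"C{c:02d}")

# A new Series whose labels only partly match the frame's index.
extra = pd.Series([100, 200, 300], index=["R02", "R03", "R04"], name="C10")

assigned = df.copy()
assigned["C10"] = extra               # aligned on the index: R00 and R01 become NaN, R04 is lost

inner = df.join(extra, how="inner")   # keeps only the labels present on both sides
```

Plain assignment keeps the frame's own index, while the explicit join lets you choose which labels survive.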
The rest is simply dropped. And of course, whenever you apply something like that, pandas always returns a copy of the data, so if you want to keep the result you have to store it in a new variable or overwrite the variable you are working with, which is sometimes forgotten. What else can we do? We can also do outer joins. Here I'm using another really handy option, inplace=True, which applies the changes directly to the data frame you are currently working on. With the outer join we say: join everything. We receive everything, and wherever we have no values, pandas automatically adds NaN values. There's another really handy thing: we can instruct pandas to ignore errors if an operation would otherwise throw an exception. And how do we get rid of data? We can use the drop method. Say I want to get rid of this column. Of course we could just slice around it, but what if it is the third column, or the fifth, or the tenth, or the twentieth? You could slice explicitly and join the pieces together again later, but the drop method is much more handy. So I want to get rid of the newly created column, and we just ask to drop it. If we don't tell pandas to ignore errors, it will throw an error when the column is not present, so ignoring errors can be really handy if you are not sure whether a column is present in your dataset.
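A sketch of the outer join and of drop with ignored errors. The frame and its labels are invented; `inplace=True` and `errors="ignore"` are shown on `drop`, where both parameters exist.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=["R00", "R01", "R02"],
    columns=["C00", "C01"],
)
extra = pd.Series([100, 200], index=["R02", "R03"], name="C10")

# Outer join: keep every label from both sides and fill the gaps with NaN.
# join returns a new frame, so the result has to be stored.
outer = df.join(extra, how="outer")
n_missing = int(outer.isna().sum().sum())

# Drop a column in place; axis=1 addresses the columns.
outer.drop("C10", axis=1, inplace=True)

# Dropping it again would raise KeyError, unless errors are ignored.
outer.drop("C10", axis=1, inplace=True, errors="ignore")
```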
Let's go on to the MultiIndex. The MultiIndex is basically also a fairly simple data structure, and I want to introduce you to its index
structure. Now we have a slightly different dataset, one with some more features. You could imagine it as something like food prices: we have a city, a price, a certain rating, and the country the city is located in. It's a fairly easy dataset with some major cities in it. Now let's see what we can do with that. Well, we group, because many people are not aware that if you do a group-by in pandas, you actually get back a MultiIndex. This is the group-by, and we ask for the mean, which pandas computes for all the data types where you can actually take a mean. We see the mean values here, and you already see a hierarchical structure. We asked to group by country, city and category, and we passed that in as a list; pandas creates this hierarchical data structure in the same order in which we do the grouping. So we have the country, the city, then the category, and the mean values we were looking for. This reads really nicely, but it might be a little confusing how to access the data, for example if I'm interested in getting the data for the cities. Of course you could ask for these values by walking down the path of the hierarchical index, but direct access would be better. So let's have a closer look at the index. It's really easy to look into the index in pandas, just by asking for .index. What we have here is a MultiIndex, and it also indicates that we have levels: we have three lists, one per level. We can also ask for the index levels, so it's really easy to look into the data by level, and we can also just ask for the names.
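A minimal sketch of a group-by that produces a MultiIndex. The rows, column names and values below are made up; the talk's dataset is not reproduced in the transcript.

```python
import pandas as pd

# Made-up rows standing in for the dataset shown in the talk.
data = pd.DataFrame({
    "country":  ["Germany", "Germany", "Germany", "Italy", "Italy", "Italy"],
    "city":     ["Berlin", "Berlin", "Hamburg", "Rome", "Rome", "Rimini"],
    "category": ["pizza", "pasta", "pizza", "pizza", "pizza", "pasta"],
    "price":    [8.0, 9.5, 7.5, 6.0, 7.0, 8.5],
    "rating":   [4, 3, 5, 5, 4, 4],
})

grouped = data.groupby(["country", "city", "category"]).mean()

print(type(grouped.index).__name__)   # MultiIndex
print(grouped.index.names)            # level names, in grouping order
print(grouped.index.levels[0])        # unique labels of the first level
```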
We can also ask for the index values, and there you see how the MultiIndex actually works: it is just tuples, tuples of the country, the city and the category we are looking at. This is a fairly simple structure that we can keep in mind when getting at the data. We can also access the data directly by asking for the values by level: here we are asking pandas to give back all the values we have on level 2, and the same applies to level 1. I think that is fairly simple. Here are two more examples. We can also use the loc method to ask for all the data stored under a label on the first level, country: here I'm asking, please give me back all of the data for the country Germany. We can even access deeper into the hierarchical index by passing in a list, and pandas will match what we pass in against the tuples stored in the MultiIndex. So it's very simple to access the data you want.
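The level-based access can be sketched as follows, again with invented labels and prices. A tuple is used for the full path, since pandas reads a tuple as one path through the levels and a list as several labels on a single level.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("Germany", "Berlin", "pasta"), ("Germany", "Berlin", "pizza"),
     ("Germany", "Hamburg", "pizza"), ("Italy", "Rimini", "pasta"),
     ("Italy", "Rome", "pizza")],
    names=["country", "city", "category"],
)
prices = pd.Series([9.5, 8.0, 7.5, 8.5, 6.5], index=index, name="price")

print(prices.index.values[:2])                   # the index is an array of tuples

categories = prices.index.get_level_values(2)    # all values stored on level 2
cities = prices.index.get_level_values("city")   # levels can also be addressed by name

germany = prices.loc["Germany"]                             # everything below one first-level label
berlin_pizza = prices.loc[("Germany", "Berlin", "pizza")]   # full path, given as a tuple
```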
I really want to spend some time talking about my favorite index, the DatetimeIndex, because a lot of data, almost all data, has some timestamp. Let me introduce
our dataset for this little exercise. It's very simple: just a timestamp and a temperature value, taken from an open dataset from the city of Aarhus. This is what the data looks like, and this is what we get if we just plot the data as it comes in. Now let's create our data frame with a time index, which is fairly simple: we can use the to_datetime method, which is built into pandas. It is there to parse datetime values, and it does most of the heavy lifting for you, but you can also be very explicit about how your datetime string is structured. Here we are just relying on the default format. Now we have an index, and what does it mean that we have created the index? If we do the same plot again: the first one looked like random values going up and down, and now we see, OK, this looks more like a stream of temperatures across multiple days. This is one of the great things about pandas, that everything works really well together: we don't need to instruct matplotlib how to represent the data or in which order. Once we have the DatetimeIndex, it does the lifting for us and for matplotlib. What else can we do? We can take a closer look at the DatetimeIndex. We see there are actually timestamps here, and you can also see the dtype, the name of the index, which is timestamp, and the length. It also supports frequencies, which we're not going to cover today, but it is fairly easy to work with frequencies in pandas. So let's take the data frame we have, with an index of timestamps, use the index to group the data, and just count.
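A sketch of turning a timestamp column into a DatetimeIndex. The readings and column names are invented; the open dataset used in the talk is not reproduced here.

```python
import pandas as pd

# Made-up readings standing in for the temperature dataset.
raw = pd.DataFrame({
    "timestamp": ["2017-07-01 00:00:00", "2017-07-01 12:00:00",
                  "2017-07-02 00:00:00", "2017-07-02 12:00:00"],
    "temperature": [14.5, 21.0, 15.0, 23.5],
})

# The format argument is optional; it spells out the layout explicitly.
raw["timestamp"] = pd.to_datetime(raw["timestamp"], format="%Y-%m-%d %H:%M:%S")
df = raw.set_index("timestamp")

print(df.index)   # a DatetimeIndex of Timestamp values, named "timestamp"
```

With the index in place, `df.plot()` would put time on the x-axis without further instructions (plotting is left out so the sketch has no matplotlib dependency).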
As you see here, I am not asking for the index as it is, which has second granularity; I just ask for index.date, which is already built in. So we can easily use the time index to group data points by day without doing anything else. There is also the week, and we can chain the methods here: take the mean and plot it. We can also use the index to ask which days are weekdays and which are weekend days. This is also a bit of a logical break in my opinion. For example, when parsing datetime objects, pandas is very friendly to the US, and as you of course know, the US is the only country that uses month, day, year, which can be really troublesome. But here, for example, the weekday is zero-indexed with zero being Monday, whereas in the US it would be something else, so this is more the European way of counting weekdays. So what are we doing here? We get the weekday from the index and then use a Boolean index to ask which days are 5 and 6, so we get the weekends back. Then we put everything together and group by the hour of the data we have just selected. So we can actually find in our dataset that the temperature is higher on weekends, which I think is a good sign if you live in Denmark. These were probably just some sunny weekends; beyond that it has no significance. What else can we do? We can also ask for a date, and we can just pass the whole month as a string, that is, just the year and month, and get the temperatures recorded in it. So it's a very powerful index, and it usually saves a lot of time. Otherwise you would have to write lambda functions or the like; once you have a time index, datetime is basically at your fingertips. What else? We can also ask for ranges, slicing by dates.
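These operations can be sketched on synthetic data. The hourly readings below are an assumption, chosen to span two weeks and a month boundary so that each selection returns something recognisable.

```python
import numpy as np
import pandas as pd

# Made-up hourly readings for 14 days, starting Monday 2017-06-26.
idx = pd.date_range("2017-06-26", periods=14 * 24, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"temperature": np.linspace(10.0, 25.0, len(idx))}, index=idx)

per_day = df.groupby(df.index.date).count()     # group by calendar day

is_weekend = df.index.weekday >= 5              # Monday is 0, so 5 and 6 are Sat/Sun
weekend = df[is_weekend]
weekend_by_hour = weekend.groupby(weekend.index.hour).mean()

july = df.loc["2017-07"]                        # partial string: a whole month
some_days = df.loc["2017-07-05":"2017-07-06"]   # date slice, both days fully included
```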
I think that is pretty neat and very useful. This one is probably not as useful, but I'd like to show it to you anyway: we can also ask for the hour of the index and use it to build something like a where statement. Here we ask for all the data in our data frame where the hour was greater than a certain time of day, and we can also bound it at the other end. I have not had much use for it, but I think you might. And of course, once you have a data frame with a time index, you
can also do resampling, which is super cool. So that's resampling.
Here is our dataset again. We just call the resample method and pass in 'D', which means we resample by day. Then we can aggregate, say with the maximum, and we immediately get back the maximum values for each and every day, in a very readable fashion. We can do the same and just ask for the mean, which is also accessible. You can also resample the data frame by day and aggregate with the agg function, which is really handy because we can ask for the minimum and the maximum together: these are the minimum and maximum values for each day. The last and most useful thing about resampling I want to show you is that you can even resample by three days. It is really flexible about the intervals you can resample by: if you want three days, one day, 12 hours, 11 hours, anything, this is super flexible. I thought it was a little bit hard to find out what all the options are.
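A sketch of the resampling calls described above, on synthetic hourly data (the values are simply 0, 1, 2, ... so the aggregates are easy to verify).

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2017-07-01", periods=6 * 24, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"temperature": np.arange(len(idx), dtype=float)}, index=idx)

daily_max = df.resample("D").max()                     # one row per day
daily_min_max = df.resample("D").agg(["min", "max"])   # several aggregates at once
three_day_mean = df.resample("3D").mean()              # a multiple of the unit works too
```

`agg` with a list of names returns one column per aggregate, nested under the original column name.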
So let me present how you can resample. I have taken the liberty of putting the ones I found most useful on the left side. Pandas was actually developed at a hedge fund; Wes McKinney was working at a hedge fund when he started it, so we have a lot of business time frames in there as well. So basically you can resample by anything you can probably think of. And that's the end of my little presentation. Thank you very much for your attention.
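The slide listing the resampling options is not part of the transcript. As a hedged illustration, here are three frequency aliases that pandas does provide: `"W"` (weekly), `"MS"` (month start) and `"B"` (business day). The data is synthetic.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2017-07-01", "2017-08-31", freq="D")
s = pd.Series(np.ones(len(idx)), index=idx)   # one made-up observation per day

weekly = s.resample("W").sum()     # weekly bins, labeled by the Sunday that ends them
monthly = s.resample("MS").sum()   # bins labeled by the first day of each month

# "B" is the business-day alias: only Monday to Friday are generated.
business_days = pd.date_range("2017-07-01", "2017-07-31", freq="B")
```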
[Audience question, largely unintelligible in the recording, about the resolution of the timestamps.] The timestamp is based on nanoseconds. [Second audience question, largely unintelligible.] No, actually I was really happy with what they have. I have never stumbled across anything missing, but let me know if you find something. Thanks again.