Pandas - not just for data scientists

Video in TIB AV-Portal: Pandas - not just for data scientists

Formal Metadata

Pandas - not just for data scientists
Title of Series
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Pandas - not just for data scientists [EuroPython 2017 - Talk - 2017-07-14 - Anfiteatro 1] [Rimini, Italy] This is not a tutorial. It's an attempt to expose experienced Pythonistas who are not data scientists to the powerful pandas library. Most Python developers don't use pandas (either because they have never heard of it, felt that its learning curve is too steep, or never thought that it would be useful for them). I intend to talk about Python's performance limitations and show how pandas can be used to overcome some of these limitations. The talk will be accompanied by a live Jupyter Notebook session that will demonstrate a typical use of pandas.
Hi, I'm [name unintelligible], and I'm here to talk with you about pandas, not just for data scientists. I have a feeling that in the Python community data scientists use it almost exclusively, and they do recognize that it is a very powerful tool. Still, developers who are not data scientists are mostly hardly familiar with this tool, and I have a feeling that if more people outside the data science community knew about it, it could be very useful for them as well. So before I start I have three questions for you. Please raise your hand if you have heard of pandas before. So many of you. Keep it up if you have played with it, say in Jupyter, and are familiar with it. And the last one: keep your hand up if you use it in a production environment. OK, that's a lot of folks.

So, just to be clear about what this talk is: the talk is not for data scientists, and it is definitely not a tutorial. It's a 30-minute talk in which we go through other subjects too, so I can't even start to cover the minimum that you need for pandas. I do give here two references that you can use if you feel that pandas could be useful for you, and that's my intention: by the end of the talk you will feel that it's potentially a valuable tool for you, so you'll go to these references and dig deeper. But I'll have a short demo to walk you through some of the basic stuff, because I have a feeling that when you have a quick start and understand the basics, it makes things much easier.
Hopefully you are developers and not data scientists; that's my intended audience. So, about me: I have 30+ years of experience with different platforms and languages, but for the last six years it has been exclusively Python. I fell in love with this language, and it almost returned the joy of programming that I had in my youth, and I'm very grateful for that.

I work at [company name unintelligible]. We basically give loans to small businesses in the US. One of the major components of our system is the ability to approve or reject new customers and new deals, and for that we have a very talented team of data scientists who do a great job in an almost impossible time frame, sometimes in minutes, using all the data sources that we have. As part of my previous role as chief architect I worked a lot with that team, and I had to learn pandas in order to work with them and help them with their architecture and design decisions. As I learned it more and more, I realized that the other projects I work on with other developers could actually benefit from it, and this made us start using pandas outside the realm of data science as well. We saw a lot of benefits from that, and I'm going to show some of them here.

A lot of programming languages can be conceived as an interface between the human developer and the machine, and for me Python is probably the best option on the side of the human developer. When I'm developing in Python, the cognitive load is the minimum I can have when programming. But on the machine side of this reality, we all know that Python has its limits. Most of the time we don't really care about these limits; if you have a simple web app, you don't really care. But there are situations where these limits can be really problematic, and that is the focus of what we do here: how do we use pandas so that we still use Python and yet work around these limitations? So one of the things
that I think we all know is that if we use a specialized Python feature, we get much better performance. In this example, on the left side you see an implementation with a for loop, and as we all know, for is the most open and versatile tool that Python gives us for looping over things. A list comprehension in a way limits the things we can do; it is more specialized. If you use it, as you see here, you gain quite a meaningful advantage in performance, about 30% here. We usually use the specialized features not just because of performance but also because they are more idiomatic, more Pythonic, and make the code clearer and easier to follow. The next thing we can do is leverage the advantage of C: Python gives us a way to extend this idea of an optimized, specialized solution by implementing part of the code in C. An interesting thing I picked up here from a talk I heard: we sometimes think that C is the better option because it is a compiled language, but as the speaker shows very nicely in his talk, the performance cost in Python comes from the dynamic nature of the language, not from the interpretation part. He shows that interpretation may be responsible for around 30 percent overhead, but the dynamic nature of the language is responsible for something like 600 percent overhead in performance. As you know, a lot of libraries already take advantage of this ability; part of the standard library in Python is of course written in C, and NumPy and pandas are some of these libraries. As for pandas, just to give you a really short introduction: you can think of pandas as a kind of Excel. It's really a bad metaphor going forward.
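The for-loop versus comprehension comparison described above can be reproduced with a small timing sketch. The slide's actual code is not in the transcript, so the two functions below are assumptions chosen only to illustrate the idea; the measured gap depends on the machine and the Python version and will not necessarily match the roughly 30% quoted in the talk.

```python
import timeit


def squares_loop(n):
    """General-purpose for loop: append one element at a time."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Specialized feature: a list comprehension building the same list."""
    return [i * i for i in range(n)]


# Time both variants on the same input; the numbers are machine dependent.
N = 100_000
loop_seconds = timeit.timeit(lambda: squares_loop(N), number=20)
comp_seconds = timeit.timeit(lambda: squares_comprehension(N), number=20)

print(f"for loop:      {loop_seconds:.3f} s")
print(f"comprehension: {comp_seconds:.3f} s")
```

Both functions return the same list, so the only difference being measured is how the list is built.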
Still, it's a nice starting point: everything you can do with Excel, or in an environment where you have a database and an SQL query, you can do with pandas. It is based on NumPy, which provides a very efficient array implementation that is fixed-size at creation and in which each element has the same type. Both NumPy and pandas provide what they call ufuncs, which are vectorized versions of many useful operations that you can run over these arrays and DataFrames. If someone here is familiar with R's data frames, pandas is of course very similar to that, actually based on that. So how can we actually improve performance with pandas? What's the trick? It's quite simple. Again, it comes down to the dynamic nature of Python and making a trade-off, saying: OK, for this part of my calculation I don't really need the dynamic nature, I need the performance. If you look at this chart, on the right side you see a typical Python object, in this case a list object. The list refers to an array of objects, and each of these is actually a reference to another structure in memory, scattered all over. On the left side is the NumPy array, with very similar functionality. If you are willing to live with the assumptions NumPy makes, such as the same type for each element, then you have a contiguous memory structure, very similar to what you would have in C. Of course this can be much more efficient, especially with modern CPUs taking advantage of operations over contiguous memory. And this is of course just a small part of the ecosystem: if you start using pandas, eventually you will be able to use many more, maybe hundreds, of other tools and libraries that give you great performance and functionality, and that is part of the advantage of using pandas.
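The list-versus-array picture above can be made concrete with a short sketch. The data here is invented for illustration; the point is only that a NumPy array has one fixed element type and contiguous storage, and that arithmetic on it is applied to the whole array at once.

```python
import numpy as np

# A Python list holds references to separate int objects.
values = list(range(1_000_000))

# A NumPy array holds raw 64-bit integers in one typed block of memory.
array = np.arange(1_000_000, dtype=np.int64)

# Element-by-element work in pure Python...
doubled_list = [v * 2 for v in values]

# ...versus a vectorized operation (the ufunc np.multiply under the hood).
doubled_array = array * 2

print(array.dtype)                  # one type shared by every element
print(array.itemsize)               # bytes used per element
print(array.flags["C_CONTIGUOUS"])  # whether the data is one contiguous block
```

The trade-off is the one described in the talk: the array gives up per-element flexibility in exchange for a compact layout that compiled code can process quickly.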
So now I would like to walk you through a very simple demo. I'm going to use Jupyter, and I assume that everyone here is familiar with Jupyter. Let's see how we can do it.

So the
first part is really just the basic imports and some basic configuration options for pandas. Then the typical thing you start with is reading an input file; it can be CSV, JSON, whatever. In this example it is just a small sample CSV file with some boring sales information. Pandas provides a very nice way of loading this data; in this case, as you can see, it is just one method. The first thing you can do is just look at the DataFrame. A DataFrame is the basic structure, the "Excel sheet" you can imagine, that holds the data. You can easily browse through it and see the different columns and rows, so it's quite easy to get a first impression of the data you hold. You can also get info on the different columns. Where we see object here, it means that pandas holds Python objects, which is not a great thing, because in this case you don't really get the performance gain I talked about before: there is some indirection here, because pandas is still referring to Python objects. Ideally you would like to see things like floats or integers, which are native NumPy types, and that is where you gain the performance. If you look at the price here, you see that for some reason the price is an object. Usually object means that you have strings, because there is no native string type that pandas uses; it uses the Python string, and therefore object. And if you run sum, you get this weird result, which looks more like a string concatenation than a sum of all the numbers. If you look closely, you see the issue: somewhere here there is a comma, which is a common problem with CSV files. Sometimes you load them and the comma is interpreted as something that is not numerical, and then the column is inferred as a string rather than a number.
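The loading step and the non-numeric price column described above can be sketched as follows. The sample file from the demo is not part of the transcript, so the CSV text, the column names and the values here are made up for illustration.

```python
import io

import pandas as pd

# Made-up sales data; one price carries a thousands separator.
CSV_TEXT = """product,country,price,quantity
widget,US,100,3
gadget,IT,"1,200",1
widget,IT,150,2
"""

# One call loads the file into a DataFrame (StringIO stands in for a real file).
df = pd.read_csv(io.StringIO(CSV_TEXT))

print(df.head())  # browse the first rows
df.info()         # per-column types; price is not a numeric type here
print(df.dtypes)

# Because price was read as strings, "summing" it joins the text
# instead of adding numbers.
print(df["price"].sum())
```

The quantity column, which contains only plain digits, is inferred as an integer type, which is the situation the talk describes as ideal.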
So with pandas it's quite easy to really explore the data and see where these issues are located. I do that by using indexing on the DataFrame; I'm referring to the column df price. First I show here that if I use a ufunc on the column, called contains, I get a Series of Boolean results, each one a Boolean that represents whether the cell contains a comma or not. If I use this Series for indexing the DataFrame, so I go to df price and index it with the Series I got, then I get a filtered version of the DataFrame I started with. Pandas uses this Series as something that determines which rows I would like to see: for every True I get the row, and for every False it is filtered out. So here I see the issue: I have a price with a comma. Once I know that, it's quite easy, again using ufuncs, to deal with this issue. I can just use replace, and in doing so I can also change the type to int at the same time, to make sure that we are working with integers and not with strings. Now in info you see that it is int64, and if I run sum now I get the expected result, which is of course the sum of the prices. So pandas is great for data exploration and so on.
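The cleanup steps above (find the offending rows with a Boolean Series, then replace and convert) might look like the following minimal sketch. The DataFrame is invented for illustration and stands in for the demo's sales data.

```python
import pandas as pd

# Invented data: prices were loaded as strings, one with a comma.
df = pd.DataFrame({
    "product": ["widget", "gadget", "widget"],
    "price": ["100", "1,200", "150"],
})

# A Boolean Series: True where the price text contains a comma.
has_comma = df["price"].str.contains(",")

# Indexing with the Boolean Series keeps only the rows marked True.
print(df[has_comma])

# Strip the comma and convert the column to 64-bit integers in one step.
df["price"] = df["price"].str.replace(",", "", regex=False).astype("int64")

print(df["price"].dtype)
print(df["price"].sum())
```

When the separator is known in advance, `pd.read_csv` also accepts a `thousands=","` argument that performs this conversion while loading.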
In Python, coming from general-purpose programming or other fields, you often have some data structure and feel that it is so much work to really understand what kind of data you have. Once you are used to pandas, it becomes very easy to just explore the data and understand it more and more, and pandas provides a lot of functionality for doing that. I won't go over too many, because there are hundreds of these functions; I'm just going through a few simple but very useful samples. describe is something that pandas gives you very easily: for every numerical column you get the different statistical measurements, and you can also do it for a specific column. You can do a kind of value count: if you have a value that repeats itself, and in this situation we have different products, we can see the frequency of each product. Of course you can do all kinds of other things, and here you can see examples of more advanced features. The principle is very similar to R. As you can see, you can filter based on a range, between this number and this number, and get the rows that comply with this condition. You can create new columns; in this case I add a discount column. And then you can do a lot more advanced things, like applying statistics to different groups: here I group by product and describe each of these groups, and get all the statistical measurements. Pandas also provides a lot of visualization tools, simple ones like a histogram of all the numerical values, or you can choose one of them and just show it, or write any custom, specific chart that you need.
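The exploration functions mentioned above can be sketched on a small invented DataFrame. The column names, the price range and the 10% discount are assumptions for illustration, not values from the demo.

```python
import pandas as pd

# Invented sales data with an already numeric price column.
df = pd.DataFrame({
    "product": ["widget", "gadget", "widget", "gizmo", "gadget", "widget"],
    "price": [100, 1200, 150, 300, 1100, 120],
})

# Statistical summary of a numeric column (count, mean, std, min, quartiles, max).
summary = df["price"].describe()

# Frequency of each repeated value.
counts = df["product"].value_counts()

# Range filter: keep rows whose price lies between two bounds (inclusive).
mid_range = df[df["price"].between(100, 300)]

# New column computed from an existing one (hypothetical 10% discount).
df["discount_price"] = df["price"] * 0.9

# Statistics per group: one row of measurements for each product.
per_product = df.groupby("product")["price"].describe()

print(summary, counts, mid_range, per_product, sep="\n\n")
```

With matplotlib installed, `df.hist()` draws the histograms of the numeric columns mentioned in the talk.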
When I finish analyzing the data, I usually want to save it, and here too pandas provides very easy-to-use features, like saving to JSON and saving to CSV. You can see here the files I have, and I can read them back. You can also read with an SQL query and get the data. The last thing I show here I won't be able to run. If you are using Django, which we are, this is something that we use quite frequently: usually there is some kind of bridge between the Django data structures and pandas, and it is actually done very easily. You just use from_records on the pandas side, and on the Django side you just need to use values and specify the fields that you are interested in. Once you do that, you are in the pandas world, and then everything is available to you. The last thing I want to show you is actually about Jupyter: Jupyter provides a way to develop plug-ins, and there are hundreds of very useful plug-ins. I want to show you one of them here. I have my dataset here, and this is a pivot table, so I can examine the data more interactively. I can take, for example, country here, decide what I want to break it down by, and then choose the average and see the average price for different products in different countries. So that was a really quick glimpse of what pandas provides. And now back to my presentation; just a second.
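The save, reload and bridging steps above can be sketched without a running Django project. In this sketch in-memory buffers stand in for files, an in-memory SQLite database stands in for a real server, and a plain list of dicts stands in for what a Django `values(...)` call yields; all names and data are illustrative assumptions.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({"product": ["widget", "gadget"], "price": [100, 1200]})

# Save to CSV and JSON text, then read each back into a DataFrame.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

# Read the result of an SQL query straight into a DataFrame.
conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False)
expensive = pd.read_sql_query(
    "SELECT product, price FROM sales WHERE price > 500", conn
)
conn.close()

# Stand-in for a Django queryset's values("product", "price"): a list of dicts.
records = [
    {"product": "widget", "price": 100},
    {"product": "gadget", "price": 1200},
]
from_records = pd.DataFrame.from_records(records)

print(from_csv, from_json, expensive, from_records, sep="\n\n")
```

Selecting only the needed fields on the database side keeps the DataFrame small before any pandas work starts.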
So how fast is it, actually? I can show here a very simple test: on the left side a Python data structure, in this case a list, and on the right side pandas. You can see that for different operations, sum, filtering and multiplication, you get results that are more or less 30 times faster, which is not bad, right? If you have some operation in production that takes 60 minutes, and by introducing pandas it goes down to 2 seconds, that could really be quite a difference. I'll actually show two examples from things we experienced in production. In the first one, we had a sync process that gets information from a third party and has to run a lot of comparisons with data from our ORM in Django. By refactoring our solution to use pandas we were able to get results 15 times faster than before, and we actually got cleaner code. It's not something you see immediately, but once you are familiar with pandas and you know how to read and use it, the code is much cleaner. Being able to get this better performance, 15 times, was a very significant change for us. The second example was much more amazing. We had a case that used more intense calculations, using groupby and things that involve more complicated business logic, and there we got a solution that was 1,900 times faster. It really looked like a mistake; we looked at it again and again, and this is actually the result, and it's quite amazing to see. It's clear that we could have taken the original Python implementation and improved it without using pandas, but there is no way we could have gotten to such a clean and easy-to-follow solution. That's the power of pandas: it's not just that it gives you a much more performant solution, but the solution is
usually nicer and much easier to follow. So if you decide to use pandas, you should know that there is a pandas way, and this is something that newcomers to pandas sometimes miss. In order to really enjoy the benefits, you need to use ufuncs as much as possible. If there is no ufunc for what you need, you should use apply rather than iterating over the rows. Iterating over the rows is something you can do with pandas, but you should use apply, which is a kind of customized ufunc. And even if you get to the point where you need to iterate over the rows of a DataFrame, you still see some improvement relative to pure Python. There are some situations where the more intuitive solution is not the right one, and I'll show one of them here; it's something we actually saw in production. We had a situation where we had 50 thousand rows, as in this example, and we needed to apply a category according to some kind of mapping. To summarize this example: the category where the question mark is should be C, because the value falls in that range. The simple, straightforward approach was to use apply with a function called get_category that ran for each row. It seemed simple, but for some reason we saw that it took around 61 seconds. 61 seconds for this relatively simple thing we wanted to achieve. By digging into it more, we understood that we needed a change of perspective here. Instead of using apply to iterate over the rows with pandas, we iterate over the mapping, the grouping that we have, and we let pandas do its magic. When we did that, we got a result more than 2 thousand times faster: the original solution took 61 seconds, and after this change it took only 26 ms, which is of course quite a huge difference.
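The change of perspective described above, looping over the few mapping entries instead of over every row, can be sketched as follows. The production code is not in the transcript: the `RANGES` mapping, the column names and this `get_category` body are hypothetical, and timings will differ from the 61 seconds and 26 ms quoted in the talk.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping: category name -> inclusive (low, high) value range.
RANGES = {"A": (0, 99), "B": (100, 199), "C": (200, 299)}


def get_category(value):
    """Row-wise lookup: scan the mapping for the range containing value."""
    for name, (low, high) in RANGES.items():
        if low <= value <= high:
            return name
    return None


rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.integers(0, 300, size=50_000)})

# Intuitive approach: call a Python function once per row.
df["category_apply"] = df["value"].apply(get_category)

# Changed perspective: loop over the mapping (three entries here) and let
# pandas assign each category to all matching rows in one vectorized step.
df["category_mapped"] = None
for name, (low, high) in RANGES.items():
    mask = df["value"].between(low, high)
    df.loc[mask, "category_mapped"] = name

print(df.head())
```

Both columns hold the same categories; the second version performs a handful of whole-column operations instead of 50,000 Python function calls. For simple numeric ranges like these, `pd.cut` is another option.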
Very quickly: by using pandas in Jupyter you also get other benefits, beyond the very powerful tool that allows you to explore data. You also get the notebook, which you can run on multiple environments, and that can be very handy: you can write it in the development environment, then run the same thing in staging and production, share it, share the results, and so on. I'm almost out of time, so just to quickly summarize. From my perspective, the takeaway here is that you should learn pandas. It's not trivial, there is some kind of learning curve, but it's not that hard, and I think you can gain from it for non-data-science problems as well. When you do that, you will find it useful, of course, in data analysis, which is what data scientists do, but also in the sync processes I talked about before, in reports, in exports, in any process that deals with a lot of data. And when you use it, keep in mind that you have to be really flexible in the way you see things. It's not the simple, intuitive way of looking at data; sometimes you have to change your perspective, and when you do that you really get a lot of advantage. Thank you.
Thank you for this talk. [unintelligible] Well, we ran out of time, unfortunately. Is there a very quick question? Raise your hand.

First, thanks for your presentation. My question would be whether you have encountered a case where replacing a complex SQL query with some pandas logic was a good idea.

The question was whether we experienced a case where we had a very slow SQL query and, by replacing it with pandas, the performance improved. It really depends. Usually, if you have SQL that goes to the server, runs the logic there and gets back a very small amount of data, that is probably your best option, right? You have a query that runs on the server and gets only a very small portion of the data, and that's it. Sometimes, if the query is really too complicated and it really justifies the overhead of getting all the data into memory, then for sure you can do things in pandas that could be more efficient. Pandas actually also has a query language, a way to manipulate the data that is similar to SQL. But in the general case I wouldn't move from SQL to pandas because of a performance issue; that's not what I think you should do.

OK, thank you. Thank you.
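The query language mentioned in the answer is presumably `DataFrame.query`, which filters rows with an expression string; that reading is an assumption, since the method is not named in the recording. A minimal sketch on invented data:

```python
import pandas as pd

# Invented sales data for illustration.
df = pd.DataFrame({
    "country": ["US", "IT", "IT", "US"],
    "product": ["widget", "gadget", "widget", "gizmo"],
    "price": [100, 1200, 150, 300],
})

# Expression-string filter, loosely comparable to an SQL WHERE clause.
expensive_it = df.query("country == 'IT' and price > 200")

# The same filter written with Boolean indexing.
same_rows = df[(df["country"] == "IT") & (df["price"] > 200)]

print(expensive_it)
```

As the answer points out, this runs on data already loaded into memory, so it does not replace letting the database server reduce the data first.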