
How to use pandas the wrong way


Formal Metadata

Title
How to use pandas the wrong way
Number of Parts
160
Author
Pietro Battiston
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
How to use pandas the wrong way [EuroPython 2017 - Talk - 2017-07-12 - Anfiteatro 1] [Rimini, Italy] UPDATE: slides and materials can be found at http://pietrobattiston.it/python:pycon#europython_rimini_july_2017 The pandas library represents a very efficient and convenient tool for data manipulation, but sometimes hides unexpected pitfalls which can arise in various and sometimes unintelligible ways. By briefly referring to some aspects of the implementation, I will review specific situations in which a change of approach can make code based on pandas more robust, or more performant. Some examples: inefficient indexing; multiple dtypes and efficiency; implicit type casting; HDF5 storage overhead; GroupBy.apply()... when you don't actually need it.
Transcript: English (auto-generated)
So, yes, I am a researcher in economics. I don't know if there are any other researchers in economics here; I'd be glad to know.
Pandas is the thing that is helping me and some other people show that we can do research in economics with Python. It's really crucial for me. I'm not a programmer by profession, but I spend all my days coding with pandas, so I've learned some of the pitfalls. Now, the title might seem more philosophical than it is. It seems like I will
go in depth in some very conceptual description of some specific wrong way to use Pandas. It's not the topic. There are many different ways to use Pandas the wrong way and I will show some of them. Disclaimer, as I said, I love Pandas. My daily work wouldn't be in Python otherwise, I couldn't manage.
That said, pandas has bugs, and quite a few of them. You are very welcome to help us fix them. I'm an occasional contributor and I'm more than willing to help people who would like to contribute and learn a bit about the code base, which is
complex, but not too complex: it's big, it could be tidier, but it's not too hard to learn the main concepts. So I'm trying to organize a sprint on Saturday, but I don't know yet of anybody who... I mean, I hope people are interested, but if you are, tell me so that we can organize.
This talk is not about Pandas bugs. It would be too easy. There are many, many, many. It's rather about things that, well, some are borderline, but are mainly design decisions which couldn't have been made very differently. And so it's the user's task to understand these things.
This is not even about wrong design decisions. The fact is that most of you probably know NumPy better than pandas. Now, I think NumPy is very intuitive. It's not trivial to use, but it's intuitive. Now, pandas is in principle an extension of NumPy, which sometimes seems
more intuitive than it is, simply because it's complex. It does complex operations. Okay, let's start with some examples. So I will start with, this is an intermediate talk, so most of you will be bored by maybe the first five minutes.
Still, it's good to start from the basic mistakes one can make in pandas. For example, let's have a series of 10,000 elements and the same thing as a list. Now, pandas is based on NumPy, and NumPy is good for managing large amounts of data; well, depending on what you mean by large, but at least some thousands of elements.
And so, one would think it's good to store those 10,000 elements in a Series, that is, a pandas object; well, for the moment, nothing but 10,000 elements. Now, let's see some timings. For instance, for retrieving by positional index.
So, there is clearly no competition. So, the last one is trivial for most of you, but Pandas is good if it allows you to avoid Python loops. If you use Python loops, the single operation is much more complex because you have several layers of indirection, as we will see.
And so, if you are not able to use Pandas to make your work better with reasonable amounts of data, then don't use it. We can see another example. Again, the list is more than 100 times faster than Pandas. For comparison, this is a problem which is also present in NumPy.
Even in NumPy, you have some overhead, which, however, can be compensated by the size of the data. It's there, but it's much smaller. So, for instance, we can compare these 7 microseconds with the 149 of Pandas, or the 1 millisecond with the 109 of Pandas.
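A minimal sketch of the kind of timing comparison being run here (the container sizes follow the talk, but the code itself is an illustrative reconstruction, not the speaker's notebook):

    import numpy as np
    import pandas as pd

    s = pd.Series(np.arange(10000))   # the data as a pandas Series
    l = list(range(10000))            # the same data as a plain list
    a = np.arange(10000)              # and as a bare NumPy array

    # In IPython / Jupyter:
    # %timeit l[5000]       # plain list: essentially free
    # %timeit a[5000]       # NumPy: a small constant overhead
    # %timeit s.iloc[5000]  # pandas: several layers of indirection on top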
Pandas could probably do better than this, but not to the point of NumPy: there is more structure under the hood. Okay, let's get to something slightly less trivial. Pandas, a bit in the Python spirit, allows you to do a lot of things that you really don't want to do, or that you want to avoid as much as possible.
And one of these is duplicated indexes. I mean, for some people, duplicated indexes are already a heresy. When I talk to some people using R and similar structures in R, they don't expect this. Now, Pandas allows you to do this. We build a data frame, which will have a
normal index from 0 to 99, and we repeat it, concatenate, and remove the first 50 lines. So it's 150 lines, we can take a look at its structure, it's basically this.
And then it repeats itself. The index is not unique, and it's not even sorted. I didn't put this in, but it's good to see: is_monotonic is False. It's an ugly index, really, you don't want to play with this. And why don't you want to play with this, even though it's possible?
Now, let's start with an example. If I take .loc[0], I get a Series. Why? Because 0 appears only once. And so I can do this. And if I repeat the above, I see that the assignment worked fine.
I set the first line, so we can look at the head. Sorry, it's not the head. Okay, it worked fine. Now, it might be time to do this. And it's going to fail. Why? Because 99 is repeated. It's actually two lines, because the index is not unique. And this was a trivial case, but
it becomes very messy when you have an index where some elements are repeated and some are not, and you don't know how many times they are repeated. So, the lesson is: you just want to avoid duplicated indexes. For instance, if we take the same data frame and we reset the index, now it's going to have a nice unique index.
It's also sorted. And this is the right way to work, basically. .loc[0] now yields a single row. And we can also see that this is slightly more efficient than the previous version.
In some cases, it's more than slightly more efficient. Well, I didn't tell anything about this. You probably know. I'm avoiding repetitions at the expense of precision, because Pandas has a lot of caching involved.
So, I don't want to show you cached results. And this brings the risk of unexpected results. But this is more expected. So, trust me, usually this is what you see. That is, the unsorted and non-unique index is slower.
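A sketch of the duplicated-index frame just described and of the reset_index fix (the column contents are mine):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'x': np.arange(100)})        # index 0..99
    dup = pd.concat([df, df]).iloc[50:]             # 150 rows, duplicated labels

    dup.index.is_unique                  # False
    dup.index.is_monotonic_increasing    # False

    dup.loc[0]      # a Series: the label 0 appears only once
    dup.loc[99]     # a DataFrame: the label 99 appears twice

    clean = dup.reset_index(drop=True)   # unique, sorted index again
    clean.loc[0]    # always a single row, and lookups are faster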
Now, talking about indexes, Pandas allows you to do one thing which is not at all trivial. We remember that df had elements indexed until 99. Now, what if I do this? Well, it's just added to the bottom. So, this is different from what we used to, for instance, in NumPy.
If I have a NumPy array and I say, well, it goes until 4. If I say add a fifth line, it's going to say no. No way. There is no position of this kind. In fact, Pandas, in the label-based indexing mode, allows you to add a line without protesting.
Now, this is not necessarily a good thing to do. Let's consider this. We are adding 1000 elements to an empty series and it takes 400 milliseconds. Now, what if we had given the index in the beginning? So, it's exactly the same, but we're saying since the beginning we want exactly those elements.
It's way faster. So, in general, you want to be very cautious when you add elements which are not already in the index. Well, the reason is pretty simple. It's based on NumPy. There are contiguous structures.
If you add an element which is not there, it will have to change the location in memory of the whole array. So, it's actually asymptotically bad too, not just here.
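A sketch of the two ways of filling a Series mentioned above (sizes and dtype are my own choices):

    import pandas as pd

    # Enlarging label by label: each new label may force a reallocation
    s = pd.Series(dtype=float)
    for i in range(1000):
        s.loc[i] = i

    # Declaring the index up front: the labels already exist, much faster
    s2 = pd.Series(index=range(1000), dtype=float)
    for i in range(1000):
        s2.loc[i] = i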
Okay, and we can also see this by comparing the same statement twice. It's going to be slow the first time, because it's a label it didn't know, and it's going to be faster the second time. If you don't believe me, well, this time I got lucky with the timing. Okay, this is a more standard problem, but some of the consequences are not so obvious. I mean, at least I've been bitten by them sometimes.
So, let's create a stupid data frame here. This is a stupid data frame. No, nothing inside. We don't care at the moment. This is possible because clearly this is possible, right?
I can instantiate one line and then get an element of the line. Now, I can also do this, but what is actually happening? Well, here I'm lucky. I'm actually setting an element in the third row, fourth column. That is third row, fourth position of that row. What happens if I do something slightly more sophisticated, so I use slicing?
Well, it's going to warn me: "A value is trying to be set on a copy of a slice from a DataFrame." Why is it warning me? Well, basically because nothing is happening. So, it's a warning that is telling me: you think you are setting something, but you are not.
And why? Because when you use an indexer, unless you know the code base very well, you never know whether a copy has been made or it's a view of the previous data. Now, this warning is standard. You probably saw it several times if you work with pandas.
And so, before we say "well, I feel safe", there are two problems with this. First, the logic for deciding when to give this warning is very complicated, so don't rely too much on it. But a more subtle problem is that I often work on another data frame, which is derived from a previous one.
And when I work on that data frame, I might experiment and not think that I'm modifying the original one. This is nothing specific to pandas. Admittedly, it's a general problem when you work on objects without copying them. And so, you might modify the original data frame without noticing. So, in general, what you want to do instead is
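something along these lines; the exact slide is not in the transcript, so what follows is just a minimal sketch with an illustrative frame of my own:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.zeros((6, 6)))

    # Chained indexing: ambiguous, and typically triggers SettingWithCopyWarning
    df.loc[2:3][4] = 1

    # A single .loc call addressing both axes is an unambiguous assignment
    df.loc[2:3, 4] = 1

    # To work on a derived, smaller frame, take an explicit copy first,
    # so that experimenting on it cannot silently touch the original df
    sub = df.loc[2:3].copy()
    sub[5] = 99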
And that's the right way to go if you want to work on this smaller data frame. Okay. One tends to think that the main addition of pandas to NumPy
is represented by indexes. This is true, clearly, but it's maybe not the most complicated addition. Arguably, the most complicated addition to NumPy is represented by the possibility to manage different types in the same data frame object.
It's complicated not just for developers, but sometimes possibly for users. So, let's create a stupid data frame with a given index and no columns. And let's create a column, which is basically a copy of the index, because we are saying that column A is exactly the index.
And then we do the same thing, element by element. So, for each element in the index, you go... Sorry, this is wrong, this is times two, so it's the same. And then you do the same thing here, element by element. So, for each element, you take the element, multiply by two, and put in the column.
Now, certainly the second is less efficient, we all know, and if you didn't know, you saw my first notebook. But there's something worse than this. This is expected, right? We are populating a column with elements of the index. I didn't show you the index, but I can do it now. The index is just 0, 1, 2, 3, etc., because it's a standard index.
I passed no index. What happens here? It's slow, but it's not just slow, it's a float. And why is that? It's not the fault of Pandas. It's probably not even the fault of NumPy, although some discussion is ongoing. The fact is that integers have no knowledge of missing values.
So, if you create a column which is empty, it's going to be filled with missing values. So, for instance, df is this; I could set one element of a new column b equal to 1.
You would expect an integer column? No, it's a float, because I didn't tell it how to fill all the other rows. And this can be annoying. What is the solution? Well, it's very simple. You take some value which usually you use only to denote a missing value,
and you first instantiate, and then do whatever you want. Why? Why am I saying this? Clearly you don't want to have this loop, but you might have to have Python loops for things that call external functions or whatever. So, in those cases, if you have to work on integers and populate the data frame bit by bit,
just instantiate to some unused integer value before, and you solve your problem. Yeah, this is what I created before, but this is the one I'm showing you. Okay. Now, since we have a data frame, I would like to say one thing that I forgot to introduce. Well, no, I'll say it later. Sorry.
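A minimal sketch of the NaN-casting problem and the sentinel workaround just described (column names and values are mine):

    import pandas as pd

    df = pd.DataFrame(index=range(5))

    # Setting a single cell of a new column leaves the other rows as NaN,
    # and NaN forces the whole column to float
    df.loc[0, 'b'] = 1
    df['b'].dtype               # float64

    # Pre-fill with an unused integer sentinel, then populate bit by bit
    df['c'] = -1                # int64 column
    for i in df.index:
        df.loc[i, 'c'] = i * 2  # stays integer, no casting to float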
Okay, again about data types, let's take a look at some not-totally-expected castings. So let's create a data frame with two rows and 1,000 columns. Okay, so this is its shape, and as I told you, I want to be sure I'm working with integers, no missing values,
so I just instantiate to minus one, which is my marker for a missing value. So everything looks fine. What are the types? This is just the first four columns, but they are all integers, right? Now, maybe this is trivial for most of you, but this can be a good time to tell something about the internals.
So what you see of a data frame is basically this: you have the data, you have the columns, you have the index. What happens inside, to allow you to use multiple data types, is this. Each column has a data type, and underneath you have some NumPy arrays,
true NumPy arrays, although wrapped in some abstraction, that store the columns of the same type, okay? So for instance, here we have a blue NumPy array which is storing the columns which are int64, then another array which is storing the float columns, and some NumPy array which is storing the object columns, for example.
So data types are a characteristic of single columns. So here I'm asking for the data types and I'm getting one data type for each column. They are all integer, fine. Now let's create a stupid method, like one telling me if a number is even or not.
And let's use this method to do some operation on this data. Basically I'm saying for each column, okay, you take the first element, so top row, you take the content and you add one if it's even, zero otherwise.
And then I do the same thing but without adding, only setting whether the value is even or not. This takes 200 milliseconds, this takes three times more. So this operation is taking one third of the time of that operation.
What is happening? And remember there are no missing values here, it's all about the integers. Now what is happening is the following. If I take the first method, I'm adding values, so I'm adding a boolean to an integer,
and it's automatically cast by Python, not by pandas, to an integer. Here, instead, it stays a boolean. And since it's a boolean, what pandas does is record it as a boolean. But I told you that types are a characteristic of columns,
so if this happens, it means that all those columns are now of type object. So it means two bad things. The first is that we are working with objects while we had booleans and ints, so it's less efficient. And the second is that it had to recast all of these columns to the new data type.
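A sketch of the two row-filling loops just compared (the frame shape follows the talk; exact timings vary, and newer pandas versions may handle the boolean assignment differently):

    import pandas as pd

    df = pd.DataFrame(-1, index=range(2), columns=range(1000))   # all int64

    def is_even(n):
        return n % 2 == 0

    # Adding the boolean to the existing integer: Python casts it to an int,
    # so every column keeps its integer dtype
    for col in df.columns:
        df.loc[0, col] = df.loc[0, col] + is_even(col)

    # Storing the bare boolean: each touched column has to be recast
    # (to object, in the pandas of the talk), which is why this loop
    # was roughly three times slower in the demonstration
    for col in df.columns:
        df.loc[1, col] = is_even(col)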
So the lesson is: pandas is great at having multiple data types in a single data frame, but this happens only across columns; do not try, or avoid as much as possible, mixing types along rows. And talking about columns, let's use another stupid data frame, which is similar to the previous one, but is currently empty.
And let's fill it bit by bit, okay? So for each column, we are... just give me a second, just let me close a bit of stuff to be sure I don't run out of memory; better to waste five seconds now than later with memory exhausted.
So what I'm saying, what I'm doing, is: I take each column and I populate it with two integers, which are the index and minus the index, pretty trivial. Okay, and this is the result.
Okay, nothing unexpected, how much time does it take? A bit too much. Now recall, there is no typecasting here, okay?
There are just integers, I'm setting them as integers. It's true, I'm adding a column by column, so three seconds. What if I had initialized the columns immediately? So exactly the same operation, but I start with a dataset which is already initialized.
So what do you expect compared to the previous one? Well, now it's too easy to answer. Okay, it's not faster.
And why is it not faster? Well, let's make another attempt. Now let's add the columns and initialize them to minus one. And let's fill this one, it's way faster. Another hint, let's initialize to a float, way slower.
So what is happening? I told you that in a data frame, columns of the same type are regrouped together in single NumPy arrays. And this makes a lot of sense, for instance, because when you have a data frame holding only data of the same type, you want operations to be as efficient
or almost as efficient asymptotically as in NumPy. But now let's go step by step and see what's happening. If I initialize the same data frame, I'm actually storing data in a single block; a block is a NumPy array, or some abstraction of it, with the shape I would expect.
Now what happens when I set the first column to integers, right? Because it was a float data frame. When I set it to integers, I get what I told you: there is now one 9999-by-2 block and a smaller block holding the integers.
What happens when I add another column, the second? This is not what I told you. Actually, it's storing the two integer columns into separate blocks, so separate NumPy arrays. And why is it doing so?
Because otherwise the operations we just executed would be way slower: every time, it would have not just to create a new block for the new column, it would have to re-merge into a single NumPy array all the columns of the same type. And this is an expensive operation, because it has to recopy everything in memory. And so on, with another block added each time.
And indeed, you can check with this method, which tells you whether the blocks are, as I told you, a single block for each data type. If we run almost any operation, for instance taking a stupid max, then we see that the blocks are restructured, or technically, consolidated.
And yeah, now they are consolidated. So, this is important to know in some cases, because the consequences can be really unexpected. If we compare these two functions, so each function is doing what we did before, that is adding a column,
and then taking the max. The other is doing the same but in two separate loops. So technically, the second is a more demanding operation, because each max is run on the whole data frame, while the first runs it only on the columns inserted so far.
Let's compare them. So, the operation which is in principle quicker is taking more time. And why is that? Because every time, in every cycle of the loop,
this data frame has been reconsolidated. So, long story short, if you have a loop and you have to work on the columns in this way, typically adding them or changing their type, then do it all at once and do not interleave other operations, because pandas is smart enough to try to
re-consolidate only when it has to; but when it has to, it's an expensive operation. If we had started with integers, we wouldn't have had this problem. Or, actually, it's there, but it's much less important. Okay.
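A sketch of the two functions being compared (function names and sizes are mine):

    import pandas as pd

    def add_and_max_interleaved(n=1000):
        df = pd.DataFrame(index=range(2))
        for i in range(n):
            df[i] = [i, -i]   # each new column lands in its own internal block
            df.max()          # ...and this forces a consolidation every time
        return df

    def add_then_max(n=1000):
        df = pd.DataFrame(index=range(2))
        for i in range(n):
            df[i] = [i, -i]   # blocks pile up, nothing consolidates them yet
        df.max()              # a single consolidation at the end
        return df

    # %timeit add_and_max_interleaved()   # much slower, despite each max
    # %timeit add_then_max()              # seeing fewer columns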
Again, on data types, also from another point of view, let's create an ugly data frame. Well, it's ugly, but not too much. It's got different data types, but they are all nicely ordered on different columns. So, this is the right way to work, and as an economist, I like Pandas, because it allows me, for instance, to put name of country and ID of firm
in the same data structure. So, the data types are what you would expect. Maybe some of you would not expect object, but pandas does not have a string dtype, and so strings are stored as objects. What happens if I ask the mean of this strange beast?
Well, it's fairly smart. It's saying: column zero is integer, its mean is one; column one is float, its mean is 1.5; column three is 17; column two is not numeric, so I'm not going to try to take a mean out of it, because it makes no sense. Great.
What will happen if I do this? So, I'm saying, okay, take the mean, but across the other axis. Again, it's pretty smart. It's doing the same identical thing, so excluding strings, but on the other axis. Good. Now, axis equal to one intuitively means,
run this operation on the transposed version of this matrix, which here is a data frame. So we should expect these two objects, the mean along axis one and the mean of the transpose, to be the same. And instead, we get a value error.
Why is that? Well, precisely because axis equal one is smart. Smarter than transposing. When you transpose, what is actually happening is that all the dtypes which were nicely ordered by column are now ordered by row. That is, each column of the transpose has different dtypes. But if a column has different dtypes,
its only possible dtype is object. And if it's object, mean is not going to try to get an average out of it. And so, you get an empty series: that is, a series over only the columns with numeric dtype, of which there are none, because all columns are of object type now.
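A sketch of the asymmetry just described (the values are mine; on recent pandas the plain mean() calls may also need numeric_only=True, and the transposed mean may raise instead of returning an empty result):

    import pandas as pd

    df = pd.DataFrame({0: [1, 1],
                       1: [1.0, 2.0],
                       2: ['x', 'y'],
                       3: [17, 17]})

    df.dtypes          # int64, float64, object, int64
    df.mean()          # per-column means, the string column is skipped
    df.mean(axis=1)    # per-row means, strings still skipped: axis=1 is smart

    # Not equivalent: after transposing, every column mixes dtypes, so every
    # column becomes object and nothing is considered numeric any more
    df.T.mean()        # empty result, or an error, depending on the version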
Okay. Whoever used Pandas for more than ten lines of code probably had some group by operations, because Pandas is really good for this. It's pretty efficient compared to other alternatives. Still, and probably you use, as I mostly used,
when I'm lazy: apply. Apply is a very powerful function. Now, pandas is efficient for group by operations, but this does not necessarily mean that apply is efficient. And why is that? Well, now let's take a somewhat more real-life version of those stupid data frames,
of those stupid data frame, which contains a date, a ticker, a bid and an ask price. And we ask the mean of this. It's working, and let's just check what it's producing, for reference.
Sorry, I shouldn't have done this. For reference. Well, it's just taking the mean of bid and ask, because the rest is not numeric. Now, this took 91 milliseconds.
Is this the best we can do? Not at all, by far. And what is the problem here?
It's that group by operations can have very different characteristics. You can have what is called an aggregation in pandas, that is, you have several groups, and for each group you want to synthesize the information into a series, or you can have, for instance, transform group by operations,
which means that each group does some operation involving the group structure, but then returns a value for each value passed. They're pretty different. For instance, in a transform, I could multiply each element by two. It's very stupid to do, but it's an example of an item-wise transform. Apply is very smart.
It's trying to understand what you want to do, and it understands this by looking at what you return from the passed function. It's smart, but the smartness has a cost. And this means that you want to use it only if aggregate or transform is not doing what you want. And I say this because apply, for instance,
on Stack Overflow is very often suggested and used. It's very powerful, but it's not particularly efficient, which is sad given how efficient the group by object itself is.
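A sketch of the bid/ask example with apply versus a plain aggregation (the sizes and values here are mine; on recent pandas you may need explicit column selection or numeric_only=True, as below):

    import numpy as np
    import pandas as pd

    n = 100000
    df = pd.DataFrame({
        'date':   pd.date_range('2017-01-01', periods=n, freq='min'),
        'ticker': np.random.choice(['AAA', 'BBB', 'CCC'], n),
        'bid':    np.random.rand(n),
        'ask':    np.random.rand(n) + 0.01,
    })

    g = df.groupby('ticker')

    # apply() has to call the function and inspect what it returns in order
    # to guess whether you meant an aggregation or a transform
    g.apply(lambda x: x[['bid', 'ask']].mean())

    # If you just want one value per group and column, say so:
    # the dedicated aggregation path is much faster
    g[['bid', 'ask']].mean()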
Now, something about multi-indexes, which is one of the things I love most in pandas. Let me just close some stuff. Okay. Now, let's create a data frame which is, again, very stupid, but has something new: a multi-index on the index. Sorry, let me skip this, because I have 10 minutes and more interesting stuff. Pandas is great also because it's very coherent.
That is, you have a data frame; you access a row, and you get a series; you access a column, and you get a series. This said, don't be fanatic. If we have a data frame with, for instance, the string "a,b,c" in each cell of its only column, and we want to access a, b and c separately,
something which is a very stupid operation, but I often end up having to do. We can do this, which is for each element in the column, you apply an operation which takes x and gives me back a series with pieces of x.
Very straightforward, very elegant. Since the data frame is conceptually made of series, I'm just producing the series and putting them in a data frame. Very elegant, but not necessarily the best way to operate, actually. Because if we compare this, that is the line of code I just showed you, with the following, in which I'm just returning
the result of x.split, the difference is huge. And the difference is huge because we are wasting a Series to hold three elements. A Series is a complicated object, which is worth using either if I have a lot of data, or if I need complicated indexing, or both,
not for three elements, which could just be in a list. And so in this case, it's really a waste of computing power to instantiate all those Series just to fill the data frame.
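A sketch of the two approaches (the column contents and separator are mine):

    import pandas as pd

    df = pd.DataFrame({'raw': ['a,b,c'] * 10000})

    # Elegant but slow: a full Series object is built for every single row
    wide_slow = df['raw'].apply(lambda x: pd.Series(x.split(',')))

    # Much faster: return plain lists from apply and build one frame at the end
    wide_fast = pd.DataFrame(df['raw'].apply(lambda x: x.split(',')).tolist(),
                             index=df.index)

    # (df['raw'].str.split(',', expand=True) is another vectorised option)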
Finally, this is not strictly speaking about pandas, but it's about HDF, which is probably the most used, or at least, I think, the most accessible format to store objects on disk with pandas, because it holds anything: it keeps the index structure, the types, et cetera. That said, even here, you don't have to be fanatic.
Okay, no problem. We are just creating 1000 series, the first with one element, the second with two, et cetera, et cetera. Very stupid thing. And we are storing them in a folder. And then we do the same thing, storing them in an HDF.
You already see it's much slower. I wouldn't have expected it to be so much slower, but let's keep hoping. And then, what do you guess? Is it going to use less space on disk?
This is 12 megabytes, this is 5.8. So in general, HDF is great, because it's good at storing big amounts of data and accessing, for instance, single rows. But don't think that it's the best way to store small objects, because it actually has a lot of overhead
that can actually use way more space than your data.
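A sketch of the comparison (file names and layout are mine; pandas' HDFStore requires the PyTables package):

    import os
    import pandas as pd

    series = [pd.Series(range(i + 1)) for i in range(1000)]   # 1, 2, ... elements

    # One small pickle file per object: quick, with little overhead on disk
    os.makedirs('many_pickles', exist_ok=True)
    for i, s in enumerate(series):
        s.to_pickle('many_pickles/%d.pkl' % i)

    # The same objects as separate keys of a single HDF store: slower to write,
    # and every key carries fixed overhead, so the file ends up larger
    with pd.HDFStore('many_series.h5') as store:
        for i, s in enumerate(series):
            store['s%d' % i] = s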
If I have five more minutes, I'd like to show just one additional thing on multi-indexes, which can be unexpected. And it's the following. Now, this is a data frame, as I said, with a multi-index on the index. Now, it's cool. What is the problem with this? If I, let's remain generic, if I ask, for instance, for,
let's change this a bit and put four, five, six. Okay, so they are all numbers. What is this going to say? Going to return? 07.
Well, you probably perfectly understand the ambiguity. I'm asking for 1, 4. Now, this could be interpreted in two ways. I could say: okay, I have a multi-index, and I look for (1, 4) in the multi-index; or: I have an object with two dimensions,
and I look at one on the first dimension, four on the second, because Pandas allows you to do partial indexing. This works, and it maybe works as most of you expect, but don't try too hard to guess. In general, you want to be more explicit.
That is, either you write (1, 4) in a way which says exactly: in the multi-index, which I have on the index dimension, take (1, 4); or you use 1 and 4 separately,
which is the opposite operation. Talking about this, here I used a tuple. Now, in Python, most of the time, we are used to the fact that a tuple and a list are actually the same thing, except for implementation details: a list can be modified, but it's less efficient.
In Pandas, it's not true, and it's not true for good reasons, and that is that this, you see it's different. Incidentally, it's very similar, but conceptually, it's different,
and it's wrong to use this to access, to do this, basically. The difference is that a tuple means parts of the same key, so this means: I want to access that key, or better, I want to access anything starting with 1.
A list, instead, means: I want to access 1, which is partial indexing, and possibly other stuff, for instance 2. This is going to fail. In some cases, you can obtain what you expected,
even using the wrong type, but it's a simple association, and it's good to stick to it. Tuples are parts of keys. Lists are lists of keys, or parts of keys. And with this, I think I'm done.
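A sketch summarising the tuple-versus-list distinction, using the 1, 2, 3 / 4, 5, 6 levels from the example (the column is mine, and the exact behaviour of the failing case varies across pandas versions):

    import numpy as np
    import pandas as pd

    idx = pd.MultiIndex.from_product([[1, 2, 3], [4, 5, 6]])
    df = pd.DataFrame({'x': np.arange(9)}, index=idx)

    df.loc[1, 4]       # works here, but relies on pandas guessing what you mean
    df.loc[(1, 4), :]  # a tuple is parts of one key: exactly the row (1, 4)
    df.loc[[1, 4]]     # a list is a list of keys: partial indexing on the first
                       # level; 4 is not a first-level label, so this does not
                       # do what the tuple does (and raises on recent pandas)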
Thank you very much for your talk, Pietro. We have time for a couple of questions.
Just a comment on the earlier stuff about mixing dtypes when you're kind of looping over arrays. It seems to me that, you know, they're kind of all in this kind of topic of,
like, you shouldn't loop over a data frame, like, if you can avoid it. Sorry, so with the earlier stuff of these examples that you built by looping over data frames and inserting things in different ways, causing these problems relating to mixed data types and mixed...
Oh, sorry. The earlier stuff about, like, you constructed these examples where you... which were inefficient because you were kind of mixing data types and columns and deconsolidating the underlying arrays and so on.
It seems to me that, like, a lot of that can be avoided by the general principle that you shouldn't really be looping over a data frame in this way at all. Do you agree with that? Could you please speak into the microphone?
So the question was, in these examples I've shown, the lesson could also... For most of the examples I've shown, not just the first notebook, the lesson could be don't loop over the data. Now, the general lesson is,
don't loop over the data if you can avoid it. Then it all depends on what you have to do. There are functions I might need to call from external libraries. Or maybe I'm looping over the data, but only by groups, so I'm doing the best I can to vectorize, but I need to work on each group individually.
Certainly, this is always a good lesson in NumPy and in Pandas. Try to vectorize as much as possible, and Pandas is, I think, really good in the extent to which it offers, for instance, group by operations which are ready to use and vectorized. Still, okay, many examples I gave are stupid, no doubt,
but there are some cases which... And, by the way, the casting problems are not necessarily related to looping. They can come out in other times even. For instance, if I set all the values I know in a series, but some are missing, and they were integer,
they will still become float even if there is no loop involved.
I'm not the most expert on this. I'm very happy with Pandas integration with matplotlib, not because it's perfect, but because it's very handy and allows you to refine the result using matplotlib itself.
But I know many people now are talking about, how is it called, Bokeh for the web, which I should try; I'm just ignorant, so I cannot judge. The integration with matplotlib is very nice, and, I mean, as a researcher, the web interface is less interesting to me, but again, I'm not an expert.
This is just more of a comment, really. You as an economist, how do you find searching for micro-optimisations when using Python,
because the basic tutorials I have seen and the documentation really just provide you a basic explanation of the methods; but you, as a non-developer, how do you get around that? So the question was on micro-optimisations?
How do you search for this kind of stuff, you as a non-developer? I'm not sure I get the question. Well, how did you find out about the micro-optimisations you showed us here today?
Did you search through the source code? How did you get around that? Well, my honest answer is probably, I discovered Jupyter Notebooks, which allowed me to try five things and do the one working best. I was helped a bit by just a quick look at the code of Pandas,
which is very complex, but in some sense the main concepts are pretty clear. By the way, as an economist I still miss some things in the Python ecosystem in terms of methods, estimation methods, but in terms of manipulation,
I'm pretty sure I have a better life than all my colleagues using other software, despite some corners we still have to smooth.
Is there any source of more advanced information on Pandas I'm aware of? No, not organised. The docs are reasonably well organised, but not complete. I mean, they are not terrible either,
but they would benefit from many more examples, and if you come to the sprint we can work on that also. And then there are many people reporting their experiences, maybe in single Jupyter Notebooks, but specifically in the economics field very few, although it's growing quickly.
Let me just add one thing. I talked about estimation methods, so let me add that when Python is not enough, good integration between Python and R helps a lot; rpy2 is a library of choice for this. Let's thank Pietro for his deep insights into pandas.