
All You Need is Pandas: Unexpected Success Stories


Formal Metadata

Title
All You Need is Pandas: Unexpected Success Stories
Title of Series
Number of Parts
132
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt, copy, distribute and transmit the work or content, in adapted or unchanged form, for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content, also in adapted form, is shared only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Learning to use the awesome Pandas toolkit helped me immensely in lots of ways. Finding novel, efficient solutions to complex day-to-day problems with Pandas not only saves time, but can be a fun and rewarding experience. In this talk I'll present use cases I had to solve, where the "traditional" approach proved tough and/or otherwise frustrating to implement nicely. Since I was just starting to learn Pandas, I decided to try an alternative solution with it. What I learned changed the way I think about data processing with Python, and it has only gotten better since! The use case deals with extracting pen strokes from handwritten SVG samples and recomposing them into reusable letters and numbers. Needing to compare each stroke to all others, often more than once, resulted in inefficient, slow, and hard-to-maintain code. Even a naive Pandas approach with loops helped to reduce the memory footprint and improve the performance considerably! Improving the implementation further, vectorizing inner loops, and taking advantage of multi-index operations, I managed to get the same results using less memory and a lot faster (by orders of magnitude).
Transcript: English (auto-generated)
Thank you. Hello, everyone. Thanks for coming. My name is Mitter Naidanov and this is my first-ever EuroPython talk. I am quite passionate about Pandas, and I hope by the end of my talk you might want to try it as well.

So let me first tell you a few things about myself. I have been a software developer for over 20 years now. I started back in the day with BASIC and Pascal, went on to C, C++, and C#, did PHP for three years, and then I discovered Python through Django, and Python became my favorite language by far. Since then I've used it for pretty much everything: server-side software, scripting, web apps, mobile apps, and all sorts of other things. I worked for Canonical for four years, on a port of a cloud deployment suite from Python to Go, and after that I decided it was time to go out on my own. So I went full time into freelancing with Python, again happily, and founded my own company.

So, what about Pandas? Seriously, how many of you have used Pandas before? All right. Great. And have you used it for anything other than scientific and statistical software? Okay.
So just a quick introduction for those of you who don't know about it. Pandas is an open-source Python library. It was created in 2008 by Wes McKinney. It has high-performance, easy-to-use data structures and a great API for data analysis, built on the solid foundation of NumPy, and it's also very well documented. I first heard about Pandas at EuroPython 2012, I think, and since then I kept hearing about it from all sorts of people, all the time. So I decided to look into it and see what it's actually all about. I'm not from a scientific or financial background, so that was my first experience with it.

What I liked about it is that it's easy to install. It has very few requirements; especially on Linux it's trivial, but also on Windows and macOS. It's as fast as NumPy yet a lot more flexible, and I personally don't really like NumPy that much, because I found it somewhat counterintuitive and awkward to use. Pandas also reads and writes pretty much any format you might have to deal with, CSV, Excel and HDF5 to name just a few, which was an obvious advantage for me. And also, since I'm quite a visual thinker, I like how easy it is to plot stuff with Pandas, via matplotlib.

So I did try it, but I found some quirks and pain points which kind of put me off, and I want to share a few of them with you. It has good documentation, but at the time there were not a lot of tutorials and hands-on guides, so it was a bit intimidating to read all of that documentation and know where to start. There are also confusingly many ways to do the same thing, at least there were then. Also, there is a lot of indexing, every sort of indexing operation, which is also its power, but I didn't understand it, and it seemed kind of pointless to me, especially the multi-index. And it has sane defaults for most things: it can handle lots of types of data intelligently, however, not as fast as you might like, so you might want to be specific when dealing with specific types of data, like datetimes or floats or integers, and do some conversions in between.

So let me tell you about a project of mine, through which I unexpectedly found out how good a fit Pandas is for some of the tasks I had to deal with. The project is an SVG mail label generator, which means personalized mail: sender labels on the envelope in the sender's handwriting. This is done by following a few requirements. One of them is to acquire a sample of the user's handwriting on a tablet, in a vectorized SVG format. Then we extract individual letter or symbol SVG files, small ones, from each of those sample pages per user. Then, out of those, we compose arbitrary word SVG files and make them look as if they're written by hand. And finally we generate mail labels from those words, sticking them together into multi-line, multi-word labels.

So first, the acquisition of handwriting samples is done on a tablet, with a stylus or pen.
Every user gives one or more of those samples, and they're saved as SVG files. This is an example of one of those. Basically, it's a standardized text that every user writes, and each user writes that sample on several different pages, to have a basis for comparison. Each of those pages is basically SVG: the pen strokes are recorded individually in the SVG file as vectorized curves. And this, for example, is how one of the outputs of that process looks, a mailing label done for one of the users. (The zooming is kind of weird.)

So this is the generalized process. It's a multi-stage pipeline of sorts. It starts with the parsing of the SVG sample page. Then enter Pandas: Pandas is used to read those paths and present them in tabular fashion in a DataFrame, so they can be easily handled. Then there is a letter extraction process, which heavily uses Pandas to extract individual strokes and combine them as they were on the page, so that you can go from single individual strokes to actual letters and then reuse those. Then there is a classification step, which is done manually and basically labels each of those extracted letters as A, B, C, dollar sign, and so on. After we have this, there is the word building stage, where we select letter variants for a specific word, stick them together, and apply some alignment. And finally, there is the labeling stage, which produces labels out of those words and aligns them ready for printing.

So let's look into the parsing first. The problem is how to extract meaningful information from that XML SVG in Python.
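Before reaching for a dedicated library, the raw extraction can be sketched with the standard library alone. This is a minimal, hypothetical example (the SVG content is made up, not from the talk) that pulls the `d` attribute of each `<path>` element, i.e. one string per recorded pen stroke:

```python
import xml.etree.ElementTree as ET

# A tiny hand-made SVG sample: two pen strokes, one line and one cubic curve.
svg = """<svg xmlns="http://www.w3.org/2000/svg">
  <path d="M 0,0 L 10,5" stroke="black"/>
  <path d="M 20,0 C 25,5 30,5 35,0" stroke="black"/>
</svg>"""

root = ET.fromstring(svg)
# SVG elements live in the SVG namespace, so qualify the tag when searching.
ns = {"svg": "http://www.w3.org/2000/svg"}
paths = [p.attrib["d"] for p in root.findall("svg:path", ns)]
print(paths)  # one 'd' string per stroke
```

This gets you the raw path strings, but parsing the `d` syntax itself (move/line/curve commands) is exactly the tedious part that a dedicated library takes off your hands.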
And what I found is this excellent svgpathtools library, which has a lot to offer. It has a Path base class and a few subclasses thereof, like Line, CubicBezier and QuadraticBezier, plus a few other top-level utilities. Each of those classes has a rich API for path intersection, calculating bounding boxes, transformations, scaling, and all sorts of other things; you can cut paths, you can translate them, and so on. It also allows you to easily read and write lists of SVG paths from or to SVG files, applying some scaling and other things along the way, and it takes just a single line. So this is basically an example of how easy it is to get those paths from a file.
The svg2paths function takes a file name and a bunch of optional arguments deciding how and what to convert. It converts everything to those three primitives, Line, CubicBezier and QuadraticBezier; it handles arcs, circles and other shapes, converting them all into those, and it returns a list of Path instances along with a list of dictionaries containing the extra XML attributes of each of the paths.
Once we have this, here is the easiest and simplest way I found. We use pandas.DataFrame.from_records, a class method which takes an iterable, in this case a generator, of dictionary-like objects with the same structure. What I cared about here is the actual index of each Path instance within the file, as well as its bounding box: the minimum and maximum horizontal and vertical coordinates that fully encompass that stroke. And we get a structure that looks kind of like this.
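A rough sketch of that step, using plain tuples in place of real svgpathtools bounding boxes (the column names are illustrative, not the project's actual code):

```python
import pandas as pd

# Hypothetical stand-in for svgpathtools Path objects: each record holds the
# path's index within the file plus its bounding box (xmin, xmax, ymin, ymax).
def bbox_records(bboxes):
    for i, (xmin, xmax, ymin, ymax) in enumerate(bboxes):
        yield {"path": i, "xmin": xmin, "xmax": xmax, "ymin": ymin, "ymax": ymax}

df = pd.DataFrame.from_records(
    bbox_records([(0, 10, 0, 12), (8, 15, 1, 11), (40, 55, 0, 10)]),
    index="path",  # use the path's position in the file as the frame's index
)
print(df)
```

from_records consumes the generator lazily and builds one row per dictionary, so the per-stroke geometry ends up in exactly the tabular shape the rest of the pipeline needs.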
Then on to the letter extraction. The problem is quite computationally intensive if you address it with a naive algorithm: you need to compare each stroke with all nearby strokes which might have something to do with it, and merge them together into letters. What I found is that using simple DataFrame iteration and filtering, albeit over multiple passes, you can do that easily and quite quickly as well.
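A naive single pass of that idea might look like the following sketch. The column names, the fully-contained-bounding-box rule, and the merge step are assumptions for illustration; the project's actual merge procedure is more involved:

```python
import pandas as pd

# Three strokes: 1's bounding box lies fully inside 0's; 2 is far away.
df = pd.DataFrame(
    {"xmin": [0, 2, 40], "xmax": [10, 8, 55],
     "ymin": [0, 1, 0],  "ymax": [12, 9, 10]},
)

merged, unmerged = set(), set(df.index)

for idx in list(df.index):
    if idx in merged:
        continue
    row = df.loc[idx]
    # Filter: candidates whose bounding box is fully inside the current one.
    inside = (
        (df.xmin >= row.xmin) & (df.xmax <= row.xmax)
        & (df.ymin >= row.ymin) & (df.ymax <= row.ymax)
        & (df.index != idx)
    )
    for cand in df.index[inside]:
        # Merge: give both rows the combined bounding box of the pair.
        pair = [idx, cand]
        df.loc[pair, ["xmin", "ymin"]] = df.loc[pair, ["xmin", "ymin"]].min().values
        df.loc[pair, ["xmax", "ymax"]] = df.loc[pair, ["xmax", "ymax"]].max().values
        merged.update(pair)
        unmerged.discard(idx)
        unmerged.discard(cand)

print(sorted(merged), sorted(unmerged))  # strokes 0 and 1 merge; 2 stays apart
```

The point is the shape of each pass: iterate over rows, build one boolean mask to filter candidates, update the frame in place, and track merged/unmerged index sets for the next pass.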
The multiple passes are done by basically taking the DataFrame and returning it modified, along with two sets of indices, one for merged paths and one for yet-unmerged paths, which you can see here; using the DataFrame you can easily extract those. Each of the steps, and I'm going to show one of them, merging the fully overlapping paths, basically looks like this. We iterate over the DataFrame, taking each path in sequence, and then we filter the DataFrame, in this case for all the paths whose bounding box fully overlaps with the current path's. We take this subset of the DataFrame as candidates, then we run a fairly complicated merge procedure, which I won't show because it's like a page and a half, but basically what it does is update the DataFrame so that when you merge two paths they get the same bounding box: it updates the xmin, xmax and so on of both to match the combined bounding box, updates those merged and unmerged sets, and returns the DataFrame.

After each of those steps we run an update step which calculates additional properties for each of the paths, and since Pandas allows this quite easily, you can chain assignments like this: for example, calculating the width and the height of the bounding box, the half-width and half-height (which are used in some of the merge steps), the area (width multiplied by height), and the aspect ratio (width divided by height). And finally we need to sort the values so that they come in natural writing order, top to bottom, left to right.

Once we have this, we have a bunch of smaller files, letter files, which we then need to classify. This is a deliberately manual process, as per the client's requirements. There is an external tool they already used for this sort of thing; no Pandas there, unfortunately. It loads the merged but unclassified letter SVGs, shows them one by one to a human, allows the human to align them in the letter's box against the background, and also allows them to label them: this is a dollar sign, this is a capital A, this is a lowercase L, and so on. Once we have this, we have labelled SVG letter files, letter variants.

Then we come to the word building. This is an example of an intermediate output of the algorithm, a debug version showing the letters, their bounding boxes in green, and the running baseline of the word, which is the line along which all the letters are aligned so it looks like they're written on the same line. It takes a single word as input, for example "testing", and does a selection process for each letter, either sequentially or randomly picking a labelled variant for that letter. Then it does horizontal composition, merging the selected variants with variable kerning, which is the typographical term for the spacing between letters. Then there is a vertical alignment step which aligns certain letters according to the running baseline; for example G, Y and a few others go either below the baseline or above it as needed. And it outputs a single SVG file for that word, at the same size.

The labelling, just to remind you how it looks, takes as input an Excel file with mail addresses. No surprise here, Pandas works great with this. The structure is one row per label, one column per line, as simple to parse as calling pandas.read_excel. The generation stage builds words with variable kerning, taking the spacing per column, and the alignment is done with so-called variable leading; leading is the vertical equivalent of kerning, the spacing between the lines.

And that's it, basically. So I think I should tell you what I learned from this process. Pandas is great for any sort of table-based data processing; that was kind of an unexpected discovery for me. It might be intimidating at first if you haven't used it, there is a lot to read, but if you learn just a few things and start from there, like filtering and iteration, you can go a long way.
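For instance, the chained-assignment step described earlier, deriving bounding-box properties and then sorting into natural writing order, might look roughly like this (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"xmin": [0, 30, 5], "xmax": [10, 45, 12],
                   "ymin": [20, 0, 0], "ymax": [28, 9, 8]})

df = (
    df.assign(width=df.xmax - df.xmin, height=df.ymax - df.ymin)
      # Later assignments can reference columns created just above via lambdas.
      .assign(area=lambda d: d.width * d.height,
              aspect=lambda d: d.width / d.height)
      # Natural writing order: top to bottom first, then left to right.
      .sort_values(["ymin", "xmin"])
)
print(df)
```

Each assign returns a new frame, so the whole update reads as one pipeline instead of a series of scattered column mutations.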
Also, take time to understand indexing and the power of the MultiIndex, because that gives you the ability to deal with multidimensional data in a very comprehensive way.
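A tiny illustration of that (the page/stroke layout here is hypothetical, not the project's actual index):

```python
import pandas as pd

# Strokes indexed by (page, stroke): two index levels instead of one.
idx = pd.MultiIndex.from_tuples(
    [("p1", 0), ("p1", 1), ("p2", 0)], names=["page", "stroke"]
)
df = pd.DataFrame({"xmin": [0, 12, 3], "xmax": [10, 20, 9]}, index=idx)

p1 = df.loc["p1"]                       # all strokes on page p1
counts = df.groupby(level="page").size()  # strokes per page
print(p1)
print(counts)
```

Selecting by the outer level and grouping by a named level are the two operations that make a two-dimensional key feel natural rather than pointless.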
Then also, of course, any time you need to deal with CSV or Excel, which is quite a pain otherwise, with Pandas it's trivial and fast; it doesn't have to be financial data or anything. And the documentation is great. There is a lot to read, so it could be a bit confusing at first, but I would suggest starting with "10 minutes to pandas", which is one of the main sections of the documentation. There are also a lot of tutorials now, a lot of cookbooks and hands-on guides, and it has grown a lot; there was actually recently a documentation sprint for Pandas which expanded it even further.

With that, I have just one more thing to say: please consider buying Wes McKinney's book, Python for Data Analysis, because it's great and it will help you a lot on your journey into Pandas. And I'll be happy to take any questions. Thank you.

Thanks very much. Are there any questions? We've got lots of time.

Sorry, I may be asking a silly question. I know you said all we need is Pandas. Have you met, in your practical use cases, in your practical work, any limitations of Pandas?

Oh yeah. Well, there are quirks that you tend to learn to live with, but tend to overcome as well. For example, dealing with any sort of numerical data that can have gaps in it, or possibly strings mixed in: the gaps turn up as NaNs instead of something else, so if you expect to get integers, you might get floats instead. But yeah, that type conversion is one thing.

Another use case I would like to raise to our community is from my work. In that case, the data input we got is nested JSON, a nested JSON stream. And Pandas, you know, pandas.read_json can only process one level. Yeah. So that makes it very...

I haven't used it personally for JSON. I think Postgres is better for that, if you can afford it, I mean, if you can have it at hand.

My solution was that I had to write my own library to process this into a DataFrame, but that's quite static. So I was always wondering if Pandas could absorb this feature: even though the input is nested JSON, the output is always a Pandas DataFrame. So I was thinking Pandas could absorb the feature: step one, analyse the JSON file to identify the keys; step two, just crunch it and get the DataFrame out. It would be an improvement for Pandas. But Pandas is splendid, I agree. My question was about the limitations of Pandas.

Yeah. So I'm sure you can go a long way using Pandas for some part of that process, you know, reading the nested JSON. And for sure, if you can convert it to something more tabular, you'll get a lot more out of Pandas.

Cool. Are there any other questions? No?
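On the nested-JSON point raised above: recent Pandas versions (1.0 and later) expose pandas.json_normalize, which flattens nested dictionaries into dotted column names and covers part of what that custom library had to do. A minimal sketch with made-up data:

```python
import pandas as pd

# A small nested-JSON record set, similar in shape to the case from the Q&A.
records = [
    {"name": "A", "address": {"city": "Edinburgh", "zip": "EH1"}},
    {"name": "B", "address": {"city": "Rimini", "zip": "47921"}},
]

# json_normalize flattens the nested dicts into columns such as
# "address.city" and "address.zip" alongside the top-level "name".
df = pd.json_normalize(records)
print(df.columns.tolist())
```

For deeper structures, json_normalize also takes record_path and meta arguments to pick which nested list to explode and which parent fields to carry along.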
I hope you try it. All right. Thanks. Thank you.