All You Need is Pandas: Unexpected Success Stories
Formal Metadata
Title: All You Need is Pandas: Unexpected Success Stories
Number of Parts: 132
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/44968 (DOI)
Language: English
Transcript: English(auto-generated)
00:03
Thank you. Hello, everyone. Thanks for coming. My name is Dimiter Naydenov and this is my first ever EuroPython talk. So I am quite passionate about pandas and I hope by the
00:23
end of my talk you might want to try it as well. So let me first tell you a few things about myself. I have been a software developer for over 20 years now. I started back in the day with BASIC and Pascal, went on to C, C++ and C#, did PHP for three years
00:47
and then I discovered Python through Django, and Python became my favorite language by far. Since then I've used it for pretty much everything: server-side software, scripting, web apps, mobile apps and all sorts of other things. So I was working for Canonical for
01:11
four years, where I was working on a port of a cloud deployment suite from Python to Go, and after that I decided it was time to get out on my own. So I went full time into
01:27
freelancing with Python, again, happily, and founded my own company. So, what about pandas? Seriously, how many of you have used pandas before? All right. Great. So
01:52
have you used it for anything else than scientific and statistical software? Okay.
02:01
So just a quick introduction for those of you who don't know about it. Pandas is an open source Python library. It was created in 2008 by Wes McKinney. It has high-performance, easy-to-use data structures and a great API for data analysis, built on the solid
02:24
foundation of NumPy, and it's also very well documented, in a way. So I first heard about pandas at EuroPython 2012, I think. And since then I kept hearing about it from
02:44
all sorts of people all the time. And I decided to look into it and see what it's actually all about. I'm not from a scientific or financial background, so that was my first experience with it. Basically, what I liked about it is that it's easy to install. It has very
03:06
few requirements. Especially on Linux it's trivial, but also on Windows and macOS. It's as fast as NumPy yet a lot more flexible. And I personally don't really like NumPy that much, because I found it somewhat counterintuitive and awkward to use. Pandas
03:29
also reads and writes pretty much any format you might have to deal with, especially CSV, Excel and HDF5, to name just a few. Which was an obvious advantage
03:42
for me. And also, since I'm quite a visual thinker, I like how easy it is to plot stuff with pandas, via matplotlib. So I did try it. But I found some quirks and pain points
04:02
which kind of put me off, and I want to share a few of them with you. So it has good documentation, but at the time there were not a lot of tutorials and hands-on guides, you know? It was a bit intimidating to read all of that documentation and know
04:22
where to start from. There are also confusingly many ways to do the same thing, at least there were back then. Also there is a lot of indexing, every sort of indexing operation, which is also its power, but I didn't understand it, and it kind
04:45
of seemed pointless to me, especially the multi-index. It also has sane defaults for most things and can handle lots of types of data intelligently, however not as fast
05:00
as you might like. So you might want to be specific when you deal with specific types of data, like datetimes or floats or integers, and do some conversions in between. So let me tell you about a project of mine, where I kind of found
05:23
unexpectedly how good a fit pandas is for some of the tasks I had to deal with. So the project is an SVG mail label generator, which means personalized mail with the sender's label
05:41
on the envelope, in the sender's handwriting. And this is done by following a few requirements. One of them is to acquire a sample of the user's handwriting on a tablet. And it's
06:00
acquired in a vectorized SVG format. Then extract individual letter or symbol SVG files, small ones, from each of those sample pages per user. Then out of those compose arbitrary word SVG files. And make them look as if they're written by hand. And finally
06:26
generate mail labels from those words, sticking them together into multi-line, multi-word labels. So first, the acquisition of handwriting samples is done on a tablet, with a stylus or pen.
06:47
Every user gives one or more of those samples. And they're saved as SVG files. And this is an example of one of those. So basically it's a standardized text that every user
07:03
decides what to write. And they write that sample on several different pages, so as to have a basis for comparison, basically. And each of those things is basically an SVG. The pen strokes
07:21
are recorded individually in the SVG file as vectorized curves. And this, for example, is how one of the outputs of that process looks: a mailing label done for one of the users. So the zooming is kind of weird. So this is the generalized
07:49
process. It's a multi-stage pipeline of sorts. So it first starts with the parsing of the SVG sample page. Then enter Pandas. Pandas is used to read those and present them
08:07
in a tabular fashion in a data frame. So they can be easily handled. Then there is a letter extraction process, which heavily uses Pandas to extract individual strokes and combine them as they were on the page. So that you can come from single individual
08:28
strokes to actual letters and then reuse those. Then there is a classification step, which is done manually and basically labels each of those extracted letters as A, B, C, dollar
08:44
sign and so on. After we have this, there is the word building stage where we select letter variants for a specific word, stick them together, apply some alignment and so on. And finally, there is the labeling stage, which is producing labels out of those
09:06
words and aligning them ready for printing. So let's look into the parsing first. The problem is how to extract meaningful information from that XML SVG in Python.
09:23
And what I found is this excellent svgpathtools library, which has a lot to offer. It has a Path base class, segment classes like Line, CubicBezier and QuadraticBezier, and a few other top-level utilities. Each of those classes has rich
09:46
APIs for path intersection, calculating bounding boxes, transformations, scaling and all sorts of other things. You can cut paths, you can translate them and so on. And also it allows you to easily read and write lists of SVG paths into or from SVG
10:07
files and also apply some scaling and other things. And it just takes a single line. So this is basically an example of how easy it is to get those paths from a file.
10:26
And this svg2paths function takes a file name and a bunch of optional arguments deciding how to convert and what to convert. So it converts everything to those three primitives,
10:40
Line, CubicBezier and QuadraticBezier. It handles arcs, circles and other things, converting them all into those, and returns a list of Path instances and a list of dictionaries which contain the extra XML attributes of each of the paths.
11:00
So, once we have this, this is the easiest and simplest way I found. We use pandas DataFrame.from_records, a class method which takes an iterable, or in this case a generator, of dictionary-like objects with the same structure. And in this case
11:24
what I cared about is the actual index of that path instance within the file, as well as its bounding box: the minimum and maximum horizontal and vertical coordinates that fully encompass that stroke. And we get a structure that looks kind of like this.
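A sketch of that DataFrame.from_records step, assuming hypothetical column names (path_index, xmin, and so on) for the bounding-box records; the real project's field names may differ:

```python
import pandas as pd

def stroke_records(bboxes):
    # Yield one dict per path: its index in the file plus its bounding box.
    for i, (xmin, xmax, ymin, ymax) in enumerate(bboxes):
        yield {"path_index": i, "xmin": xmin, "xmax": xmax,
               "ymin": ymin, "ymax": ymax}

# Two made-up strokes, as (xmin, xmax, ymin, ymax) tuples.
bboxes = [(0, 10, 0, 5), (12, 20, 1, 6)]

# from_records consumes the generator and uses path_index as the index.
df = pd.DataFrame.from_records(stroke_records(bboxes), index="path_index")
```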
11:47
Then on to the letter extraction. The problem is quite computationally intensive if you address it from an algorithmic standpoint. You need to compare each stroke with all nearby
12:03
strokes which might have something to do with it and merge them together as letters. And what I found is that using a data frame with simple iteration and filtering, albeit over multiple passes, you can do that easily and quite quickly as well.
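The iterate-then-filter pass might look roughly like this sketch; the data is made up and the full-overlap test is a stand-in for the project's page-and-a-half merge procedure:

```python
import pandas as pd

# Three made-up stroke bounding boxes; stroke 1 sits fully inside stroke 0.
df = pd.DataFrame({
    "xmin": [0, 2, 50], "xmax": [10, 8, 60],
    "ymin": [0, 1, 0],  "ymax": [10, 9, 10],
})

merged, unmerged = set(), set(df.index)
for idx, row in df.iterrows():
    if idx in merged:
        continue  # already absorbed into another stroke this pass
    # Boolean filtering: candidate strokes whose bounding box lies
    # entirely inside the current stroke's bounding box.
    contained = df[(df.index != idx)
                   & (df["xmin"] >= row["xmin"]) & (df["xmax"] <= row["xmax"])
                   & (df["ymin"] >= row["ymin"]) & (df["ymax"] <= row["ymax"])]
    for cand in contained.index:
        merged.add(cand)
        unmerged.discard(cand)
```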
12:21
So, the multiple passes are done by basically taking the data frame and returning it modified, along with two sets of indices, one for merged paths and one for yet-unmerged paths, which you can see here. Using the data frame you can easily extract those, and then each
12:46
of the steps, of which I'm going to show one, merging the fully overlapping paths, basically looks like this. So we iterate over the
13:01
data frame, taking each path in sequence, and then we filter the data frame, in this case for all the paths whose bounding boxes fully overlap with the current path's. We take these as candidates, a subset of the data frame, then we run
13:21
a fairly complicated merge procedure, which I won't show because it's like a page and a half. But basically what it does is update the data frame so that when you merge two paths they share the same bounding box: it updates the xmin, xmax and so on of both to match their combined bounding box, and also updates those merged and unmerged
13:49
sets and returns the data frame. And after each of those steps we run an update data frame step which calculates additional properties for each of the paths and since
14:04
pandas allows this quite easily, you can chain assignments like this: for example calculating the width or the height of the bounding box, the half-width and half-height which are used in some of the merge steps, also the area (width multiplied by height) and the
14:24
aspect ratio (width divided by height). And finally we need to sort the values so that they come in natural writing order, top to bottom, left to right. Once we have this, we have a bunch of smaller files, letter files, which we then need to classify, and
14:47
this is a deliberately manual process, as per the client's requirements. There is an external tool they already use for this sort of thing; no pandas there, unfortunately. So it loads
15:01
the merged and unclassified letter SVGs, shows them one by one to a human, allows the human to align them in the box of the letter or the background and also allows them to label them, like this is a dollar sign, this is a capital A, this is a lower
15:21
case L and so on. Once we have this we have labelled SVG letter files, letter variants and then we come down to the word building. So this is an example of an intermediate output of the algorithm which is a debug version showing the letters, their bounding
15:42
boxes in green and the running baseline of the word which is the line along which all the letters are aligned so it looks like they're written on the same line. So it takes a single word as an input, for example testing, it does a selection process
16:04
for each letter, either sequentially or randomly, it picks a labelled variant for that letter, then does horizontal composition, merging the selected variants with variable kerning, which is a typical typographical term for the spacing between the letters, and then there
16:25
is a vertical alignment step which, according to the running baseline, aligns certain letters, like for example g, y and others, so they sit either below the baseline or above the
16:41
baseline as needed and outputs a single SVG file for that word in the same size. So the labelling, just to remind you how it looks, basically it takes as an input an Excel file with mail addresses, no surprise here, Pandas works great with this, so the
17:02
structure is one row per label, one column per line, as simple as parsing with pandas read_excel. And the generation stage builds words with variable kerning, one for each column, and the alignment is done with so-called variable leading, so
17:27
the leading is the vertical equivalent of kerning, so the spacing between the lines and that's it basically, so I think I should tell you what I learned from this process
17:40
basically. So pandas is great for any sort of table-based data processing, which was kind of an unexpected discovery for me. It might be intimidating at first if you haven't used it, there is a lot to read, but if you learn just a few things and start from there, like filtering and iteration, you can go a long way.
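The derived-column step the talk mentions (width, height, area, aspect, then sorting into writing order) can be sketched with chained assignments; the column names are illustrative:

```python
import pandas as pd

# Two made-up letter bounding boxes.
df = pd.DataFrame({"xmin": [5, 0], "xmax": [9, 10],
                   "ymin": [20, 0], "ymax": [24, 2]})

df = (df.assign(width=lambda d: d["xmax"] - d["xmin"])
        .assign(height=lambda d: d["ymax"] - d["ymin"])
        .assign(area=lambda d: d["width"] * d["height"])
        .assign(aspect=lambda d: d["width"] / d["height"])
        # natural writing order: top to bottom, then left to right
        .sort_values(["ymin", "xmin"]))
```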
18:06
Also take time to understand the indexing and the power of multi index because that gives you the power to deal with multidimensional data in a very comprehensive way.
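A tiny illustration of what the multi-index buys you, using a made-up (user, letter) index loosely modeled on the letter-variant data:

```python
import pandas as pd

# Hypothetical letter variants keyed by (user, letter).
idx = pd.MultiIndex.from_tuples(
    [("alice", "a"), ("alice", "b"), ("bob", "a")],
    names=["user", "letter"])
df = pd.DataFrame({"width": [4.0, 5.0, 3.5]}, index=idx)

alice = df.loc["alice"]                  # every letter from one user
a_variants = df.xs("a", level="letter")  # one letter across all users
```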
18:22
Then also, of course, any time you need to deal with CSV or Excel, which is quite a pain otherwise, with pandas it's trivial and fast; it doesn't have to be financial data or anything. And also the documentation is great. There is a lot to read, so it could
18:41
be a bit confusing at first, but I would suggest starting with "10 minutes to pandas", which is one of the main sections of the documentation. There are also a lot of tutorials now, a lot of cookbooks, you know, hands-on guides, and it grew a lot; there was actually
19:02
recently a documentation sprint for pandas which expanded it even further. So with that, I have just one more thing to say. Please consider buying Wes McKinney's book Python for Data Analysis, because it's great and it will help you a lot with your
19:23
journey into Pandas. And I'll be happy to take any questions. Thank you. Thanks very much. Are there any questions? We've got lots of time. Sorry, I may ask a silly question. I know you said all we need is Pandas. Have you
19:49
met any, I mean, in your practical use case, in your practical life, work, have you met some limitations of pandas? Oh yeah. Well, there are quirks that you tend to learn to live with, but you tend to overcome as well. Like for
20:05
example dealing with any sort of numerical data that can have gaps in it, or possibly strings or anything. They turn up as NaNs instead of, you know, something else. So if you expect to get integers, you might get floats instead. But yeah,
20:23
that's the type conversion, that is one thing. Another use case I would like to raise to our community is from my work. In that case, the data input we got is a nested JSON file. It's a nested JSON stream. So
20:47
the thing is, you know, if you use pandas read_json, it can only process one level. Yeah. So that makes it very... I haven't used it personally for JSON. I think Postgres is better for that, if you can afford it. I mean,
21:01
if you can have it at hand. I mean, my solution is that I had to write my own library to process this into a data frame. But that's quite static. So I was always thinking whether pandas could absorb this feature, basically, to analyse the JSON files. Because the output is always...
21:23
Even though it's nested JSON, the output will be a pandas data frame. So I was thinking if pandas could absorb the feature: basically, step one, analyse the JSON file to identify the keys. Step two,
21:41
just crunch and get the data frame out. It would be an improvement for pandas. But pandas is splendid, I agree. My question is about the limitations of pandas. Yeah. So I'm sure you can go a long way using pandas for some part
22:00
of that process, you know, reading the nested JSON. And for sure, if you can convert it to something more tabular, you'll get a lot more out of Pandas. Cool. Are there any other questions? No?
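As a footnote to the nested-JSON question: newer pandas versions ship json_normalize, which flattens nested dicts into dotted columns. A minimal sketch with made-up records:

```python
import pandas as pd

# Two made-up nested records, like a small nested-JSON stream.
records = [
    {"id": 1, "user": {"name": "ann", "city": "Oslo"}},
    {"id": 2, "user": {"name": "bo", "city": "Kyiv"}},
]

# json_normalize flattens the nested dicts into "user.name", "user.city".
df = pd.json_normalize(records)
```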
22:20
I hope you try it. All right. Thanks. Thank you.