Big Data with Python & Hadoop
Formal Metadata
Title: Big Data with Python & Hadoop
Part Number: 104
Number of Parts: 173
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/20163 (DOI)
Production Place: Bilbao, Euskadi, Spain
EuroPython 2015, 104 / 173
Transcript: English (auto-generated)
00:04
hopefully interesting. So, hello, welcome everyone. I hope you're enjoying EuroPython as much as I do. And for the next 45 minutes, you can just sit, relax and enjoy the talk about big data with Python and Hadoop.
00:22
Slides are already at slideshare.net and I'll give you the link at the end of the talk. And this is our agenda for today. At first, a quick introduction about me and my company, so you get an idea of what we use Hadoop for.
00:40
Then a few words about big data, Apache Hadoop and its ecosystem. Next, we'll talk about HDFS and third party tools that can help us to work with HDFS. After that, we'll briefly discuss MapReduce concepts and talk about how we can use Python with Hadoop.
01:00
What options do we have, like what third party libraries are out there regarding Python, of course, and their pros and cons. Next, we'll briefly discuss a thing called Pig. And finally, we'll see the benchmarks of all the things we've talked about earlier. These are freshly baked benchmarks
01:23
which I made a week ago just before coming to EuroPython and they're actually quite interesting. And of course, conclusions. By the way, can you please raise your hands? Who knows what Hadoop is, is working with Hadoop, or maybe worked with Hadoop in the past? Okay, okay, thanks.
01:42
Not too much. All right. This is me. My name is Max. I live in Moscow, Russia. I'm the author of several Python libraries. There's a link to my GitHub page if you're interested. I also give talks on different conferences from time to time and contribute
02:00
to other Python libraries. I work for the company called Adata. We collect and process online and offline user data to get the idea of user's interests, intentions, demography, and so on. In general, we process like more than 70 million users per day.
02:20
There are more than 2,000 segments in our database like users who are interested in buying a BMW car or users who like dogs or maybe users who watch porn online, you know? We have partners like Google DBM, Turn, AppNexus, and many more.
02:41
We have quite a big worldwide user coverage and we process data for more than one billion unique users in total. We have one of the biggest user coverage in Russia and Eastern Europe. For example, for Russia, it's about 75% of all users. So having said all that, you can see that we have a lot of data to process
03:01
and we consider ourselves a data-driven company or a big data company like some people like to call it now. So what exactly is big data? There is actually a great quote by Dan Ariely about big data. Big data is like teenage sex.
03:21
Everyone talks about it. Nobody really knows how to do it. Everyone thinks everyone else is doing it so everyone claims they are doing it. Nowadays, actually, big data is mostly a marketing term or a buzzword. Actually, there is even a tendency of arguing
03:41
like how much data is big data and different people tell different things. In reality, of course, only a few have real big data like Google or CERN, but to keep it simple for the rest of people, big data can be probably considered big if it doesn't fit into one machine
04:01
or it can't be processed by one machine or it takes too much time to process by one machine. But the last point, though, can also be a sign of big problems in code and not big data. Now that we figured out that we probably have a big data problem, we need to solve it somehow.
04:21
This is where Apache Hadoop comes into play. Apache Hadoop is a framework for distributed processing of large data sets across clusters of computers. It's often used for batch processing and this is a use case where it really shines. It provides linear scalability which means
04:42
that if you have twice as many machines, jobs will run twice as fast and if you have twice as much data, jobs will run twice as slow. It doesn't require super cool expensive hardware. It is designed to work on unreliable machines that are expected to fail frequently.
05:03
It doesn't expect you to have knowledge of inter-process communication or threading, RPC or network programming and so on, because parallel execution across the whole cluster is handled for you transparently. Hadoop has a giant ecosystem which includes
05:21
a lot of projects that are designed to solve different kinds of problems and some of them are listed on this slide. More just didn't fit in. HDFS and MapReduce are actually not a part of ecosystem but a part of Hadoop itself and we'll talk about them on the next slides and we'll also discuss Pig which is a high level language
05:42
for parallel data processing using Apache Hadoop. I won't talk about the others because we simply don't have time for it so if you are interested, you can Google this for yourself. So HDFS, it stands for Hadoop Distributed File System.
06:01
It just stores files in folders, it chunks files into blocks and blocks are scattered randomly all over the place. By default, the block is 64 megabytes but this is configurable and it also provides a replication of blocks. By default, three replicas of each block are created but this is also configurable.
06:23
HDFS doesn't allow you to edit files, only create, read and delete them, because it is very hard to implement edit functionality in a distributed system with replication. So what they did was just, you know, why bother implementing file editing
06:40
when we can just make them not editable? Hadoop provides a command line interface to HDFS but the downside of this is that it is implemented in Java and it needs to spin up a JVM which takes up from one to three seconds before a command can be executed
07:01
which is a real pain, especially if you are trying to write some scripts and so on. But thankfully, thanks to the great guys at Spotify, there is an alternative called Snakebite. It's an HDFS client written in pure Python. It can be used as a library in your Python scripts
07:22
or as a command line client. It communicates with Hadoop via RPC which makes it amazingly fast, much, much faster than native Hadoop command line interface. And finally, it's a little bit less to type to execute a command, so Python for the win.
07:40
But there is one problem though. Snakebite doesn't handle write operations at the moment. So while you are able to do metadata operations like moving files or renaming them, you can't write a file to HDFS using Snakebite. But it is still in very active development so I'm sure this will be implemented at some point.
08:04
This is an example of how Snakebite can be used as a library in Python scripts. It's very easy: we just import the client, connect to Hadoop and start working with HDFS. It's really amazing and simple.
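For reference, a minimal sketch along the lines of the Snakebite example being described (the NameNode hostname, port and paths are illustrative, and the code on the slide may differ):

from snakebite.client import Client

# Connect to the HDFS NameNode over RPC -- no JVM spin-up involved.
client = Client('namenode.example.com', 8020)

# List a directory; Snakebite methods return generators.
for entry in client.ls(['/user/max']):
    print(entry['path'])

# Metadata operations such as renaming also work; wrapping the lazy
# generator in list() forces it to actually execute.
list(client.rename(['/user/max/old.txt'], '/user/max/new.txt'))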
08:21
There is also a thing called Hue. Hue is a web interface for analyzing data with Hadoop. It provides an awesome HDFS file browser. This is how it looks. You can do everything that you can do through the native HDFS command line interface using Hue.
08:40
It also has a job browser, a designer for jobs so you can develop Pig scripts and Hive queries and a lot more stuff. It supports ZooKeeper, Oozie and many more. I won't go into details about Hue because again we don't have time for this
09:02
but this is a tool that you'll love; if you don't use it, try it. By the way, it's made on top of Python and Django. So again, Python for the win. So now that we know how Hadoop stores its data,
09:20
we can talk about MapReduce. It's a pretty simple concept. There are mappers and reducers and you have to code both of them because they're actually doing data processing. What mappers basically do is they load data from HDFS, they transform, filter or prepare this data somehow
09:44
and output a pair of key and value. Mappers output then goes to reducers but before that some magic happens inside Hadoop and mappers output is grouped by key. This allows you to do stuff like aggregation,
10:02
counting, searching and so on in the reducer. So what you get in the reducer is the key and all values for that key and after all reducers are complete, the output is written to HDFS.
10:20
So actually the workflow between mappers and reducers is a little bit more complicated. There is also a shuffle phase, a sort and sometimes secondary sort, your combiners, partitioners and a lot of different other stuff but we won't discuss that at the moment. It doesn't matter for us. It's perfectly fine to consider
10:41
that there are just the mappers and reducers and some magic happening between them. Now let's have a look at an example of MapReduce. We will use the canonical word count example that everybody uses. So we have a text used as an input
11:01
which consists of three lines: Python is cool, Hadoop is cool, and Java is bad.
11:20
So it will be processed line by line like this, and inside a mapper each line will be split into words. Like this. So for each word, the map function will return the word and the digit one, and it doesn't matter if we met this word twice
11:42
or three times; we just output the word and the digit one. Then some magic provided by Hadoop happens, and inside the reducer we get all the values for a word, grouped by this word.
12:01
So we just need to sum up these values in the reducer to get the desired output. This may seem unintuitive or complicated at first, but actually it's perfectly fine and when you're just starting to do MapReduce, you have to make your brain think in terms of MapReduce
12:22
and after you get used to it, it will all become very clear. So this is the final result for our job. Now let's have a look at how our previous word count example looks in Java. So now you probably understand why you earn
12:43
so much money when you code in Java, because more typing means more money, and can you imagine how much code you would write for a real world use case using Java?
13:01
So now after you've been impressed by the simplicity of Java, let's talk about how we can use Python with Hadoop. Hadoop doesn't provide a way to work with Python natively. It uses a thing called Hadoop streaming.
13:20
The idea behind this streaming thing is that you can supply any executable to Hadoop as a mapper or reducer. It can be standard Unix tools like cat or uniq or whatever, or Python scripts or Perl scripts or Ruby or PHP
13:41
or whatever you like. So the executable must read from standard in and write to standard out. This is a code for a mapper and reducer. The mapper is actually very simple. We just read from standard input line by line and we split it into words
14:01
and output the word and the digit one using a tab as the separator, because that's the default Hadoop separator. You can change it if you like. So one of the disadvantages of using streaming directly is the input to the reducer.
14:24
I mean it's not grouped by key. It's coming line by line so you have to figure out the boundaries between keys by yourself. And this is exactly what we do here in the reducer. We are using a groupby
14:43
and it groups multiple word count pairs by word and it creates an iterator that returns consecutive keys and the group. So the first item is the key and the values,
15:03
the first item of each value is also the key, so we just filter it out, we use an underscore for it, and then we cast the value to int to sum it up.
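For reference, a minimal sketch of the streaming mapper and reducer being described here (file names are illustrative; tab is assumed as the key/value separator):

#!/usr/bin/env python
# mapper.py -- read lines from standard input, emit "word<TAB>1".
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py -- Hadoop sorts the mapper output by key, so identical words
# arrive on consecutive lines; itertools.groupby finds the key boundaries.
import sys
from itertools import groupby
from operator import itemgetter

pairs = (line.rstrip('\n').split('\t', 1) for line in sys.stdin)
for word, group in groupby(pairs, key=itemgetter(0)):
    print('%s\t%d' % (word, sum(int(count) for _, count in group)))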
15:22
It's pretty awesome compared to how much you have to type in Java, but it's still maybe a little bit complicated because of the manual work in the reducer. This is the command which sends our MapReduce job to Hadoop via Hadoop streaming, and we need to specify the Hadoop streaming jar
15:44
and a path to a mapper and reducer using the mapper and reducer arguments and input and output. One interesting thing here is the two file arguments where we specify the path to a mapper and reducer again
16:01
and we do that to make Hadoop understand that we want to upload these two files to the whole cluster. It's called the Hadoop distributed cache. It's a place where it stores all files and resources that are needed to run a job, and this is a really cool thing
16:21
because imagine you have a small cluster of four machines and you just wrote a pretty cool script for your job and you used an external library which is not installed on your cluster obviously. So if you have like four machines,
16:42
you can log into every machine and install this library by hand, but what if you have a big cluster, like 100 machines or 1,000 machines? That just won't work anymore. Of course you could create some bash script or something that will do the automation for you,
17:03
but why do that if Hadoop provides a way to do that? So you just specify what you want Hadoop to copy to the whole cluster before the job starts and that's it. And also after the job is completed,
17:22
Hadoop will delete everything and your cluster will be in its initial state again. It's pretty cool. And after our job is complete, we get the desired results. So Hadoop streaming is really cool but it requires you to do a little bit of extra work
17:42
and though it's still much simpler compared to Java, we can simplify it even more with the help of different Python frameworks for working with Hadoop. So let's do a quick overview of them. The first one is Dumbo. It was one of the earliest Python frameworks for Hadoop
18:03
but for some reason it's not developed anymore. There's no support, no downloads, so let's just forget about it. There is Hadoopy, or Hadoopie, I don't know how to pronounce it. It's the same situation as with Dumbo. The project seems to be abandoned
18:20
and there are still some people trying to use it according to PyPI downloads. So if you want, you can also try it; I don't. There is Pydoop. It's a very interesting project: while other projects are just wrappers around Hadoop streaming,
18:41
Pydoop uses a thing called Hadoop pipes, which is basically a C++ API to Hadoop, and it makes it really fast, as we'll see. There's also the Luigi project. It's also very cool. It was developed at Spotify and it is maintained by Spotify.
19:02
Its distinguishing feature is that it has the ability to build complex pipelines of jobs and supports many different technologies which can be used to run the jobs. And there is also a thing called mrjob.
19:20
It's the most popular Python framework for working with Hadoop. It was developed by Yelp and it's also cool. But there are some things to keep in mind while working with it. So we'll talk about Pydoop, Luigi and mrjob in more detail in the next slides.
19:44
So the most popular framework is called MapReduce Job, or mrjob, or Mr. Job like some people like to call it. I also like this. Mr. Job is a wrapper around Hadoop streaming and it is actively developed by Yelp, maintained by Yelp and used inside Yelp.
20:03
This is how our word count example can be written using Mr. Job. It's even more compact. So while the mapper looks absolutely the same as in raw Hadoop streaming, just notice how much typing we saved in the reducer.
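For reference, a minimal sketch of the Mr. Job word count being described (assuming the standard mrjob MRJob API; the class and file names are illustrative):

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # mrjob hands us one line at a time; the input key is ignored.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # counts is an iterator over every value emitted for this word.
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

You can run it locally for testing with something like python word_count.py input.txt, or against a cluster with the -r hadoop runner.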
20:23
But behind the scenes, Mr. Job is actually doing the same groupby aggregation which we saw previously in the Hadoop streaming example. But as I said, there are some things to keep in mind.
20:41
Mr. Job uses so-called protocols for data serialization and deserialization between phases. And by default, it uses a JSON protocol which itself uses Python's json library, which is kind of slow. And so the first thing you should do
21:00
is to install simplejson because it is faster. Or, starting from Mr. Job 0.5.0, which I think is still in development, it supports the ultrajson library, which is even faster.
21:21
So, this is how you can specify this ultrajson protocol and again, this is available only starting from 0.5.0. Lower versions use simplejson, which is slower. Mr. Job also supports a raw protocol
21:42
which is the fastest protocol available, but you have to take care of serialization and deserialization yourself, as shown on this slide. So notice how we cast the one to a string in the mapper and the sum to a string in the reducer.
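For reference, a sketch of the raw-protocol variant being described, assuming mrjob's RawProtocol (keys and values travel between phases as plain tab-separated strings, so the numbers are cast to and from strings by hand):

from mrjob.job import MRJob
from mrjob.protocol import RawProtocol

class MRWordCountRaw(MRJob):

    # Protocol used between the map and reduce phases and for the output.
    INTERNAL_PROTOCOL = RawProtocol
    OUTPUT_PROTOCOL = RawProtocol

    def mapper(self, _, line):
        for word in line.split():
            yield word, str(1)                           # serialize by hand

    def reducer(self, word, counts):
        yield word, str(sum(int(c) for c in counts))     # and back again

if __name__ == '__main__':
    MRWordCountRaw.run()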
22:02
Also with the introduction of ultra JSON in the next version of Mr. Job, I don't think there is a need to use these raw protocols because they are not so much faster actually compared to ultra JSON and at least most of the time,
22:21
of course it depends on the job. And so you have to experiment for yourself and see what fits best for you. So Mr. Job pros and cons. In my opinion, it has like best documentation compared to other Python frameworks.
22:41
It has the best integration with Amazon's EMR, which is Elastic MapReduce, compared to other Python frameworks, because Yelp operates inside EMR, so it's understandable. It has very active development,
23:01
biggest community, it provides really cool local testing without Hadoop which is very convenient while doing development. And it also automatically uploads itself to a cluster. And it supports multi-step jobs
23:21
which means that one job will start only after the other one has successfully finished. Or you can also use bash utilities or jar files or whatever in this multi-step workflow. The only downside that I can think of is slow serialization and deserialization
23:43
compared to raw Python streaming. But compared to how much typing it saves you, we can probably forgive it for that. So this is not really a big con. The next in our list is Luigi.
24:03
Luigi is also a wrapper around Hadoop streaming and it is developed by Spotify. This is how our word count example can be written using Luigi. It is a little bit more verbose compared to Mr. Job
24:20
because Luigi concentrates mainly on the total workflow and not only on a single job. And it also forces you to define your input and output inside a class and not from a command line interface.
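For reference, a sketch of the Luigi word count being described (assuming Luigi's hadoop contrib module; task, parameter and path handling are illustrative):

import luigi
import luigi.contrib.hadoop
import luigi.contrib.hdfs

class InputText(luigi.ExternalTask):
    # Points at an input file that already exists on HDFS.
    path = luigi.Parameter()

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget(self.path)

class WordCount(luigi.contrib.hadoop.JobTask):
    # Input and output are defined on the task itself, not on the command line.
    source = luigi.Parameter()
    destination = luigi.Parameter()

    def requires(self):
        return InputText(self.source)

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget(self.destination)

    def mapper(self, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    luigi.run()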
24:40
As for the mapper and reducer implementation, they are absolutely the same. Four minutes left.
25:03
I have so much to say. Four minutes, okay, okay. So Luigi also has this problem with serialization, deserialization, and you also have to use ultra-json. Just use ultra-json and everything will be cool.
25:26
Okay, so we'll probably skip that. It's also cool, Luigi is cool. But not so good for local testing. And we'll also skip Pydoop. Okay, okay.
25:41
Okay, oh man. All right, all right, okay, benchmarks. This is the most important part. So this is probably why a lot of people are there for the benchmarks. So this is a cluster and software that I used to do the benchmarks.
26:01
So the job was a simple word count on a well-known book about Python by Mark Lutz, and I multiplied it 10,000 times, which gave me 35 gigabytes of data. And I also used a combiner between the map
26:20
and reduce phase. So a combiner is basically a local reducer which just runs after a map phase. And it is kind of an optimization. So this is it. This is the table. Java is fastest of course, no surprise here.
26:40
So it is used as a baseline for performance. All numbers for other frameworks are just ratios relative to Java values. So for example, we have a job run time for Java like 187 seconds which is three minutes and something.
27:03
To get the number for Pydoop, you need to multiply 187 by 1.86, which gives you about 348 seconds, which is almost six minutes. So for each job, I ran the job three times
27:20
and the best time was taken. And so let's discuss a few things about this performance comparison. So Pydoop is the second after Java because it uses this Hadoop pipes C++ API. It is still almost twice as slow
27:42
compared to the native Java, but another thing that may seem strange is the 5.97 ratio in the reduce input records. So it looks like the combiners didn't run, but there is an explanation for that in the Pydoop manual. It says the following.
28:02
One thing to remember is that the current Hadoop pipes architecture runs the combiner under the hood of the executable run by pipes. So it doesn't update the combiner counters of the general Hadoop framework. So this is why we have this. Then comes Pig.
28:21
I actually thought that Pig would be the second after Java before I ran these benchmarks, but unfortunately I didn't really have time to investigate the reasons. So I just can't say why it is slower, because Pig translates itself into Java, so it should be almost as fast as Java.
28:42
Then comes raw streaming under CPython and PyPy and you probably may be surprised that PyPy, no? Okay, do you have any questions or I just can continue?
29:05
Okay, okay, so. Yeah, so it's actually, I'm speaking for half an hour and this is a 45 minute talk, so I still have 15 minutes.
29:20
No questions, you see. Okay, so yeah, CPython and PyPy. Yeah, you probably may be a bit surprised that PyPy is slower, but actually the thing is that the word count is a really simple job
29:45
and PyPy is currently slower than CPython when dealing with reading and writing from standard in and standard out. So it really depends on the job.
30:01
In real world use cases, PyPy is actually a lot faster than CPython. So what we usually do is implement a job and then just run it on PyPy and CPython and see what the difference is. And like I said, in most cases, PyPy wins.
30:22
So just try for yourself and see what fits best for you. Then comes Mr. Job and as you see, ultra JSON is just a little bit slower than these raw protocols but it saves you
30:43
the pain of dealing with manual work. So just I think use ultra JSON. And finally Luigi which is much, much slower even with ultra JSON than Mr. Job.
31:01
And I don't even want to talk about the terrible performance using its default serialization scheme. So okay, we still have a little time, like 15 minutes, so I can probably go back. Yes, okay, so we stopped at, I think, this or this.
31:27
Yeah, this one. So Luigi, as we just saw, Luigi uses, by default it uses a serialization scheme
31:47
which is really, really slow. So this is how you can switch to JSON and I didn't really have time to investigate also
32:01
but after switching to JSON, I needed to specify an encoding by hand. So I don't know, it's also something to keep in mind. And don't forget to install ultrajson, because by default Luigi falls back to the standard library's JSON, which is slow.
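For reference, a sketch of the switch being described, assuming the data_interchange_format attribute of Luigi's hadoop JobTask ("python" is the slow default; "json" uses the JSON codec, which is much faster with ujson installed):

import luigi.contrib.hadoop

class WordCount(luigi.contrib.hadoop.JobTask):
    # Switch the serialization between phases from the default to JSON;
    # install ujson so the JSON codec stays fast.
    data_interchange_format = "json"

    def mapper(self, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)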
32:21
So okay, pros and cons. Luigi is the only real framework that concentrates on the workflow in general. It provides a central scheduler which has a nice dependency graph of the whole workflow
32:44
and it records all the tasks and all the history so it can be really useful. It is also in very active development and it has a big community, not as big as Mr. Job but still very big.
33:02
It also automatically uploads itself to the cluster, and this is the only framework that has integration with Snakebite, which is awesome. Just believe me. It provides not so good local testing compared to Mr. Job, because you need to mimic
33:23
the map and reduce functions by yourself in the run method, which is not very convenient, and it has the worst serialization and deserialization performance, even with ultrajson.
33:42
So the last of the Python frameworks that I want to talk about is Pydoop. Unlike the others, it doesn't wrap Hadoop streaming but uses Hadoop pipes. It is developed by CRS4, which is the Center for Advanced Studies, Research and Development
34:01
in Sardinia, Italy. And this is an example of word count in Pydoop, which looks very similar to Mr. Job, but unlike Mr. Job or Luigi, you don't need to think about different serialization and deserialization schemes. Just concentrate on your mappers and reducers, on your code, and just do your job. So it's cool.
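For reference, a minimal sketch of the Pydoop word count being described (assuming the pydoop.mapreduce API of Pydoop 1.x):

import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes

class Mapper(api.Mapper):

    def map(self, context):
        # context.value is the current input line; emit (word, 1) pairs.
        for word in context.value.split():
            context.emit(word, 1)

class Reducer(api.Reducer):

    def reduce(self, context):
        # context.values iterates over all counts emitted for context.key.
        context.emit(context.key, sum(context.values))

def __main__():
    # Pydoop pipes looks for a __main__ entry point in the script.
    pipes.run_task(pipes.Factory(Mapper, Reducer))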
34:23
Okay, so pros and cons. Okay, okay. I'll do my best.
34:40
So Pydoop has pretty good documentation. It could be better, but generally it's very good. Due to the use of Hadoop pipes, it is amazingly fast. It also has active development, and it provides an HDFS API based on the libhdfs library
35:02
which is cool because it is faster than the native Hadoop HDFS command line client, but it is still slower than Snakebite. I didn't benchmark this, but the Spotify guys claim that it's slower. And it is slower because it still needs to spin up
35:24
an instance of the JVM, so I can believe them that Snakebite is faster. This is the only framework that gives you the ability to implement a record reader, record writer and partitioner in pure Python.
35:42
And these are some kind of advanced Hadoop concepts so we won't discuss them, but the ability to do that is really cool. The biggest con is that Pydoop is very difficult to install, because it is written in C, Python and Java.
36:00
So you have to have all the needed dependencies, plus you need to correctly set some environment variables and so on, and I saw a lot of posts on Stack Overflow and on other sites where people just got stuck on the installation process. And probably because of that Pydoop has a much smaller
36:23
community, so the only place where you can ask for help is the GitHub repository of Pydoop. But the authors are really very helpful, they are cool guys, so yeah, they answer all the questions and so on.
36:43
Also, Pydoop doesn't upload itself to the cluster like the other Python frameworks do, so you need to do this manually, and it's not such a trivial process if you're just starting to work with Hadoop.
37:03
So this is it. So Pig, Pig is an Apache project, it is a high level platform for analyzing data.
37:22
It runs on top of Hadoop but it's not limited to Hadoop. This is a word count example using Pig. It will be translated to map and reduce jobs behind the scenes for you and you don't have to think about what is my mapper, what is my reducer,
37:43
you just write your Pig scripts. And also most of the time in real world use cases, Pig is faster than Python so this is really cool.
38:01
It is a very easy language which you can learn in a day or two or something. It provides a lot of functions to work with data, to filter it and so on. And the biggest thing is that you can extend Pig functionality with Python using Python UDFs.
38:23
You can write them in CPython, which gives you access to more libs, but it's slower because it runs as a separate process and sends and receives data via streaming.
38:41
And you can also use Jython which is much much faster because it compiles UDFs to Java and you don't need to leave your JVM to execute your UDF but you don't have access to libraries like NumPy and you know, SciPy and so on so yeah.
39:01
This is an example of a Pig UDF for getting geo data from an IP address using a well known library from MaxMind. It may seem complicated at first, but it's not actually.
39:23
So in the Jython part at first we import stuff, some stuff from Java and the library itself. Then we instantiate the reader object and define the UDF which is simple.
39:40
And it accepts the IP address as the only parameter and then tries to get a country code and the city geo name from the MaxMind database. It is also decorated with Pig's output schema decorator, and you need to specify the output schema of the UDF
40:03
because Pig is statically typed. Then we put this UDF into a file called geoip.py, and as for the Pig part, we need to register this UDF first and then we can simply use it as shown here.
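For reference, a hedged sketch of the kind of Jython UDF being described; the MaxMind classes below assume the GeoIP2 Java API, and the exact library, schema and names on the slide may differ:

# geoip.py -- Jython UDF; Pig injects the outputSchema decorator when the
# script is registered with "USING jython".
from java.io import File
from java.net import InetAddress
from com.maxmind.geoip2 import DatabaseReader

# Instantiate the reader once, when Pig loads the script.
reader = DatabaseReader.Builder(File('/path/to/GeoLite2-City.mmdb')).build()

@outputSchema('geo:(country_code:chararray, city:chararray)')
def get_geo(ip):
    # The UDF takes the IP address as its only parameter.
    try:
        response = reader.city(InetAddress.getByName(ip))
        return response.getCountry().getIsoCode(), response.getCity().getName()
    except Exception:
        return None

# And in the Pig script, roughly:
#   REGISTER 'geoip.py' USING jython AS geo;
#   enriched = FOREACH logs GENERATE ip, geo.get_geo(ip);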
40:23
So it's really simple concept when you get used to it. Yeah. There is also a thing called embedded Pig. This one. So we already saw benchmarks. So conclusions.
40:41
So for complex workflow organization, job chaining and HDFS manipulation, use Luigi and Snakebite. This is, yeah, this is a use case where they really shine. Snakebite is the fastest option out there to work with HDFS.
41:01
But you have to fall back to the native Hadoop command line interface, of course, if you need to write something to HDFS. But just don't use Luigi for the actual MapReduce implementation, at least until the performance problems are fixed. For writing lightning-fast MapReduce jobs,
41:21
and if you aren't afraid of difficulties in the beginning, use Pydoop and Pig. These are the two fastest options out there except for Java. The problem with Pig is that it's not Python so you have to learn it. It's a new technology to learn but it's worth it.
41:41
And Pydoop, while it may be very difficult to start using because of the installation problems and so on, is the fastest Python option. And it gives you the ability to implement record readers and writers in Python, which is priceless.
42:05
For development, local testing or perfect Amazon EMR integration, use Mr. Job. It provides the best integration with EMR. It also gives you the best local testing and development experience compared to other Python frameworks.
42:25
So in conclusion, I would like to say that Python has really, really good integration with Hadoop. It provides us with great libraries to work with Hadoop. Well, the speed is not that great of course
42:42
compared to Java, but we love Python not for its speed but for its simplicity and ease of use. And by the way, if you are wondering what the most frequently used word in Mark Lutz's book on Python is, without counting things like prepositions, conjunctions and so on,
43:02
this word was used 3,979 times and this word is of course Python. So this is all I've got. You can find the slides and the code I used for the benchmarks on SlideShare and GitHub. So thank you.