Add to Watchlist

Big Data with Python & Hadoop


Citation of segment
Embed Code
Purchasing a DVD Cite video

Formal Metadata

Title Big Data with Python & Hadoop
Title of Series EuroPython 2015
Part Number 104
Number of Parts 173
Author Tepkeev, Max
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
DOI 10.5446/20163
Publisher EuroPython
Release Date 2015
Language English
Production Place Bilbao, Euskadi, Spain

Content Metadata

Subject Area Computer Science
Abstract Max Tepkeev - Big Data with Python & Hadoop Big Data - these two words are heard so often nowadays. But what exactly is Big Data ? Can we, Pythonistas, enter the wonder world of Big Data ? The answer is definitely "Yes". This talk is an introduction to the big data processing using Apache Hadoop and Python. We'll talk about Apache Hadoop, it's concepts, infrastructure and how one can use Python with it. We'll compare the speed of Python jobs under different Python implementations, including CPython, PyPy and Jython and also discuss what Python libraries are available out there to work with Apache Hadoop. This talk is intended for beginners who want to know about Hadoop and Python or those who are already working with Hadoop but are wondering how to use it with Python or how to speed up their Python jobs.
Keywords EuroPython Conference
EP 2015
EuroPython 2015

Related Material

if the hopefully interesting so hello welcome everyone I hope you
enjoy your Python as much as I do and for the next 45 minutes you can just sit relax and enjoy the talk about the data with Python and Hadoop some slides are already in slideshare . net and I'll give you the link at the end of the talk and this is our agenda for today at 1st a quick
introduction about me my company you get an idea about what do we use for then a few words about the data it back to you but it's ecosystem next will talk about RDFS and third-party tools that can help us to work with the defense after that will briefly discuss MapReduce concepts and talk about how wikinews Python with Hadoop what options do we have a like was hit by the libraries are there Python of course about the pros and cons next we briefly discuss a thing called pig and finally will see the benchmarks of all the things we've talked about earlier the freshly baked benchmarks which I a week ago just before coming to you by Python and they are actually quite interesting
everquest conclusions by the way can you please raise your hand school knows what had you please working with Hadoop or maybe worked with Hadoop and past OK OK thanks not too much so this is me
and my name is max in Moscow Russia and I'm the author of several Python library so there's a link to like it have a diff you interested elsevier talks from different cost conferences from time to time and contribute to other . 11 I work for the
company called we collect and process online and offline use data to get the ideal fuses interests intentions demography and so on In general process like more than 70 million users per day there are more than 2 thousand segments now database like users who I interested in buying the BMW car wall users who like dogs or maybe users who watch porn online you know we have partners like Google DBM access and many more we have quite a bit
worldwide user coverage in we process data
fool more than 1 billion unique users in total we have 1 of the biggest user coverage Russia and Eastern Europe for example fresh it's about like 75 per cent of all users to having said all that you can see that we have a lot of data to process and we consider ourselves the data-driven company award at the data company like some people like to call it now so what exactly is the data there is actually a great quote by Dan Ariely about the data the data is like
teenage sex everyone talks about it nobody really knows how to do it everyone thinks everyone else is doing it so everyone claims that doing the nowadays actually BJT mostly marketing interim reward buzzword actually there is even a tendency off arguing
like how much data is being generated in Plano different people tell different things in reality of course on if you have real big data like Google something but to keep it simple for the rest of people the data can be probably considered big if it doesn't fit into 1 machine or it can be processed by 1 machine or it takes too much time to process by 1 machine but the last window can also be a sign of peak problems cold not big data now that we figured out that we probably have a D data problem we need to solve it somehow this is where Apache Hadoop comes into play
Apache Hadoop is a framework for distributed processing all evolved large datasets across clusters of computers it's often used for batch processing and this is a use case where it really shines it provides linear scalability which means that if you have twice as many machines jobs to run twice as fast and if you have twice as much data job to twice as slow he doesn't require supercool expensive hardware it is designed to work on unreliable machines that are expected to to fail frequently doesn't expect you to have knowledge of interprocess communication or threading obvious your network programming and so on because spellings ecution of all of the whole cluster is handled through transparent and who has a
giant system which includes a lot of projects that are designed to solve different kind of problems and some of them are listed on this slide what is we didn't 15 HDFS and MapReduce are actually part of the ecosystem body part of a group itself in will talk about them on the next slides In will also discuss big which is a high-level language for parallel data processing unit measure had to I won't talk about the others because we simply don't have time for it
so if you're interested you can do this for yourself FIL
HDFS stands for Hadoop Distributed File System it just stores files and folders chunks files into blocks and blocks of scattered randomly all over the place by default block 64 megabytes but this is configurable and it also provides a replication of blocks by default 3 replicas of each block are created but it's also configurable defense doesn't allow to any files on the create read and delete because it is very hard to you know implemented any functionality in distributed system with applications so what they did was just you know why why bother and implemented editing files when we can just make the nite people and it provides a command-line interface to HDFS but the down said the
downside of this that it is implemented in job and it needs to speed up the JVM which takes up from 1 to 3 seconds before common can be executed which is a real pain especially if you're trying to write some scripts and so on but thankfully too great guys
modified there is an alternative called snake bites its energy affairs client we compute Python it can be used as a library in your Python scripts common blank lines it communicates with how do we we are perceived which makes it amazingly fast much much faster than native of command line interface and finally it's a little bit less to type it secure comment so python for the win but there is 1 problem snake bite doesn't handle write operations of the moment so while you're able to make me to operations like moving files and then you can try filed to age differences in snake bites but it student very active development so I'm sure this will be implemented some point
this is an example house nearby can be used as a library in Python scripts it's
very easy we just in Brookline connected do and start working with defense the really amazing simple there is
also think all ofyou using a web interface the analyzing data with do it provides for some HDFS file browser this is how it looks like you can do everything that you can do 3 native fish defense Common command line interface using here it also has a job browser the design forward for jobs so you can develop excretes and ball high queries and lot of a lot of most of its support Suzuki Peruzzi and many more and I'm going to details about you because again we don't have time for for this but this is a tool that you laugh if you don't use it to try name by the way it's made of on top of Python and Django so again by then for the win so now when we
know how do stores its
data we can talk about MapReduce it's a pretty simple concept the mapper saying users and you have to code both of them because they are actually doing data processing what matters basically duties the log data from HDFS the transformed FIL all prepared this somehow and output a pair of key and value matters output thing goes to reduce but before that some magic happens inside had to and members output is by key this allows you to do stuff like aggregation counties searching and so on and the reduced so what you get in the reduced so is the key and all values for that key and after all reduces complete the output is written to HDFS to actually the workflow between mappers and reducers these will be more complicated there also but those also shuffle phase sort sometimes 2nd resort to combine as petitioners and load of different other stuff but we won't discuss that at the moment is a matter for us it's perfectly fine to consider that there is just a on the mappers and reducers sensor magic is happening between them now let's have a look at the example of MapReduce will use the canonical work on example that everybody uses so we have a text used as an input which consists of 3 lines by cool Hadoop a school and job is bad this text will be processed by the where you agree to be used as an input which consists of 3 lines so the 2 bourses line-by-line like this and inside a matter blind was put it into words like this 2 for each word in the map function then functions will return a digit 1 and it doesn't matter if we made this these were twice or 3 times which is not way digit 1 then some magic happens provided by do and inside the reduce would get all values for forward group by these words so we just need to use some of these values in the reduced to get the desired output this this may seem an intuitive for complicated 1st but actually it's perfectly fine and when
you when you just starting to do mapreduce so you have to make your brain thinking in terms of MapReduce and after you get used to it the it's although become very clear so this is the final result for our job now let's have a look
at how our previous work on example will look like in
Java now you probably understand why earns a much much money when you code in Java because more typing means more money and you can imagine like how how much code you should drive for it we word cases and job so now after you've been impressed by the simplicity of and let's talk about how we can use Python we do I do doesn't provide a way to
to work with Python nature two-week uses a thing called Hadoop the idea behind behind this string thing is that you can supply and executable to Hadoop as a matter already use it can be standard unit tools Unix tools like cattle you need or whatever old Python scripts all perl scripts still rugby on the beach you would like whatever you like to the the executable must read from standard and write standard out this is a gold for mapper and reducer the matter is actually very simple we just read from standard input so line by line and was put into in out of the way in which it was using a tap as a default separated because the it's a default entity greater you can change it if you like so 1 of the disadvantages of fusion string interact directly is there's uh don't want to to the reduced um I mean it's it's it's not grouped by it's it's coming line by line so you have to figure out the boundaries between the key used by itself and this is exactly what we do here in varied user we're using in a group by and in groups multiple word count pairs by words and it creates an iterator that regions consecutive keys and the group so the the 1st item is the key in the the value the 1st item of a value is also the key so we just feel to it we use an underscore for it and then we cast In a value to reading to sum
up some of and it's it's pretty awesome compared to how much you have to type in jobless but it's the maybe like a little bit more and not be complicated because of the manual work and the reduced this is a common
which sense of MapReduce job to hadoop their Hadoop-streaming
and we need to specify Hadoop Streaming John M. depositary mapper and reducer using the method and arguments and input and output 1 interesting thing here is to to file arguments where specify the posture MapReduce again and we do that too to make a due to understand that we wanted to all these 2 files to the whole cluster to it's called Hadoop Distributed cash it's a it's a place which stores all files same resources that are needed to run a job and this is really cool thing because imagine like you have a small cluster of for machines and you just reality pretty cold job manuscript for a job you use the next library which is not installed on your cluster obviously so you if you have like fashion you can blogging into every machine and install this library by hand but what if you have the cluster like of 100 machines for the wife of 1 thousand machines they just won't work anymore of course you could create some some best creep something that could do the automation for you but why do that if Hadoop already provides the way way to do that so you just specify what you wanted you to to copy to the whole clustered before the job will start and that the man also after the job complete and you will delete everything and your cluster will be in its initial state again it's pretty cool after our job is complete we get the desired results so Hadoop Streaming is really cool but it requires you to do the little bit of extra work in though it's still much simpler compared to job well we can simplify it even 1 with the help of different path and frameworks for what working with Hadoop so it's due the overview of them the 1st 1 is done below if it was 1 of the earliest Python frameworks will do but for some reason it's not developed more there's no support no downloads so just let's forget about it there is added to the Dubai and it's the same situation as with down below the the project seems to be abandoned and there are still some some people trying to you to use it according to you by the i downloads so we use you want you can also tried I don't know there is applied to it's a very interesting project while other projects I just wrappers around Hadoop Streaming 5 visa use a thing called Hadoop Pipes which is basically a C + + API to Hadoop and it makes it really fast will will see this is also a Luigi
project it's also you could lead to was developed at the sporophyte is maintained by sporophyte its distinguishing feature is that he has the duty to build complex pipelines of of of jobs and support many acknowledges which can be used to run the chops and there is also a thing called an odd jobs it's the most popular Python framework for working with Hadoop read was developed by the Open Source a cool but there are some things to keep in mind while we're working with so we talk about by deeply EEG and in my job In models next slide
so sold the most popular frameworks is called MapReduce job or MR job on Mr. job like some people like to call it so I was like this from Mr. Jaffe is a wrapper around Hadoop streaming and it is actually developed by the Open maintained by and inside this is how our work account example can be written using Mr. Joppa it's even more more comfort so while while map looks actually the same as in the the role of Hadoop streaming just notice how much typing we saved in the reducer but behind the scenes actually Mr. Joppa is doing the same group by aggregation which is sold previously in the and its training examples but as I said there are some things to keep in mind Mr. job uses socalled Chronicles for for data serialization deserialization and between phases and by forward use Jason protocol which itself uses Python's Jason library which is kind of a slow and so the 1st thing you should do is to install simple Jason because it is faster war starting from Mr. job 0 0 . 5 . 0 which i think still in development it
supports Alton Jason library which is even more faster through this is how you can specify this of tradition particle and again this is
available on the starting from 0 . 5 1 0 lower versions use simple Jason which is slower the Mr. job also supports love article which is the fastest particle available but you have to take care about the realization this realization by itself is as shown in this slide so notice how we we cost 1 to stream in the mapper and some to streaming a reducer also we the introduction of Fulton Jason 18 in the the next version of Mr. job I don't think there is a need to
use this what particles because they're not so much faster actually compared to alter Jason in at least most of the time of course it depends on the job in so you have to experiment for for yourself and see what the is the is best for you so Mr.
Jefferson const 8 in my opinion it it has like that's documentation compared to other path and frameworks it it has the best integration we Amazon's you mine which is Elastic MapReduce end can compared to other Python framework because yeah uses that you did operates inside the Marcel it's understandable it has very active development begins community it provides really cool cool local testing without had which which is very convenient of well during development and he also automatically outlawed itself to cluster indeed it supports multi-step jobs we which means that what 1 job 1 job that will start only after the 2nd another 1 is successful finished where you can also use bash acuity so chart files or whatever in this modest workflow the the only downside that I can think of these various lol serialization digitalization compared to Python stream but compared
to how much typing saves you we can probably from the rate for that so this is not not really because the next the in our list is
Luigi wages also air wrapper around streaming and it is developed by sporophyte this is how our work out example can be written using the region is we'll be more variables compared to Mr. job just region concentrations mainly on the total work for all and and and not on the on on on a single job and it also forces you to to define your input and output inside a class I and not from a common line interface as for for the map and reduce implementation then absolutely insane 4 minutes left on I have so much to the 4 minutes OK OK so so leaves you also has this problem with the realization this realization and all you also have to use all to Jason just just use ultra Jason and everything will be cold and OK so will probably
keep that it's also Coolidge's cool and but not so good for local testing and we also keep
ideas OK OK OK man
alright alright OK benchmarks so
this is the most important part their here this is probably where a lot of people of it before the benchmarks um so we this is a cluster and and software and have that I used to do the benchmarks and so the the the job was a simple word count on the well known book about the Python by Mark uh my lots and and and multiplied at 10 thousand times which gave me 35 gigabytes of data in I also use combining between a map and Reduce phase so we combine is basically a local reduce which just runs after in them and that phase and it is kind of an optimization so this is the um this is
stable um jobless fastest of course there no surprise here too it is it is used as a baseline for performance all numbers for other frameworks are are I just reaches and relative to jump values so for example we have a job runtime for for like what 100 in 87 seconds which is 3 minutes and something and to get the number 4 by duping each multiplying 187 by 1 . 86 which will give you 387 47 seconds disease almost 6 minutes so each chop I rented job 3 times and the best time was taken in the so it's diff discuss a few things about this this performance comparison so do this is the 2nd after Java because it uses these Hadoop hadoop Pipes C + + API it still takes almost twice as slow as compared to the native job but another thing that may seem strange is the 5 . 9 7 ratio in the reducing projected so it looks like the combined student-run but there is an explanation to that implied of manual the it it says the following 1 thing to remember is that the current Hadoop Pipes architecture runs the combiner under the hood of the executable 1 by pipes so we doesn't have the
the combiner counters of the general Hadoop framework to to his wife why we have this but then can speak
actually felt that it should be the 2nd of 2 jobs and before I ran this benchmarks but and fortunately I didn't have have really time to investigate the reasons 2 edges counts EYE slower because people think translates itself and the jobless social training almost as fast as just they then comes up Ross streaming under Python pipeline and you probably maybe may be surprised that pipeline no OK you have any questions so I just can continue the audio data so yes so it is actually I'm I'm speaking for a half a half an hour and this is a 45 In this talk so it has to have 15 minutes out of their own it's I know so that's no questions that is so get there by by yes you you probably probably maybe a bit
surprised that by slower but actually the thing is that it's the way the word count is there really simple simple job and applied twice these um this is is currently slower than C when dealing with reading and writing from standard standard out so it really depends on the job in in a real world use cases pipelines actually a lot more faster than C Python so what we usually do we
implemented a job and then then we just trying to on pipeline intensify fan and see what's the difference and like I said in most cases by by wins so just just drive for yourself and see what it's best for you um the incomes Mr. job some and then as you see ultra Jason is just a little bit slower than then this rock particles and but it it saves you the saying of dealing we manual work soldiers I think he is altered Jason finally Luigi which is much much slower even with or without education and Mr. job and I don't want you to to talk about this terrible performance using its default serialization scheme so OK we still have a little like not 15 minutes so I
can probably return back OK so
we stopped it I think this part of this path yes this
1 so I it will lead you as as we just saw on some users stadium but by default it uses this is the real serialization scheme which is really really slow so this is how you can then then switch to to to to Jason end I didn't really have time to investigate also but um after after switching to 2 adjacent I needed to specify an encoding by by hand so I don't know I don't have something to to keep in mind In and don't forget forget to install ultra Jason because by default uh lead falls back to the standard libraries Jason which is slow so OK pricing cons and the EEG is the
only real framework that concentrates on the on the workflow in general and it provides a central scheduler which has a nice dependency graph of of the whole workflow and each records all the all the tasks and all the history so we it can be really useful and it is also in very active development in and it has a big community not as big as Mr. job but still very big um it also automatically upload itself to cluster and and this is the only framework that has integration with snake bites which is also just believe it it provides not so good local testing compared to Mr. job because you need to 2 2 2 meanings and map and reduce functions by yourself in the Ron method which is not very convenient and it has the worst serialization and this realization performance even without adjacent which
to the the lost of Python frameworks that I wanna talk about supply due like the others it doesn't trap Hadoop streaming of but he's had the pipes it is developed by C R S 4 which is essential for Advanced Science Research and Development in studying in Italy and this is again an example for the word count in implied reach which looks very similar to Mr. job but unlike Mr. job or Luigi you don't need to think about different serialization and serialization schemes just concentrate on a map person produces on your code and just do job to call OK so
prussic answers OK OK and members and so but it has pretty good condition in it's it can be better but it generates it's very good and due to you the use of how do do pipes it is amazingly fast but it also has a has has a active development arm and it provides an HDFS API based on the media affairs library which is cool because it is faster than the native had you fast common blank line but it is still slower than snake bites I benchmark these but support if I guys claims that it's always
so um and it is slower because it still needs to do to spin up an instance of JVM so I can't believe them that's a but fast and that this is
the only framework that gives an ability to be implemented and corrected redirected rider petitioner impute Python this was some kind of advanced cont'd concepts so we we want discuss them and but the ability to do that is really cool the the biggest colonies that by p is very difficult to install because it is written in the Python and Java to you you have to have all the needed dependencies um bless you need to correct this set some
environmental variables and and so on and I saw a lot of posts on StackOverflow in from other sites where people just just got stuck installation process and probably because of that by
Duke has a much smaller community so the only place where you can ask for help this there is a key top repository of by the you know the the authors are really they're helpful they're cool guys so did you get the answer to all the questions and so on and it also you had a but by G doesn't upload itself to cluster and like other parts and frameworks do so you need to do you to do this manually and it's not that not not not so trivial process if you just starting through to work with Hadoop so this is a I so the the
key is it is an Apache project is a high-level platform for the analyzing data he eat it draws on top of Hadoop but it's
not limited to do and this is the word count example using big um the will be translated to map and reduce jobs behind the scenes for you and you just you you you don't have to think about like with what is my map what is my reduce reduces you just write your your peak scripts and also in most of the time and in in in a you know real world use cases big big is faster than Python so this is this is really call it is very the language which you can can learn in a day or 2 or something it it provides a lot of functions to work with data to future reading and and so on but and and and the the biggest thing is that you can extend it functionality with Python Using Python you can write them in in C Python which gives you exist to Molly but the slower because and the trans runs as a as a separate process and sends and receives data we as training and you can also use Ji which is much much faster because it compiles UTS to job and you don't need to leave the JVM to execute but you don't have access to libraries like number line and no sigh part and so on so yeah this is an example of the QDA have and for so
again it you data from an IP address using the well-known library and from max mind it it may seem complicated 1st but it's not actually um so the in the in the in the in the giants part at 1st we we import stuff some stuff from July and in the library itself then we instantiate the reader object and defined the idea which is which is simple and it except the IP address as the only parameter and then tries to get a confocal and cities June name from a amongst mind database it is also degraded by the by the peaks of output scheme integrator and you need to specify the the output of the UDF because because statically typed and as for the then we would put this UDF into the file called you by an end in as for the peak part we need to register this year 1st and then we can simply use it as shown like here so it's it's it's really a simple consequence you get used to it yeah there is also a
single and the peak the of this 1
so we we already saw benchmarks conclusions
so for complex workflow
organizations job training and HDFS manipulation Luigi and snake bites this is yeah this is the use case where they really shines snake bites this is the fastest option out there to work with age difference but you
you have to fall back to native 100 common line interface of course if you need to write something to HDFS but just don't use Luigi for actual MapReduce implementation at least and until performance problems 1 the fixed for writing at
lightning speed MapReduce jobs and if you
are freedom of difficult in the beginning he's spider man and pigs this us to to 2 fastest options how they accept the job the problem with peak is that it's not Python so you have to learn it its new technology to learn but but it's worth it and applied to while maybe and very difficult to start using it because of the problem so installation and and so 1 is the fastest Python options so
integers inability to to to implement check readers
and writers in Python which is priceless for
development local testing or perfect Amazon's my integration use Mr. job it provides the best integration with mark it also gives you the best local testing development experience compared to other parts and frameworks um so
in the conclusion I would like to say Weiss Python has them really really good integration with Hadoop it's provides us with great libraries to work with Hadoop well that the speed is not that great of course compared to job but will of Python not for its speed but for its simplicity and ease of use and by the way if you are wondering about what is the most frequently used words in in my lots Brooklyn Python and without count and things like prepositions conjunctions and so on this word was used 3 thousand 979 times in this work is of
course Python so this is all I got and deconfined slides and
cold and values for the benchmark some slight change departed help so thank you be filled
at a time and
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation
Computer animation


  632 ms - page object


AV-Portal 3.8.2 (0bb840d79881f4e1b2f2d6f66c37060441d4bb2e)