
PySpark - Data processing in Python on top of Apache Spark.



Formal Metadata

Title PySpark - Data processing in Python on top of Apache Spark.
Title of Series EuroPython 2015
Part Number 115
Number of Parts 173
Author Hoffmann, Peter
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
DOI 10.5446/20183
Publisher EuroPython
Release Date 2015
Language English
Production Place Bilbao, Euskadi, Spain

Content Metadata

Subject Area Computer Science
Abstract Peter Hoffmann - PySpark - Data processing in Python on top of Apache Spark. [Apache Spark] is a computational engine for large-scale data processing. It is responsible for scheduling, distributing and monitoring applications which consist of many computational tasks across many worker machines on a computing cluster. This talk will give an overview of PySpark with a focus on Resilient Distributed Datasets and the DataFrame API. While Spark Core itself is written in Scala and runs on the JVM, PySpark exposes the Spark programming model to Python. It defines an API for Resilient Distributed Datasets (RDDs). RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are immutable, partitioned collections of objects. Transformations construct a new RDD from a previous one. Actions compute a result based on an RDD. Multiple computation steps are expressed as a directed acyclic graph (DAG). The DAG execution model is a generalization of the Hadoop MapReduce computation model. The Spark DataFrame API was introduced in Spark 1.3. DataFrames evolve Spark's RDD model and are inspired by Pandas and R data frames. The API provides simplified operators for filtering, aggregating, and projecting over large datasets. The DataFrame API supports different data sources like JSON data sources, Parquet files, Hive tables and JDBC database connections.
Keywords EuroPython Conference
EP 2015
EuroPython 2015
Hi everybody, thanks for the introduction. You can find the slides online after the talk. First a little bit about me and the focus of my talk. I work for a company that provides predictive analytics as a service; with more than 100 data scientists we have one of the biggest data science teams in Germany, and we are building a platform to run our machine learning algorithms on. We are also present here at EuroPython, and you still have the chance to see three more talks from my colleagues: tomorrow Moritz is talking about testing and fuzz testing, and Christian and Florian are giving talks as well. So let's start.
So, what is Spark? Spark is a distributed, general-purpose computation engine. It has APIs for Scala, Java, R and Python, and it is mostly used for machine learning and distributed computing. It has one core API, the Resilient Distributed Dataset (RDD), and all other APIs sit on top of this core. Spark runs on a cluster, on multiple machines, and you can use different schedulers to run it: the standalone scheduler, YARN, or Mesos. On top of the core sit several libraries. The most important one is Spark SQL with the DataFrame API; there is Spark Streaming, which does stream computing based on micro-batches; there is MLlib for machine learning; and there is GraphX for graph processing. Spark itself is written in Scala and runs on the Java virtual machine. It is responsible for memory management, for fault recovery, and for interaction with other storage systems. Spark sits on top of Hadoop, so it can access every data source that HDFS provides.

The core abstraction of Spark is the RDD, the Resilient Distributed Dataset. An RDD is a logical plan to compute a dataset based on other datasets. RDDs are fault-tolerant, so the system can recover from the loss of single nodes in your cluster or from failures in parts of a calculation: Spark will re-run the affected calculation and recover from the machine failure. There are two basic ways to interact with RDDs. The first is transformations: a transformation takes one or more RDDs as input and produces an RDD as output. Transformations are always lazy; they are not calculated on the fly, but only when you call an action. An action is the last step in the calculation plan, where you want the data back: you can take some rows, collect everything, or count the result, and only then is the calculation really run. Spark tries to minimize data shuffling between the nodes in your cluster, and in contrast to Hadoop MapReduce it doesn't write all intermediate results to the file system but tries to keep them in memory. If your data fits into the memory of your cluster, Spark is therefore much faster than traditional Hadoop MapReduce.
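The transformation/action split can be illustrated with a small plain-Python sketch. This is a toy model of the evaluation strategy, not the PySpark API; the class and method names are made up for illustration:

```python
class LazyDataset:
    """Toy model of an RDD: records transformations, runs them on an action."""

    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []

    def map(self, f):
        # Transformation: nothing is computed yet, we only extend the plan.
        return LazyDataset(self.data, self.ops + [("map", f)])

    def filter(self, pred):
        # Also lazy: the predicate is recorded, not applied.
        return LazyDataset(self.data, self.ops + [("filter", pred)])

    def collect(self):
        # Action: only now is the recorded plan actually executed.
        out = list(self.data)
        for kind, f in self.ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

pipeline = LazyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has run so far; calling the action triggers the whole computation.
result = pipeline.collect()  # [0, 4, 16, 36, 64]
```

In real Spark the recorded plan is additionally a DAG over partitions, which is what enables recomputing only the lost pieces after a node failure.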
If you combine multiple transformations on RDDs you get a lineage graph: starting from the partitions of your input data you chain transformations one after another, and Spark tries to group these transformations together and, where possible, run them on the same node. Many transformations are element-wise, meaning they work on one element at a time, but operations like groupBy and join work on multiple elements. As said earlier, actions are then used to get the results back and return them to your driver program.

The traditional MapReduce programming model only has map and reduce, while Spark has many more transformations. It has the map and the reduce computation, but it also has things like flatMap and filter as simple functions; you can do unions and intersections of multiple datasets, group data by key, aggregate by key, and do full, left and right outer joins of datasets. What's important is that Spark knows the partitions of your input files and the data locality of those partitions, because it always tries to run your calculations where the data is: Spark tries to bring the algorithm to your data and to minimize shuffling data around in your cluster. So an RDD consists of a set of partitions, the atomic pieces of your dataset; a set of dependencies on parent RDDs; and functions which compute the RDD based on its parents. Spark needs to know the metadata of your data, where it is located, to be able to do data-local computation, so that the shuffle, which is expensive and will really slow down your calculations, is only done when necessary.
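To make the key-based transformations concrete, here are the semantics of reduceByKey and an inner join sketched in plain Python. This models what the operations compute, not how Spark distributes them:

```python
from collections import defaultdict

def reduce_by_key(pairs, f):
    """What RDD.reduceByKey computes: fold the values of each key with f."""
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return sorted(out.items())

def inner_join(left, right):
    """What RDD.join computes: (k, (v, w)) for every matching key pair."""
    by_key = defaultdict(list)
    for k, w in right:
        by_key[k].append(w)
    return [(k, (v, w)) for k, v in left for w in by_key[k]]

counts = reduce_by_key([("a", 1), ("b", 1), ("a", 2)], lambda x, y: x + y)
# counts == [("a", 3), ("b", 1)]
joined = inner_join([("a", 1)], [("a", "x"), ("a", "y")])
# joined == [("a", (1, "x")), ("a", (1, "y"))]
```

In the distributed setting both operations require a shuffle, since all values for one key must end up on the same node; that is exactly the expensive step Spark tries to minimize.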
As said earlier, Spark is implemented in Scala and runs on the Java virtual machine. So what is PySpark? PySpark is a set of bindings, an API which sits on top of the Spark programming model and exposes it to your Python programs. Here is the famous word count example. You always start with some input, a basic file system operation: you load a text file from HDFS, and then you do the classic MapReduce steps: you split the lines by whitespace, you map each word to the number one, and then you do a reduce step where you calculate the count of each word.
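The word count pipeline just described looks roughly like this in PySpark (a sketch: the HDFS path is a placeholder, and a live SparkContext `sc` is assumed to be passed in):

```python
def word_count(sc, path):
    """Classic word count: split lines, emit (word, 1), sum per key."""
    return (sc.textFile(path)                      # RDD of text lines
              .flatMap(lambda line: line.split())  # RDD of words
              .map(lambda word: (word, 1))         # RDD of (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))    # RDD of (word, count)

# Usage on a real cluster (path is hypothetical):
# counts = word_count(sc, "hdfs:///data/book.txt")
# counts.take(5)
```

Everything up to reduceByKey is a lazy transformation; only an action like take or collect makes the job run.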
As Python is dynamically typed, your RDDs can hold objects of different types; that's not possible in the Java version, but the Scala API also has this possibility. At the moment PySpark does not support all the APIs that are supported in Scala: for the DataFrame API nearly everything is provided, but for streaming PySpark is always one or two releases behind the Scala API. Here you see how it is put together: you always have a driver on your local machine, for example an IPython shell running your Python program. This connects to a SparkContext, which talks via Py4J to a JVM on your host, which in turn talks to the workers; and each worker again talks to Python processes or stays in the JVM, depending on what kind of calculations you do.
On top of the RDDs sits relational data processing in Spark. That's a relatively new API: it was added with Spark 1.3 only a few months ago. It lets you work on a higher level, with declarative queries and optimized storage engines. It provides a programming abstraction called DataFrames, and it also acts as a distributed SQL query engine. What's really nice is that the query optimizer works the same for Scala and Python, so your Python programs gain the same speed as you would get in Scala. The DataFrame API provides a rich set of relational operations, and you can interact with it through different interfaces: through the DataFrame API itself, through the normal JDBC interface so that external tools can connect, directly from user programs in Python, Java and Scala, and you can also switch between the DataFrame API and the low-level RDD API.
So what is a DataFrame? A DataFrame is a distributed collection of rows grouped into named columns with a schema. It is a high-level API for common data-processing tasks: projection, filtering, aggregation and joins, and it adds metadata handling, sampling and user-defined functions. You can define your user-defined functions in Python and use them in SQL statements and queries. As with RDDs, DataFrames are executed lazily: each DataFrame object only represents a logical plan of how to compute the dataset, and the computation is held back until you really call an action on the DataFrame. You can see a DataFrame as the equivalent of a relational table in Spark SQL, and you create it through various functions on the SQLContext. Once you have created it, you can operate on it with a declarative domain-specific language. Here we just load a people.json file, which has one JSON document per row, and then you can do filtering, selection and projection, and get the data back.
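In code, loading JSON and querying it declaratively looks roughly like this. This is a sketch against the Spark 1.3/1.4-era API; the file name and the column names are assumptions:

```python
def adults(sqlContext):
    """Load a JSON dataset and build a declarative query over it."""
    # Each line of people.json is one JSON document; the schema is inferred.
    df = sqlContext.read.json("people.json")
    # Declarative DSL: projection and filtering only build a logical plan;
    # nothing runs until an action such as collect() or show() is called.
    return df.select(df["name"], df["age"]).filter(df["age"] > 21)
```

The returned object is again a DataFrame, i.e. a plan, which Catalyst is free to optimize before execution.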
If you compare these two statements, the first one in the declarative Python DSL and the second one in SQL, they result in the same execution plan inside Spark. So whether you write declarative Python, SQL, or Scala, you get the same speed. This is possible because Spark has Catalyst, a query optimization framework that serves all languages using the DataFrame API. It is implemented in Scala and uses features like pattern matching and runtime metaprogramming to let developers concisely specify complex relational optimizations. As you can see on the chart from the Spark website, if you work with plain RDDs the Python version is always slower than the Scala version, but if you sit on top of the DataFrame API and only use declarative statements, you get the same speedup as in Scala.
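The two equivalent spellings of one query can be sketched like this (assuming a DataFrame df with name and age columns; the temp table name is made up). Both compile through Catalyst to the same plan:

```python
def query_both_ways(sqlContext, df):
    """Express one query in the DataFrame DSL and in SQL."""
    # 1) Declarative Python DSL
    dsl = df.filter(df["age"] > 21).select(df["name"])
    # 2) Plain SQL over a registered temp table; same execution plan
    df.registerTempTable("people")
    sql = sqlContext.sql("SELECT name FROM people WHERE age > 21")
    return dsl, sql
```

Calling .explain() on either result (on a real cluster) would show the identical optimized plan.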
So how do you get at your data? As I said before, Spark works on top of Hadoop, so you can access all the Hadoop file systems and drivers that are available. This is called the data source API: you can talk to Hive tables, read in raw files, CSV files and JSON files, read and store data in the Parquet columnar format, and connect to normal JDBC databases. I'll go into a little more detail on the Parquet data format, because I think it's a really great way to store data for Spark. Parquet is a columnar format which is supported by many data processing systems, and you can store Parquet files in a Hadoop HDFS file system; Parquet keeps the schema of the original data. As you can see here, if you have a table with three columns, row-oriented storage writes row after row, while column-oriented storage saves your data in column order. That has several advantages. First, the data within one column is usually similar, so compression works much better when you encode data column-wise. Second, if your data has many columns and you don't want to access all of them every time, it is much faster to read just some columns at a time. The DataFrame API is able to do predicate and projection pushdown: if the underlying storage engine supports vertical or horizontal partitioning, the DataFrame API can push the predicate down to the storage engine, so Spark doesn't have to read all the data; it lets the storage engine do the hard work. In the example here, with vertical partitioning you only want column B, and you have a predicate on some rows, say you only want the rows where A equals 2, so the storage engine can split this up and only the matching result is read into Spark for further processing.
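Writing to and reading from Parquet, where column pruning and predicate pushdown then apply, can be sketched like this. The path and the column names are placeholders, and the Spark 1.4-era DataFrame reader/writer API is assumed:

```python
def pushdown_query(sqlContext, df):
    """Round-trip a DataFrame through Parquet and query it selectively."""
    # Store columnar: each column is compressed and readable independently.
    df.write.parquet("hdfs:///data/events.parquet")
    loaded = sqlContext.read.parquet("hdfs:///data/events.parquet")
    # Projection: only column b needs to be read from disk, and the a == 2
    # predicate can be pushed down into the Parquet reader.
    return loaded.filter(loaded["a"] == 2).select(loaded["b"])
```

The point is that the filtering happens in the storage layer where possible, instead of loading everything into Spark first.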
The DataFrame API not only supports flat data: besides basic types like numeric, date and byte types, it also provides support for complex and nested types, so you can build tree-like datasets with the DataFrame API. A DataFrame always has to have a schema, and there are two ways to get the schema into your DataFrame. The first is schema inference: for self-describing input data like JSON files, Spark can infer the data types of your DataFrame. Alternatively, you specify the schema yourself: you define a StructType with several StructFields, giving the name and the type of each field. Let's look at the important classes of Spark SQL and DataFrames. The SQLContext is the main entry point for the DataFrame and SQL functionality. The DataFrame is a distributed collection of row data. Columns are typed expressions which you use to work on your DataFrame, a Row is a row in a DataFrame, GroupedData is what you get back from aggregation methods like groupBy, and the types describe the schema.
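Specifying a schema by hand instead of relying on inference looks roughly like this. The field names are made up for illustration, and pyspark.sql.types is used as in the 1.3-era API:

```python
def people_schema():
    """Build an explicit schema: one StructField per column."""
    # The pyspark import is inside the function only so that this sketch
    # can be loaded without a Spark installation present.
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, IntegerType)
    return StructType([
        StructField("name", StringType(), False),  # not nullable
        StructField("age", IntegerType(), True),   # nullable
    ])

# Usage (assumed names):
# df = sqlContext.read.json("people.json", schema=people_schema())
```

An explicit schema skips the inference pass over the data and catches type surprises early.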
So when you have a DataFrame you know what your data looks like, and you can select, filter, group by and work on the data, not on your local machine but on a cluster. That's what I want to show you now with a little example using the GitHub Archive. It stores all the public events that happen on GitHub; for the last half year that's many gigabytes of JSON data, about 17 million events. All my colleagues said don't do it, but we'll try it anyhow and do a live demo connected to the cluster.

The font is a little bit too small, but I think we'll get through it; I'll stick to simple programming statements. We always start with a SQLContext, our entry point, connected to the cluster; the cluster has four machines with 40 cores each and about a terabyte of RAM in total. First we just read a single text file, one hour of one day of JSON events, into the cluster. If we dump it, we see it's newline-delimited JSON, one document per line, which is no problem because Spark can work with that. Now we read it in as JSON so that Spark detects the schema. You can see that for each event we have an actor, the person who did something on GitHub, we have the repository, the version, some payload, and the type of the event, like a pull request or a watch event; Spark detected this schema automatically, so we didn't have to do anything. Now let's read not just one hour of one day but all the events of the last half year, and count how many events we've got: all from my MacBook, which doesn't have that much memory, working on 17 million events. Let's have a look at how many watch events the Apache Spark repository got in the last half year: roughly 60 thousand. And again, the top repositories by watch events. All the calculation is done on the cluster, and only the result comes back to my notebook; that's pretty cool, because you work with much bigger machines and don't have to do the calculations locally. You can also register the DataFrames for SQL, so you can run normal SQL statements instead of using the Python declarative language.
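The demo boils down to something like the following sketch. The path, the column names and the event type string follow the GitHub Archive layout as described in the talk and are not verified here:

```python
def top_watched_repos(sqlContext, path, n=10):
    """Count GitHub WatchEvents per repository and return the top n."""
    # One JSON document per line; Spark infers the schema of the events.
    events = sqlContext.read.json(path)
    events.registerTempTable("events")
    # The aggregation runs on the cluster; only n rows reach the driver.
    return sqlContext.sql("""
        SELECT repo.name AS repo, COUNT(*) AS stars
        FROM events
        WHERE type = 'WatchEvent'
        GROUP BY repo.name
        ORDER BY stars DESC
        LIMIT {n}
    """.format(n=n))
```

Calling .collect() or .show() on the result would trigger the distributed computation.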
Now a little summary. What is Spark? Spark is a distributed, general-purpose cluster computation engine, and PySpark is an API to it. The Resilient Distributed Dataset is a logical plan to compute your data. DataFrames are a higher abstraction on top of that: a collection of rows grouped into named columns with a schema, and the DataFrame API allows you to manipulate DataFrames with a declarative domain language. Thanks for your attention. Are there any questions?

[Q&A] The main point of what I wanted to show was the live demo against the cluster; that's what I really like: working against a cluster with a terabyte of RAM is fun. If there are no more questions: if anything comes up later, come to our company's other talks or find me around. Thanks.

