
Big Data Analytics with Python using Stratosphere



Formal Metadata

Title Big Data Analytics with Python using Stratosphere
Title of Series EuroPython 2014
Part Number 94
Number of Parts 120
Author Schepler, Chesnay
License CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI 10.5446/19953
Publisher EuroPython
Release Date 2014
Language English
Production Place Berlin

Content Metadata

Subject Area Computer Science
Abstract Chesnay Schepler - Big Data Analytics with Python using Stratosphere. Stratosphere is a distributed platform for advanced big data analytics. It features a rich set of operators, advanced, iterative data flows, an efficient runtime, and automatic program optimization. We present Stratosphere's new Python programming interface. It allows Python developers to easily get their hands on Big Data. Stratosphere is implemented in Java. In 2013 we introduced support for writing Stratosphere programs in Scala. Since Scala also runs on the JVM, the language integration was easy. In late 2013, we started to develop a generic language binding framework for Stratosphere to support non-JVM languages such as Python, JavaScript and Ruby, but also compiled languages such as C++. The language binding framework uses Google's Protocol Buffers for efficient data serialization and transportation between the languages. Since many data scientists and machine learning experts use Python on a daily basis, we decided to use Python as the reference implementation for Stratosphere's language binding feature. Our talk at EuroPython 2014 will present how Python developers can leverage the Stratosphere platform to solve their big data problems. We introduce the most important concepts of Stratosphere, such as the operators, connectors to data sources, data flows, the compiler, iterative algorithms and more. Stratosphere is a mature, next-generation big data analytics platform developed by a vibrant open-source community. The system is available under the Apache 2.0 license. The project started in 2009 as a joint research project of multiple universities in the Berlin area (Technische Universität, Humboldt Universität and Hasso-Plattner Institut). Nowadays it is an award-winning system that has gained worldwide attention in both research and industry.
Keywords EuroPython Conference
EP 2014
EuroPython 2014
The next talk is going to be held by Chesnay Schepler, about big data analytics with Python using Stratosphere.
OK, so one thing that I have to raise before I start: Stratosphere has been accepted into the Apache Incubator program and had to be renamed due to name conflicts. It is now known as Apache Flink, a German term for something that is fast. So for the remainder of the talk I will refer to it as Flink, since the features that I present today will eventually be part of Flink. I will first talk a bit about what Flink is in general, and then about the new Python API that exposes some features of Flink to Python. So what is Flink? It is a platform for big data analytics written in Java. Flink builds on the MapReduce
programming model, extending it by offering a rich set of operators and automatically optimizing your programs. The project started in 2009 as a joint research
project by several universities in the Berlin area and was later transformed into an open-source project. Version 0.5.2 was the last release, and the next release is slated to follow shortly. Flink operates in the same space as systems like Hadoop and Spark.
It likewise concentrates on scalability and user-defined functions, but combines these with database techniques such as declarativity and optimization, so many decisions about how a program is executed are made by the system. Writing a program for Flink is essentially creating some kind of dataflow: you have data sources, operations on them, and of course
sinks. In this case there are data sources and then maps, reduces and joins. We do not process the whole thing at once: what an operation such as a reduce function produces from its datasets is streamed on the go into the next operation, and in some cases this holds for the whole program. At the bottom you have some
storage, where you have your data. Flink itself does not depend on any particular storage system. On top of that you have the Flink runtime and optimizer, and several APIs sit on top of those for different languages: we have the Java API, one based on Scala, Spargel for graph computation, Meteor for JSON, I believe, and the one currently being developed for Python. They all sit on top of the same optimizer and the same runtime. In summary, the key features of Flink: you have various development APIs
that you can use. When you write a program in one of them, you write the plan and the user-defined functions, while operations like grouping and sorting are actually carried out in Java. The optimizer takes care of concerns such as how a join is carried out; it takes into account where operations are carried out and tries to place subsequent operations on the same machine, so that you do not shuffle data across the whole cluster. Something that I have not touched on is that we treat iterations as first-class citizens and optimize them. So much for Flink in general; now we can have a look at the Python API. This is a word count example, where
you want to know how often each word appears in a text. What you see here is basically the plan and the user-defined functions. Let us go through it step by step. First we read the data from a text file, so every line is treated as a separate string, and then apply a map function by passing an object containing the user-defined function and the output type. Why do we need types when this is Python? Because underneath, Java follows a very strict type system. The map function splits each line into words. We then group the data by word, which partitions the resulting data, then apply a reduce to each group separately, summing up the words, and then we execute the plan, and this is when it starts. So after the call to read text, data does not contain any data; it is just an abstract representation of the data that will exist once the program is executed.
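The word count walked through above can be sketched in plain Python. This is only an illustration of the idea, not the Stratosphere/Flink Python API: the `LazyDataSet` class and the names `from_lines`, `flat_map`, `group_by`, `reduce_group` and `execute` are hypothetical stand-ins chosen for this sketch. It shows the two points made in the talk: a plan is built from a source, a map-style step, a grouping and a per-group reduce, and nothing is read or computed until the plan is executed.

```python
from collections import defaultdict


class LazyDataSet:
    """Hypothetical plan node: it records how to produce records, but holds no data."""

    def __init__(self, produce):
        # produce is a zero-argument callable that yields records when called
        self._produce = produce

    def flat_map(self, fn):
        # Each input record may yield zero or more output records.
        return LazyDataSet(
            lambda: (out for rec in self._produce() for out in fn(rec)))

    def group_by(self, key_index):
        # Partition records into lists that share the same key field.
        def grouped():
            groups = defaultdict(list)
            for rec in self._produce():
                groups[rec[key_index]].append(rec)
            return iter(groups.values())
        return LazyDataSet(grouped)

    def reduce_group(self, fn):
        # Apply fn to each group separately, producing one record per group.
        return LazyDataSet(lambda: (fn(group) for group in self._produce()))

    def execute(self):
        # Only here is the source actually read and the pipeline run.
        return list(self._produce())


def from_lines(lines):
    """Hypothetical source: every line is treated as a separate string."""
    return LazyDataSet(lambda: iter(lines))


def split_words(line):
    """User-defined function: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.lower().split()]


def sum_counts(group):
    """User-defined function: collapse one group of (word, 1) pairs into (word, total)."""
    return (group[0][0], sum(count for _, count in group))


def word_count(lines):
    plan = (from_lines(lines)
            .flat_map(split_words)
            .group_by(0)
            .reduce_group(sum_counts))
    return dict(plan.execute())


if __name__ == "__main__":
    print(word_count(["to be or not to be"]))
```

In this sketch the plan object plays the role that the talk describes for the dataset returned by the read call: it is only a description of the data, and the records exist once `execute()` runs.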
So in the grand scheme of things, this is how the whole
process works. You write a Python program; when it is executed, it creates a representation of the program, which is then fed through the Java API to create an actual Flink plan that is shipped across the cluster. The Python functionality, such as the user-defined functions, is stored as serialized data inside the plan and sent along to the runtime.
is internally need some time to catch the object for the whole Python operation which was used so so when the operators started of opened created a software process the supervisor object possible plans the topological there's something and then parts that from with the country so when dealing with the process of different languages censured types comes naturally machine so what From this we have experimentally systems currently problem we just what we do is we assume that they sometimes you just as well as established and these are converted to think top I
think I was the fixed-length types of container which this somewhat similar to the by atoms a similar to user on so the central venturously come from established in the nested currently the company gonna introduce elements of just type constraints only had come to that differently some of the most severe problems of course that we don't use of arbitrary objects just due to the fact that and wanted to properly so we could just pick modernist because the data but it's not just that starts going on and then come back but there are several use cases not necessary for itself so what list random we want system solve persons wants Johnson so In order to avoid reality programming patterns we decided to give you a lot for the 1 and only spend a bit more time on the same the support of retrieval just by the United Nations I'm from his close related problems following synch time between the processes the protocol initially we use that we problem for those for but the when you read
so for us so this has to with this structured version of so much of the job which works surprisingly well together so basically right of the box so only change across the opportunities through there it's I'm pretty much as I could afford from top of topic time and around the current restrictions for type system bond so students will also like change future but and just over this that we summarize the fields and applies a few extra data independent estimates of a huge FIL refuses we applied the type I true from the stationary be removed for a moment at a small set of times was the same but pronounced during their and their rest metabolic with which is only 1 and useful so the control from and the last element in this iterative when the loss and it's connected for small things like something from the and their size argument which represents novel findings in you so for something it's not a topic this would 1 that's special systems I'm way too fast or so tight knit united is provides a subset of the features job is you're on so you can read data from text sources of funds provide distinct objects within your plan providers Texas you was used for its
principal stands operationalize really precious support most of them in the most important missing 1 iterations couldn't get and time but most of the this for something you but is the defined as instead contrived although it's not production so for this testing it suddenly using it's 1 thing to think about for example for in the future this is a this kind of J. Java API yourself defined we and think about it it's going to compete in the job outside as something that couldn't performances similar to human performance solidifying hold standard stress told him given to to flying around we want to provide sentence in Python Python incidence of free so that all kinds of regional high and that and go to transformations to and and so want to would you say are you so we tried it on and I want to to be more precise in Python 2 . 7 you know that have to minus issues that parental from 5 3 just types of knowledge on string from consistent trajectory what 1 conditions so in order to truly run locally so you can run it locally known as to try it on and what that that's a 2 fleeing started and you can watch already run so you can also download and can move through terminal and to get up and running for accessing a bit more about interstellar configured on that on as a right now now you need a manager there that is used in this should be defined among the classes so think package as well so now finds the provides a distance implemented in 1 cluster so you don't have to mainly just this also means that if you make a change of domain knowledge this changed automatically propagated someone just to surprisingly sold what so I hope I was going what I hope that you could keep you interested in so if you have more to watch concurrent website Pinto incubated at you and you want to try out the 5 interfaces and I'm going to this approach with a set of hundreds of times more attentive mutation of the here and thank you for the attention few
Are there any questions? We do not have a microphone, so I am going to repeat the questions. So the question was: why should one use Flink compared to Spark? One thing that we do better is that we are generally more efficient. The next question was whether the data is also partitioned when executed on the cluster; as far as I know, it is partitioned across the cluster. The last question was what happens when we have partitioned data and need to access another partition from another system. I have never looked at that code, to be honest. I have only been on the project for some months, mostly working on the Python interface, so I am not very well informed about that.
Computer animation



