Realtime Distributed Computing At Scale: Storm And Streamparse

Video in TIB AV-Portal: Realtime Distributed Computing At Scale: Storm And Streamparse

Formal Metadata

Realtime Distributed Computing At Scale: Storm And Streamparse
Title of Series
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Realtime Distributed Computing At Scale (in pure Python!): Storm And Streamparse [EuroPython 2017 - Talk - 2017-07-12 - Arengo] [Rimini, Italy] Realtime distributed computing is tough, especially at scale: managing a large data pipeline is tough, and it’s even tougher to keep latency low and availability high when processing tens of thousands of items per second. Many people turn in despair to Java or Scala when it comes time to scale up, but we can do it in Python: Apache Storm is a distributed realtime computation system that can let you scale up- and no need to reach for a new language! This talk will walk the audience through the basics of Apache Storm and how it’s an elegant, useful solution to realtime distributed computing, as well as how streamparse can let you write your storm components in Python by writing some code and a basic storm topology in Python. We’ll also look at how Parsely uses Storm in production to handle billions of realtime events a month. If we have time, we’ll go a bit into how Storm has several advantages over other common Python computing data streaming solutions, like Spark’s microbatching. Goals: At the end of the talk, ideally you should be able to understand: What Apache Storm is, how it works generally, and what scenarios it’s useful for How streamparse can be used to write your Storm topologies How Storm + streamparse is used in an actual high-availability, low-latency production environmen
Thread (computing) Physical law Moment (mathematics) Nominal number Bit Real-time operating system Streaming media Line (geometry) Student's t-test Thread (computing) Algebraic closure Mehrprozessorsystem Interpreter (computing) Selectivity (electronic) Pattern language Quicksort Task (computing) Task (computing)
Standard deviation Slide rule Implementation Trail Link (knot theory) Content (media) Set (mathematics) Water vapor Instance (computer science) Thread (computing) Process (computing) Befehlsprozessor Cuboid Diagram Queue (abstract data type) Quicksort Multiplication Computer architecture Physical system
Point (geometry) Area Trail Code Interface (computing) Data storage device Database Real-time operating system Streaming media Line (geometry) Software maintenance Computer Pi Process (computing) Algebraic closure Different (Kate Ryan album) Diagram Queue (abstract data type) Quicksort Physical system Computer architecture
Web page View (database) Multiplication sign Analytic set Digital signal Mereology Event horizon Power (physics) 2 (number) Web 2.0 Duality (mathematics) Self-organization Text editor
Scale (map) Group action Freeware Scaling (geometry) GUI widget Multiplication sign Staff (military) Formal language Product (business) Formal language Programmer (hardware) Website Physical system
Scale (map) Slide rule View (database) Web page Virtual machine Formal language Message passing Event horizon Causality Read-only memory Network topology Query language Pattern language Process (computing)
Tuple Code Transformation (genetics) Connectivity (graph theory) Source code Real-time operating system Streaming media Student's t-test Product (business) Goodness of fit Blog Network topology Different (Kate Ryan album) Core dump Software testing Abstraction Physical system Covering space Parsing Graph (mathematics) Common Language Infrastructure Code Sound effect Analytic set Connected space Process (computing) Network topology System programming Self-organization output Quicksort Row (database)
Tuple Source code Source code Sampling (statistics) Electronic mailing list Streaming media Streaming media Incidence algebra Mereology Event horizon Field (computer science) Coefficient of determination Word Type theory Algebraic closure Single-precision floating-point format Network topology Self-organization Data logger Integer Cycle (graph theory) output Analytic continuation
Tuple Trail Slide rule Group action Water vapor Function (mathematics) Streaming media Counting Field (computer science) Number Doppler-Effekt Network topology Natural number output Descriptive statistics Social class Task (computing) Exception handling Physical system User interface Multiplication Topology Graph (mathematics) Spacetime Parallel computing Forcing (mathematics) Consistency Content (media) Counting Streaming media Total S.A. Word Process (computing) Auditory masking Personal digital assistant Function (mathematics) Network topology Chain Order (biology) Universe (mathematics) output Quicksort Arithmetic progression Tuple
Tuple Self-balancing binary search tree Scaling (geometry) Parallel computing Connectivity (graph theory) Parallel port Computer network Drop (liquid) Computer System call Data model Connected space Process (computing) Resource allocation Network topology Website Process (computing) Vertex (graph theory) Tuple Physical system Task (computing)
Context awareness Code Multiplication sign Parsing Formal language Connected space Process (computing) Series (mathematics) Local ring Multiplication Parsing Pattern recognition NP-hard Spacetime Common Language Infrastructure Software developer Parallel port Streaming media Process (computing) Software framework Configuration space Website Task (computing) Reverse engineering Topology Implementation Connectivity (graph theory) Virtual machine Streaming media Sparse matrix Event horizon Twitter Causality Network topology Gastropod shell Communications protocol Compilation album Context awareness Beer stein Multiplication Focus (optics) Interface (computing) Projective plane Java applet Line (geometry) Directory service Component-based software engineering Integrated development environment Algebraic closure Function (mathematics) Network topology Video game Gastropod shell Communications protocol
Group action Dot product Demoscene Freezing
Tuple Constructor (object-oriented programming) Multiplication sign Source code Database Real-time operating system Insertion loss Parameter (computer programming) Computer font Mereology Fraction (mathematics) Virtual reality Type theory Semiconductor memory Social class Source code Real number Constructor (object-oriented programming) Virtualization Statistics Type theory Sparse matrix Message passing Configuration space Stapeldatei Computer file Device driver Online help Streaming media Sparse matrix Regular graph Event horizon Product (business) 2 (number) Revision control Read-only memory Network topology Operator (mathematics) Newton's law of universal gravitation Installation art Stapeldatei Total S.A. Database Software maintenance Single-precision floating-point format Medical imaging Personal digital assistant Network topology Universe (mathematics) Electronic visual display Abstraction
Point (geometry) Tuple Trail Slide rule Parsing Digital filter Topology Group action Topology Overhead (computing) Civil engineering Cellular automaton Video tracking Message passing Arithmetic mean Process (computing) Network topology Network topology Personal digital assistant Message passing Communications protocol Tuple Condition number
Web page Functional (mathematics) Group action Overhead (computing) Link (knot theory) View (database) Multiplication sign Similarity (geometry) Student's t-test Graph coloring Computer programming Product (business) 2 (number) Power (physics) Force Different (Kate Ryan album) Core dump Software testing Associative property Traffic reporting Mathematical optimization Physical system Pairwise comparison Focus (optics) File format Binary code Sampling (statistics) Mathematical analysis Data storage device Counting Sound effect Basis <Mathematik> Database Instance (computer science) System call Process (computing) Personal digital assistant Network topology Right angle Figurate number Abstraction Reverse engineering
how uh thinks everybody 10 thanks for coming to hear me talk but I promise I will be as fun as the beach party so that so quick show of hands so who here is either dealing with war is going to have to deal with a lot of real-time data very shortly there so little bit of a selection bias against that and how many of you are despairing that you may have to turn to let go always follow the closure some other day 1 and it's a there's a better way uh you can actually values storm and stream priors to have to do that without writing a single line of smaller closure uh X a quick background flows you on familiar with the the difficulties of doing I have found that the multicore programming and sure somebody is having a fistfight about this very topic right now some of the village at the private has the Gill that does everybody know what to do is related to the patterns global interval yes students for those of you don't L. Python doesn't actually do true and the moltaí and doesn't you everything at once there's a global interpre law that prevents they could from executing on different as simultaneously this nominal issue for things like that I O-bound tasks like fetching data from the internet or that you get from this sort of thing but the the moment you tried to do any sort of computationally intensive task of more than 1 thread and you have
my contention get slowdowns you get all sorts of terminus that usually the way this has been dealt with and I'm sure many of you you're familiar with this as cue water
system which bypasses the which bypasses the gills the traditional sense that you see that you that you would like readers or rather than Q 1 of those things that you have workers like got your salary that pull things off a you do some sort of processing on them and push them into the Q which other work triples from and so on and so forth and this gets a little here so and also we're not as interested in being the girl anymore but we're going to max out you enlarge the boxes so we need something that can always got a multi-core became skeleton multi-node or cluster implementation as a set of visible matter even the biggest interviewers instance eventually that is good before continued does a really great and just dive into the into the Gill and like what contention and all the garden surrounded or secure gave that have a link to the ocean the slide that In 3rd the Q 1 worker architecture that I can lead to really really uh complex diagrams
uh this is this is a diagram of what the person architecture look like a couple years before I came on board and and it was quite frankly I is a tough town were no engineers with it was tough to keep track of how you accuse you of workers point accuse perfect new cues pushing to all sort of different databases you to keep track of which to the point where 1 and employed when and it was this is the stuff and it's maintenance and just mentality to keep going and going faster and store
fantastic and storm is purpose-built to distributed real-time computation system that simplifies this whole worker and you business and and now think surpasses work with stream person pies from we have a completely Native Interface uh too strong lets you area code the plate to cluster and and then into in line of closure or job or anything you don't that is a quick
quick background us and we use strong for that we are a web analytics company and and we do kinds analytics for dual storytellers solar Conde Nast Mashable and we ingest tons and tons of the view part B an event data for these publishers so they can more easily tracked and these are loyal to you know engage time on page 70 pages in seconds otalgia posts got that's and we use this this data to power
dashed was like these which are available to editors writers anybody in the organization to see the performance and also power on-site API
as as so if you ever seen what's trending or what's popular or you might like we offer a robust API system to pull data out of the end of our system and into public from and widgets we also offer just recently and I than a did a pipeline which is accessed for raw data and so that only now are we are we processing data to these staff BI without enriching the bare-metal data for our customers to build their own nationals of products over and whatever detectable and a lot of times when I'm with other programmers that's not what we do and the scale at which we do it and here a few things that maybe you guys have heard that by thank induced
the free lunch is over Python doesn't scale this just a scripting language or group glue language and what you should use go rest scholar see name someone's recommended to us that supply can still and that's a screenshot of a
child when across the machines and it's all over cause next out by using justice from a couple strong topology to what we going to the cluster the and skills quite this is
just a quick overview of the that we that we took in in 20 16 slide there bound I think just just to the US election day lonely we're running at a gratuity billion events and and sold and sold in 5 so pattern can still a strong can help you so consumers and the idea that you do that so extreme pass is a
Python it's an open source and we've developed to help you get off the ground using storm as an implies that the two-pass real-time streams of data you can integrate your own Python code putschist which funds and Javier and so there's a multiplying effects that great quicksort and documentation uh good command line tool as production testing we use it every day to very mature and it it runs the absolute core of a real-time stream system so that any it's good for anything that requires subsets of segmented their leads so analytics lots centers recent so organ
just could be trying cover is what star-topology had only use it to process my data has a internal how can we use Python with it quick from and some examples of history systems that we do but at person and how many of you are familiar with some of the stuff quite a few perfect 11 and try to students look quickly for those of you who don't and storm use a couple different instructions for how to describe the overall computational grafted to create I a topic which is just at an individual record of data that's passing through the system topology because now which is the source of data you get a ball which is a component that processes input and sons along the topology of the input can be a static audible and and topology which is the the overall design it's a collection of components that create the competition graph and here's kind of example of what that might look like the best out of which can be any source of incoming data can be read as it can be used to be laughter and what have you arrested after overwhelming with the stuff people use that have to be and you got different bolts that it passes the data do differences of processing filtering transformation whatever it is reduce the data and use the final step is that some sort of of ETL process them into the Cassandra ElasticSearch would have so
it'll just a single data record and you can think of it as like a like a patentable so the fields spec years the word the was dog if its wording you kept organ integer as the as the 2nd and here's and here's an example of a spell well maybe encodings that this is kind of an arbitrary Scott comes from our quick from a sample of workout topology and you can see here and this is basically just a stout that cycles through a predefined list of words and it's the words that's the important part of course being that's written entirely in Python and you have to go to closure or scholar anything talk to so the about that yet
starts can be incident but it doesn't have to be continuous and a lot of people and this is the fusion you this totally OK for a start to to stay idle or sleep a little bit ample on the incoming data source for new that to be a continuous stream of events so you can use it for you things that just get and that like burst streaming on and you can still use and it will be 2 of phone and fax users that His example the
ball and so you can see here the the bolts that takes on a topic that we just passed that from the from the that and the process that you override from the from the old class uh you can just do whatever you need to do with the work in this case were arbitrarily incrementing a counter as based on whatever the were happy to be have that you can do things far more complex than that of course and once you've got processed the talk will you can pass on so you can you can create very complex competition graph this the topology is simply the description of the for the rest of the graph itself so you can see here nothing crazy and we've got a word count topology we go words that and then we got a word count bold that takes input from words that groups in the work field which will cover this 2nd and it has a carols moved and so you can describe the entire topology was very simple and you can have big topologies with many bolts and about and to be because as just a quick overview of that so you've got and you input the output so the Tumpel has the word as its input that cumple outputs words you can see how it outputs the the enriched triple so the the topple comes into that in the original from the from the wall and something that I would like to touch on is the grouping and Saunders has really cool cultural just gets the next slide you can actually I'm for stateful processing you can and make sure that based on on certain values of the trouble that it gets passed to the same task for a for a consistency so some for grouping on word that value will hashed and stromal make sure that test and for that for that value will keep going to the same to the same task which is which can be really useful for the ball shuffle just means that the tuples order passed around to test the Doppler downstairs to topple allocated at the dance to so that no then you know no 1 knows that without something really cool about and storm is it actually acts and fails is a very robust messaging system as the temples move through the move through the graph so storm actually acts fails every individual tuple which makes this really makes a really easy to do other liableness storm you can do it at least once a exactly once the sort of thing you can you can keep track of a but every couple as going through the chain stream force actually implements of automatic autofill for you but you can also do your own and you can override those and have and you can't you can specify when winnable axle winnable fails and the total and then you can replace reference them so it's really fantastic it's say this quite often that the fact it's kind of like by its very nature but the tuple trees nature you can actually track the and the progress of multiple system that I will linger too long here the storm also has a really nice you I if you want a deployed storm to Europe universe mask is called and that's what you put in this you can log in viral cool web interface and you can see topologies summaries you can see the star summaries you can see the number of tuples actin failed multiple submitted you can see the exceptions are mutually the workers itself it's so it's a great tool for doing real-time distributed computing and and the way the way it does this is use of a cluster are given in the snow to the master node many of worker nodes each with a certain number of slots that you specify depending on how come at computationally intensive at what you're doing is and you simply deployed the topologies to this from a cluster and the topologies based on what you've configured will use up the slots of the the of the worker nodes to do the competition and you can run into
an issues of contention there the force so you try to play a 5th topology and we can do what you need more space so you do it all you'll just can't sit there with no water processes as you can you can then rebalance the storm the strong closer to take into account and you and so straws recall uh if any of
you have been searching for a real-time distributed system that can do this at scale you found a strong guarantees processing tuple trees that you can tune the parallelism for components so you can have fewer slots or more slots for less or more computation intensive tasks title ability if a node drops to the strong cluster can take care of it uh it actually uses Python process lots at each each slot actually spawns new Python's and process so sites of studio soul running at small multiprocessors suffered and you can rebound computation-intensive tasks cluster cluster and handles actin filling in the requesting automatic and accept until now it was really hard to use with Python which brings us to our main thrust storm donors
and we 1 use by this right nobody wants right go over scholars off you but that's not that's not over to stick it all the people is that we can use python for this so there is something called a multi-line phone call and instruments implements
and this allows you to interface with a strong cluster and it is a language in stormed ship was something if it if you tried to do this before you've probably encounters Michael strong that by the body is modern comes from the that so there's a couple of issues from that uh should this like this but it's not really the Pythonic it's more reference implementation anything it's a to be gently packet DRA this than this quite a few issues and so we decided that strong up I wasn't really taking into account uh all Albany things that storm small to my protocol last because most like protocol on space is actually equal it supports and but this is jason passing back and forth between the documentation and the Python shell that work specialties components communicates over standard and standard out so it's quite clean very Unix it doesn't use up high forger which are out of anybody any but you use this for but if you've ever heard he had a job a trace and you like how did my code cause that you fit you won't have it and it's also principle endless and so so redundant but yet when prices protest speaks of Jason individual parallelism that the handshake the handshake between the the Python process and the the documentation will control configuration in the context of the puck process as socially but up until a couple years ago there was a really great way is the and I should note that that there's 2 projects that we have uh despite strong which is really under the hood it's the multi like protocol by which we communicate that to storm bold but we're gonna focus on the remainder of the lecture is an extreme bias which is your interface to storm that uses price from under the hood the we do maintain those that so that's reverse we initially releases couple years ago in 2014 and it's over 2 years of active development and again heavily heavily battle-tested and Stella stars and didn't think it shall contributors 31 contributors and we have 3 and it is actively maintained yet it battle-tested we passed times of data every day that many millions of events and and it does great and I was going to do
uh a life installation of stream past but in the interest of time I understand what I just pre-prepared some of these trends for years and they are all you have to do it has lines you the only outside the tendency is lined with line in which is a closure compilers that but once once you have line installed just creating a virtual violent granted install stream purse and then do spots which surpasses the is the command line this this person a quick and then you can just you can just change the directory decreasing you can actually use the locally most of topology recognition both to mention this you do need to have the strong development environment set up a new machine but you can just download that from the putschist website it's too quick this just going strong then you path and do have a gift for this in
action that's freezing but you can see here you can see it's compiling the dots 240 behind the scenes and then send it off to the races chair
and you nose all the course where were being utilized to the the has the go and it was a 5 minute from 105 minutes but from that install to foster that which is
you we think witness and and the same idea once we have a real cluster set up and configure and you can simply type sparse and is a small contiguous sound file this documents in the help docs even that you get the but obviously lost in the parameters that are and so forth and it does all this which is really great applying and makes a version across the cluster installs requirements of the driver source code actually opens the tunnel to the Nimbus constructs of the qualities that memory and up was the Jordan and this and the topology so it's actually really great for marketing that and applying unitary that tourist afterward additive quirements or writing font file that manually updates all the virtual violence sparse can do for you so that it does take longer headed out of prior to place qualities and maintain and there's a lot of of say uh there's a there's a a lot of other commands that has this just a quick overview of them that some of which are the diagnostic some of which are functional but I encourage you to explore the mall of the and so this is an example of how we use in production and so we have a couple of starts coming in and we actually we actually take the case you events in real time and pass through the topology and and batch encirclement ElasticSearch cluster ends and like you can see it's not very very difficult I don't know the complicated diagram and but it's extremely robust and it's 1 that we it's and abstraction of course but it's in a fraction of 1 that we use in production every day to be used in that
so there's a universe regular balls that you just reading now you can also have Pachelbel's which is great for database of there is don't like to be and had hammered with 1 at a time and insert and update requests seeking batch you can better budget of troubles everyone 2 seconds or so and dual and do a 1 batch operation on them an and so we there's there's a classes for that patchable into the spectral and and answer that we act and fill every individual total so there's also already class stream parts for 1st for measuring reliability it's a start that will automatically be play any trouble up to a certain amount of retries that you specify which makes a great for easy using low-latency reliable messaging and there are
a couple of and on the wire here there should there a
couple of performance considerations and a couple of you during multi-line protocol might have thought hades in passing J. sound a lot of overhead to serialize industrialized as things move move through the topology the answer is yes and so if you're processing lots of small messages it's better to use the vegetable and and action like once every 2nd or so and and it's best to filter out a lot of those if you processing all tuples fill them out point really and so you're actually does not passing through as many as many tuples and don't like grouping overall you and it's usually nice means of civil processing but if your data is imbalanced available if this 1 values all for common grouping will swamp that when executed so you shuffle and as you have to an end you several small topologies as of 1 huge 1 the because some amount of work that if a tuple felsic because everyone when it toppled replayed it gets replayed all the way from the beginning of cell and and also storm is more efficient for a tuple tracking that spot topologies which is accurate condition reactions from that's themselves and final slide
uh cool so for do when implemented and like Kafka reader after reader and writer bolts which is where the huge cases 1st reports and this effect which is a binary association format that cuts down the overhead Jason and this is an yeah that's true price from and where possible account so was named few minutes the the so I scientists and white is that you stole and not optimal stock that's actually act if I had the time I was going to go into why we did use like sparked my prevention and so we actually do you spot in our not historical analysis later but we found that spark my production doesn't really make use of 2nd of latency and so just uh just sense system really but I think if you if you know about latency and you're ready you start think my productions perfectly from of here's another 1 he so you mention that I would be felt like a real-time system that aggravates that for example we have like a hundred requests how 100 thousand requests per 2nd for the end you do later gets a lot of basis for example you would things like power or whatever and I want to the telecom and powerful in the 2nd minute in not so they do like attack Saul fast Solarix to see more of all I and I can tell you how we do have act as we we do something similar so we went outside the engagement and page views that we we actually aggregators no less no not quite a hundred thousand requests 2nd but were the the ballpark and the way we do is we actually have been that's reverse rights to and a Cassandra cluster which then everyday and I don't know exactly many seconds that every x nanoseconds we create like a role of document search which we then insert with the aggregate count and the bucket of attention so that we we you can actually do a that whole pipeline actually handles elected you quite well it actually a link figure out the US issues of storage if you have if you have 1 less search document prepared you which is why we do the aggregate roles but as a reason that when in a similar situation when we're doing like aggregate role documents in database more questions in and a think the talk 1st and then the question is if we want to take political PWS and support them out in what about the possibilities still wife to support of scaling and I'm actually not sure as far as schooling goes unfortunate and we we do all our node scaling by hand because we use reserved instances and so I'm actually matches in this I want to skip the whole world that was the 1st program in comparison to the normal most form all Trident only spot color of the assessing and how the program will students come close to the Titan called the formal legal and then moves on to OK yeah an action that superfluid Trident seconds the strident and but that the kind of that kind of an I think the core abstraction and you compare the strong topology to in and spark be the ITD and in a similar way and the affront RTD you'd like such data and then perform like different functions on it you pet your past like you know you would like to call different functions like MapReduce whatever on the spot party that that the equivalent to that would be storms topology which is you have spouts and bolts and uh you could you could I suppose reimagined the balls as functions so this temple is getting process through these functions up until the end where gets loaded or otherwise evaluated somewhere and the supposedly the closest attraction PTV 1 lost short question so and here it's a good 1 so that's here you have the 1st sample last test and so where basically engineers that focus on so we have product in does like actual co-development work and features and we develop anything that is customer-facing applied reasons why would work on some those dashboards or that the data pipeline ETL or BPI stuff would all fall into the realm so success and user and user also more sponsored customers problems and also work on features that directly impact OK things other than