Processing Streaming Data at a Large Scale with Kafka

Video in TIB AV-Portal: Processing Streaming Data at a Large Scale with Kafka

Formal Metadata

Processing Streaming Data at a Large Scale with Kafka
Title of Series
Part Number
Number of Parts
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Using a standard Rails stack is great, but when you want to process streams of data at a large scale you'll hit the stack's limitations. What if you want to build an analytics system on a global scale and want to stay within the Ruby world you know and love? In this talk we'll see how we can leverage Kafka to build and painlessly scale an analytics pipeline. We'll talk about Kafka's unique properties that make this possible, and we'll go through a full demo application step by step. At the end of the talk you'll have a good idea of when and how to get started with Kafka yourself.
Computer animation
Computer animation Product (business)
Server (computing) Multiplication Service (economics) Electric generator Multiplication sign Source code Database Streaming media Extreme programming Mereology Product (business) Computer animation Personal digital assistant Blog Website Error message Form (programming)
Computer animation Total S.A. Line (geometry) IP address Social class
Scripting language Randomization Multiplication sign Gene cluster Counting Bit Database Line (geometry) Web browser Streaming media Number Human migration Frequency Mathematics Computer animation Different (Kate Ryan album) Right angle Text editor Data integrity Social class
Point (geometry) Thermodynamischer Prozess Server (computing) Computer animation Database Quicksort
Server (computing) Data management Computer animation Video game Mathematical optimization
Point (geometry) Satellite Curve Stapeldatei Multiplication sign Database Streaming media Line (geometry) 2 (number) Neuroinformatik Frequency Category of being Computer animation Personal digital assistant Different (Kate Ryan album) Calculation Single-precision floating-point format Physical system
Group action Server (computing) Data storage device Bit Database Streaming media Line (geometry) Computer programming Number Category of being Message passing Word Computer animation Right angle Quicksort Object (grammar) Figurate number Service-oriented architecture Table (information) Partition (number theory)
Category of being Group action Server (computing) Roundness (object) Computer animation Channel capacity Personal digital assistant Network topology Bit Service-oriented architecture Field (computer science) Physical system
Trail Group action Dependent and independent variables Server (computing) Moment (mathematics) Client (computing) Message passing Computer animation Personal digital assistant Different (Kate Ryan album) System identification Data structure Service-oriented architecture Arithmetic progression Partition (number theory) Physical system
Server (computing) Computer animation Internetworking Personal digital assistant Plotter Direction (geometry) Energy level Mereology Service-oriented architecture Family
Trail Source code Analytic set Time series Counting Login IP address Computer animation Bit rate Endliche Modelltheorie Table (information) Physical system Task (computing)
Standard deviation Matching (graph theory) Counting Sound effect Database Code Preprocessor Computer animation Hash function Bit rate Network topology Point cloud Endliche Modelltheorie Row (database) Computer architecture Task (computing)
Point (geometry) Greatest element Server (computing) Computer file Code Multiplication sign 1 (number) Mereology Login System call Web 2.0 Message passing Programmschleife Computer animation Deadlock Right angle Partition (number theory)
Web page Area Key (cryptography) Euler angles Structural load Planning Line (geometry) Instance (computer science) IP address Message passing Computer animation Bit rate Object (grammar) Address space Task (computing)
State of matter Multiplication sign Moment (mathematics) Data storage device Database 2 (number) Message passing Programmschleife Loop (music) Computer animation Hash function Buffer solution Row (database) Task (computing)
Game controller Standard deviation Computer animation
Computer animation Open set
Area Laptop Server (computing) Presentation of a group Hoax Channel capacity Multiplication sign Electronic mailing list Data storage device Price index Web browser 2 (number) Wave packet Neuroinformatik Preprocessor Propagator Different (Kate Ryan album) MiniDisc Right angle Logic gate Resultant Physical system
Presentation of a group Service (economics) Gamma function File format Multiplication sign Bit Database Message passing Different (Kate Ryan album) Operator (mathematics) Internet service provider Configuration space Traffic reporting Form (programming)
and the and the ph a so what I'm going to talk about those things in the data of the all the little about myself 1st
so um when status I would have signal so my name is much like
this so this is like a tutorial fried do that the I I'm from Amsterdam Netherlands and actually today is the biggest holiday of the year so so Emsalom currently look look like
this like all of the city and I'm skipping this body to be here with you today so they all come from the so we will monitor product
for will be the lecture apps um and as as always with these kinds of uh products you you start with the question how are how hard it this be and you assume that's not really going to be hard and then of course it always actually is so it turns out if you do a product that you use posses a lot of streaming data and so so our motivation works is that we have an agent that people installed on the server it's viable it is running on all these customers who she and they're posting data was regularly woodlot errors that happened and stuff was slow down we process that income for merged together to to make the wire and that and doing their thing from and that's streaming data service related issues we defined like this so uh is generated continuously 1 of uh on a regular and follow in comes from multiple data data sources which all which were all posts in its synchronously simultaneously and that is some some classical problems as shown associated with this so so 1 of his 1 is database locking and if you do a lot of small updates then the database is going be lost a big part of time and effort would come with slow you can't have to love them this stuff around and make sure that ideally you make sure that stuff and so the same worker servers uh so you can use with modest offered and so let's look at the really simple Stream data
generation will use the rest to talk to uh to as in this case so we've got a very popular website as visitors from all over the world it also has a service all over the world which which are an interesting for the visitors um and we wanted to surprise on blogs basically so uh we we have forms the extreme
of lines coming in which have hardly the visitors IP address neural they visited I notice the standard stuff and we actually want to into this class so
this is a nice this is a graph of like the total amount visits we had from 4 different countries and this is actually hard to do
um and the answer is is that on a small scale it it's not it's it's actually quite easy so the simple approaches is
to just the dates uh uh database like for every single wildlife the so that looks a little bit like this so basically is doing a day period as a as a country stable it has a country growth and can't feel a just a big thing every single value you get and you get a visitor from that comes from from that some countries um the issue is is that a database uh has to make sure that all they that in data integrity is kept so so it don't actually understand that all the streams are kind of continuously and they do do the log line actually never arrested to go go back in time there's something but the database has to take into account that that that that data that already exists could be dated again so it has to do more part of around the world and if you do this at really high scale than the whole database will just long down and there will be no time left to actually do dates so we ran into this number of times during our existence and intentional so 1 thing you can do next is shot in the data so basically is put put all the bits visitors in the David 1 you put the you was once in number 2 and you can kind of just skated out will be there in the data and on some kind of access and just put it in different editors clusters the 1st and the 2nd dump sites so what happens is that if you want craters data you wanna get out of you feel that in there you might have to end of varying different database classes and and to manually merging all the stuff which can of really slow complicated and a classical 1 as well as changing chip shot and so if we now decided we actually want of counts per browser that people use all along and then we have to completely changes random right big migration scripts and it's going to be really complicated but I and actually we actually
do a lot more than just just increment the counter so we we do have to do a lot of processing on on this data as it comes in so we not only have a bottleneck in the database itself we also have a bottleneck in the process in the country for and days so we start sort of started doing
this at some point so we would have workers server uh and and customers traffic continue 1 of the servers and the flesh to add to this question doesn't really big issue with this which is of course
that the worker server could not and and uh like discussion is going to be really unhappy because they just had a gap gap in reporting for maybe 15 minutes um so what you can then do is
just put another management from it and all the traffic will be randomly distributed to all the worker nodes uh and and this works well but then still worker doesn't get all the christmas data so as to get you like really really smart optimizations I and yet so the data is fragmented then and we cannot do this stuff and when you do but actually a life would be really awesome
if if this were true so we get all the data from 1 customer and the same in the same worker because then you can start moves from pretty smart stuff so really simple smart thing we can do is this so so basically and the standard way to do it is just increment the counter and every single time lot line line in assistance but if we had all the data locally we could discuss that from the the and uh we just have this single of the period of the curve were maybe every 10 seconds and doubling decrease allowed on the bed so this is kind of what we wanna do we wanna be able to to batch uh the streams some kinds caching and as local calculations and then just why write it out to to a standard um I we actually need this to do like the statistical clicks we wanted to so if you wanna do proper % also histograms and kind of 4 satellite data on 1 computer at some point because always you can do the calculation so world there so we're back here so we want like about all the customers in streams to is to a single worker uh which get into a database and actually if we can do it is that you know we don't really need to start more precise so we could maybe get away with just having a single database but of course were better started at the beginning of the talk because we uh we now still at a single point failure to think and feel and customer will be really unhappy
so we need something special and that will be got out for us so we look at a lot of different systems and got got some unique properties that
uh that allow you to do a Google specialists so car accident makes it possible to love and stuff and and do the right thing into the feel of a properly and I'm saying makes it possible like not makes it easy it's still like they are it's and this is possible which is better impossible in my in my book um so I will now explain that might explain property you and that's actually is sort of complicated thing about it and that which is that as for different concepts that you all can lead to get in and also in relation to each other or to be able to understand the whole face so especially bit of a hard thing to wrap remember to wrap your head around if you must use to so very clear and I'll try to make clear to you
so these are the foreign deforming concept you can the beavers really bad idea and but they will show up so the 1st thing is topic so the topic is kind of like a database table is just this a grouping of of stuff so you so topical contain streams stream of data at the which can be anything could be log line or some kind of Jason object or whatever uh this could all be a topic and all these messages uh which on the topic there they're in different partitions so a topic is partitioned in say 16 a 16 pieces and uh a message will always end up on 1 of these petitions uh and the interesting thing is that you can choose how to partition data so if I have a message that has the same people all always end up in the same topic so we could get from group like our US visitors together from 1 to will look at Addis works a little bit a broker as it is to cut the word for the 1st server basically I'm not sure why they picked a different word with the United so a broker stores all this data and and make sure that the that that that can be delivered to define a concept which is a consumer and consumer is a cocker words for basically just applying a database prior for its way uh some you from your crowds to to give to read his messages that are on the topic and got a kind of likes to uh offenders words for some reason so a lot of these things already have a name that they also have a which can be a little confusing uh Silesian for concepts of a mouse go into them in more detail so this is what I what topic looks like it it it this this specific topic 3 petitions and it actually and it has a stream of messages coming in which all of an offset so the offset is the number you see there which starts at 0 and it just automatically increments up so the new data at right side of this and all the data is going out at left side and you company figure how long we want this data to stay around so usually would make this so this would stay around for a few days and then after say 3 days that that that had left side would just use be cleaned up would just fall off all of the fall of retention so we were uh messages are coming from the right side and so if we group these messages by country as we do here and then then they will actually always and opens on the same petition and that's a really important thing was like uh because what could double become apparent when we discuss a consumer so Mexico's broker so uh broker is is to proper a server and petitions and messages limit the servers and broker is always to primary for some petitions and secondary foreigners and that looks like this so say we have 3 programs
so a broker 1 will get 1 2 tree as as primary property will get 4 to 6 and the tree will get 7 to 9 and actually all the brokers will be secondary for 4 4 from other brokers Birmingham petitions so that means that if if 1 of a broken brokers died actually can read all the data and and was all still be there so this case broke free diet uh and uh a broker to both got some extra petitions so it if if you still have enough capacity of system after this failure and and all fields will still be working there will be no lectern will not will be actually broken might be the case that that you were a maybe a little bit less action this capacity is needed and then off you might slow down but if she planet properly then than uh this is this is still a fully working the other thing uh this
also works the other way round so if you go from 3 to 6 servers proceed you got a new BQ customary need not capacity it will also decided medically spanoudis petitions of these brokers without you really have to do and here any work for it so the 4th and final concepts I will now tell
you much more in detail as consumer so the consumers discover a client is basically comparable to that of a structural metal writers clients it's uh it it it lets you listen to the public uh and and 1 of the great things but tough guys that you can have multiple consumers which I'll keep track of their own offset so in this case so we have to consumers like 1 is responsible Senate Select notifications and the other 1 is responsible sending identification know so they both start at the beginning of offset 0 then it actually turns out that as sectors that moment so we can use our API so in this case the the slack consumer is still stalled at the offset 0 it's still so it's just it's just waiting there because cannot contain the moment Fisher actually gone I just find it on evaluation of and then a slap slap comes back on the uh this lecture will actually will we'll make some progress and finally they uh there will be at the end of the of the of the cue when from all messages common so this is pretty neat if you are if you integrate a lot of external systems which can and make sure that that's the last 1 I additional and a certain vendor is is not going to impact the all the other and so this example only uh as a single petition so obviously you probably don't have more petitions so how about that um and got has thing for that as well and which is a consumer group so assume can be groups so you give the name and uh will understand a different consumers running on different servers with same name uh our our are are related to each other and will uh send petitions to them this actually looks a lot like other broker
works so so if you have a topic when I'm petitions and the consumers and offer consumers will get one-third of the petitions um and if 1 dies the same thing happens that's so these consumers get assigned uh uh the precipitation from broken consumer and everything will just keep working and so consumer always gets a full petition and and and and since you can control uh to which partitions sedated data go at this election to like news right infinitive about earlier where you make sure that all the customers data and have on the same server for so then this this so we in this
situation so where we actually have a very similar situation to the 1 we started out with 4 we have like a few workers servers and 1 of them down dies but actually in this case the question was not going to be unhappy because to plot of the deck that the consumer is down and will be assigned a petition to a different worker uh did this level within a matter of of same in it and nobody really notices that often fail because because just rerouted to something that is actually still working yeah so now we're going to be interest part like seeing how they actually would musician will be this is clear to everyone so far on the internet often what was of and then there is basically
monopoly direct relationship so that the the 2 brothers and the petitions both use this concept uh sorry brokers and consumers both use this concept of petition in a very similar way but actually to the consumer it's it's not really relevant when the data stored in just knows that the the Kafka broker will damage when fetch data from social from the consumer side this just transparent the I um yes
so or actually got to build this this this this analytics system I just show researchers here so we've got the access logs
but right here so this also marks they have an IP address and we can see the euro
and we want and opened this stable so it's it's a really simple table reward just keep keep track of how many visitors we get from the US it doesn't even take like time-series source into account it's just like a total count for all visitors the simplest thing you can do and we use to copper topics entry rate tasks to to make this happen and the end goal is to just a day that you really simple actually model so the a model looks like like
this so the stakes in and model has a country codes and his kind of feel the annotation a hash of country guns so we will look for batch and try to fetch the country by clouds of dust and accessible get created and every increment of his counts by the total count it was there was a match so this is this this is like really standard usage of effective record like nothing fancy going on here I the and
this is like the the architecture of the whole thing so we we have to tree rate for tasks on the left side and to got got the topics so the on the right side and then as a model that both of so the 1st we important and these will be written out to to uh topic then there's some preprocessing going on and finally went and gave them and I write it all out to the database you might wonder why do you need the preprocessing steps because and you could basically also just just light at the database straight away and the reason
for this is that often a the data isn't isn't isn't spread out evenly so if you look at the bottom this in this example actually monomers for visitors right from the United States so if you would know immediately rather obvious traffic to a single partition of 1 worker servo will have a 6 times as much work to do as as another 1 and if you need to do some CPU-intensive stuff that 1 worker server might have like use love for the other ones are are almost nothing and that has being a really costlier some passionate fix it you really have to do something else and this is why we were doing uh some part of the work before it actually before we actually do data I'll get to that in so step 1 is so it's important user logs and
this is kind of like cheating a little red because in reality with this this this isn't really streaming data because we I I put a down to a bunch of fashion German put on my left and would this code does is it it loops through all these locked files and just right Simitis messages got 1 by 1 but in reality this all other stuff will be streaming in uh from uh from all kinds of Web so on 1 6 you'll see the copper and only if a message call so this this tells got right think that's a deadline of deadlock climbed to the top of Rob at some point it is done and then it's so it actually important all data step 2 is then
uh of doing the pre-processing so so we now and only still haven't wrong what line and does not being addressed in there and you and we still need to define as what which found countries was actually from so
this uh this is so this is the 2nd step so we've got uh wreck x here that possible plan I also find it somewhere online so this this spits out of the load line into the into a few different segments and then we said at the GUI instance UIP is said that is a way to get a gets get somebody's location-based or IP address and then we said a consumer and we ask it to to read data from the RAF rock page you stop so this is a topic that we were just 1 and then to earlier and were actually getting it in this and the 2nd rate rate task and and then for every message with possible offline with and then we just turn it into a nicely formatted cash so does the dining area at the address country priors on your so so we actually have a proper properly formatted the data we can do something with the final step then we we write this out the 2nd topic so the 2nd comparable contains Jason objects a of that that are most informative but all 151 that actually magic things happening because we're selling petition key so this will will this will tell tell Kafka uh this will help got understand which data goes together so so attitude has same city don't to go to war and in its importation so we know for sure that we can and get a properly label now we get to the
final step so this uh again we
have consumer so we're not actually consumed debates use of these company destinations a and we set up some storage or this a from 162 62 so my 61 this country counts hatch and it's just it's it's moment really hatch and it uses that are still use 0 syntax which means that if if there are no value is present in that hat in the hash for from key it will actually end up being values 0 instead of no so we also always done started a country or a then it loops through all these messages uh adjacent passes it again goes in and got able certain will be serialized so we actually have to look to the back into really object so there we implement accounts and then we increment uh do that that country comes before we just introduced so so a I will get in country when do using that to do hash lookup and then we incremented by 1 so any time we have a visitor from are from some country like this hash will get will be incremental and then we do this thing every 5 seconds so so this is part of the main loop so so every 5 seconds uh this this thing is invoked um and we call that active record a mobile we introduced earlier so this this will actually write out right the whole current state of the country counts as to the database and then I will clear online and free so after this is done will end up with wooden anti buffer again and we can basically prostitution beats so what happens is is dead that this this annotation task is is just lose reading data for 5 seconds is intricate incrementing the counter that had and then after 5 seconds it is wide current state the database and then and will be in there and then just restarts does the whole thing all over again and and and again if
we go back to the real side uh itself from standard so so this is our wife for this and if you look at the controller
adjusts fetching and the country step for the country steps of available in descending that we also have to get to the next summer next to come so we know like how wide the column should be
an interview is just a really simple insurmountable which uh which we look for and and that's it basically so that's kind of interesting thing on my mind is that you can use the said copper principles to conflate Burford room you 2 months incoming data and the and Odesta's reveals that at the end of it which does no fancy stuff and all on the so let's actually look at the damage so that trade that's open
here so this is like the importer or so so this this kind of fakes uh as you might remember being being staged training data just keeps like pushing Rollo climbs into into a cocker topic the so that we wouldn't preprocessor so if you if it passes from the list disk you'll see that actually this is uh an agonist propagation so this this 1 is from Ivory Coast a Firefox browser so this we can easily work with his Jason data the um we could also add a 2nd 1 the so if we had a 2nd preprocessor it will get half petitions and uh and and we'll just you double the capacity of the whole system while it was actually running on a different server of course was my laptop only so much the and and finally we really had data thank you so this is going to help and a ch so if you look at a very where this is going to well but if you refashioned this a couple of times Contact Centre said result but every 5 seconds the reserve will actually increase was we were not actually running out of every single update which is only running out is Burford updates so if you look at the and gate area it's currently it's is running for a for all the countries of the 2 countries that are in our dataset but again and again we can start a 2nd 1 they'll notice that uh actually the lot due to a list of countries and in the 1st 1 will actually decrease it so the next step take of the indicator in the right of top right that there will actually be less countries and the and 1 slightly up so this this list the store her friend and then here on the 2nd day
which goes from here to here and that is the 2nd engageable start at the top and notice that the petition said to be reassigned and then it actually spent out of 2 workers servers uh and and we just double the capacity and that's what you can do with under the computes from presentation and
few that's the question is like 1 of those of consumer dollars but it doesn't committed to offset and actually didn't discuss community offset and presentation this concludes bit simpler that the data it comes down to that that is that the country's consumer can control when it will actually talk about that and it's done with some data so when a consumer dies you can actually run a little bit and just data again so initially that this works out pretty well because you uncomment Weinshall flesh the database and uh and there will be using yes so the question is is there any restriction to the to the to the form of messages and all the answers no so messages subpoena value on both are are a byte array and you can put anything in there that you like so we we use a lot of further but uh in a Kafka topics so would you can also use Jason or whatever format you like that's the question is like you do your own on operations and how how does that and and that's kind of the disadvantage of he's got it's it's it's like you have to dive into a lot of gamma things and you need to ensue EPA this quite of red associated with it you can you can buy local now and also a WSS often called kinases which is often was basically a cocker report it's something on different and so does this always is by from from service providers and if you want to run it yourself you know you you will be in a bit of time to get of what is running ancestry neuroblast but would like to understand configuration and how to monitor it this it's pretty painful the well thank you here we have my on so