So You’ve Got Yourself a Kafka: Event-Powered Rails Services
Formal Metadata
Title: So You’ve Got Yourself a Kafka: Event-Powered Rails Services
Title of Series: RailsConf 2018
Number of Parts: 88
Author: Stella Cotton
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/37335 (DOI)
Production Year: 2018
Production Place: Pittsburgh
RailsConf 2018, 25 / 88
Transcript: English (auto-generated)
00:16
I hope that you're here to listen to me talk about Kafka
00:20
cause that's the room that you are in. So yeah, first things first, my name is Stella Cotton. I am an engineer at Heroku and like I said, I'm gonna talk to you today about Kafka. You might have heard that Heroku offers Kafka as a service.
00:40
We have got a bunch of hosted plans from like tiny plans to giant plans. We have like an engineering team that's strictly dedicated to doing cool stuff to get Kafka running on Heroku in like super high capacity. I am not on that team. If you were here to see that talk, this is the wrong talk. I don't actually know anything about running a Kafka cluster or tuning Kafka
01:02
to handle like super high load. So who am I? I am a super regular Rails engineer like many of you. I wasn't actually familiar with Kafka at all when I joined Heroku. It was this mysterious technology that suddenly was like everywhere. And all these hosted Kafka solutions, not just on Heroku but at other providers,
01:21
and Kafka-like systems like Kinesis, they just sprang up. And it seemed important but I wasn't sure why. And then when I joined Heroku, I'm suddenly in this world where not only is Kafka an important part of our product offering, but it's actually a really integral part of our system architecture overall. So today I'd like to talk about three areas
01:42
that I hope will help other Rails engineers become more familiar with Kafka. We're gonna start with what Kafka is. We'll talk about how Kafka can power your services. And finally, two practical considerations and challenges that were kind of unfamiliar to me
02:00
when I started using event driven systems. So what is Kafka? The docs on kafka.apache.org, which are actually super good by the way, describe it as a distributed streaming platform. But that doesn't really mean a lot to me as a web developer.
02:22
But one of the classic use cases that people talk about for Kafka is data flow. So if you're running like an e-commerce website for example, you wanna know more about what your users are doing like on that platform. You wanna track each page they visit, each button they click. If you have a lot of users, this can be a lot of data.
02:40
And if we wanted to be able to send that data from our web applications to a data store that our analytics team uses, how could we record and stream that high volume of data? One way is to use Kafka. And the basic data structure that powers Kafka is this idea of an append only log. And when you think about logs, people here,
03:00
like what do you think about? For most web developers, that's gonna be an application log. When something notable happens in our web applications, we're gonna log it in chronological order. We're gonna append each record to the record prior. And then once that record is persisted in this application log, it's gonna be there indefinitely
03:22
until we truncate earlier versions of our logs. And in a similar fashion, Kafka is also an append only log. Kafka has an idea of producers. And those are gonna be applications that produce log events. You can have one producer of events,
03:41
or you can have multiple producers of events. But unlike an application log, which is typically written so that you as a human can consume the log later on using your eyeballs, in the Kafka world, applications are gonna be the consumers of these events. Like with producers, you can have one consumer,
04:01
you can have a bunch of consumers. And a big question for web developers is often like, what's different about Kafka than something like Sidekiq or Resque? And in Sidekiq and Resque, events are typically gonna be added to a queue. But once something picks it off to actually do the work, that disappears. In Kafka, it doesn't matter how many consumers
04:22
are reading these events. Those events are gonna continue to persist for other consumers to consume them until a specific retention period is over. So let's go back to our original example, our e-commerce app. We want this e-commerce application to create an event each time a user does something on the platform.
04:40
We write each of these events, or records, which is what Kafka calls events, to a user event log. And Kafka is gonna call a log a topic, basically. And if you have multiple services, they can all write events to this user event topic. And each of these Kafka records that we write,
05:01
it's gonna have a key and a value, like a hash, and a timestamp. And Kafka's not gonna do any validation on this data. It's just gonna pass along binary data no matter what kind of format it's in. So in this scenario, we're using JSON, but you can use a lot of different kinds of data formats. And the communication that happens between these clients that are writing to Kafka
05:21
and reading off of Kafka is gonna happen over a persistent TCP socket connection. So it's not gonna have a TCP handshake for every single event. So how can our Rails app specifically interact with a Kafka cluster? We can use one of a few Ruby libraries. Ruby Kafka is a lower-level library
05:42
for producing and consuming events. It's gonna give you a lot of flexibility, but it also has a lot more configuration. So if you wanna use a simpler interface without a lot of boilerplate setup, there's Delivery Boy and Racecar, which are also maintained by Zendesk, and also Phobos.
06:00
And these are gonna be wrappers around Ruby Kafka to kind of abstract away a lot of that configuration. You've also got Karafka, which is a different standalone library, and similarly to Ruby Kafka, it has a wrapper called Waterdrop. And it's based on the same implementation that we saw before with Delivery Boy.
06:21
So my team, we actually use a custom gem that predates Ruby Kafka. So I haven't actually used these in production, but Ruby Kafka is gonna be the gem that Heroku is gonna recommend that you use if you look at our Dev Center documentation. So brief look at how you can use Delivery Boy.
06:40
We're gonna use the simple version. Like I mentioned earlier, built on top of Ruby Kafka, and we can use it to publish events to our Kafka topic. First we're gonna install the gem, the usual, run a generator, and it's gonna generate this config file, which you're probably pretty familiar with if you've used databases. And Delivery Boy is meant for getting things up
07:01
and running very, very quickly, so you don't have to do any other configuration except the brokers. Give it the address and the port, which might be localhost if you're running it locally. And the Delivery Boy docs will tell you how you can configure more things, but this is like the MVP. And using Delivery Boy, we can write an event outside the thread
07:21
of our web execution to a user event topic. You can see we passed in user event as the topic. And each of these topics can be made of one or more partitions. And in general you want two or more partitions. Partitions are a way to partition the data that you're sending in a topic,
07:42
and that allows you to scale your topic up as you have more and more events being written to that topic. And you can have multiple services that are writing to the same partitions. And it's the producer's job to say which partition you're gonna send those events to. Delivery Boy's gonna help you balance events across those partitions.
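A minimal sketch of what that publish looks like, assuming a Rails app with the delivery_boy gem installed and its generated config pointing at your brokers (the topic name and payload here are just illustrative):

```ruby
# config/delivery_boy.yml (generated by the gem's install generator) — the brokers
# list is the only thing you have to fill in to get going:
#
#   development:
#     brokers:
#       - localhost:9092

require "delivery_boy"

# Publish an event to the "user_event" topic. deliver_async hands the write off
# outside the request thread, so it won't block the web response.
DeliveryBoy.deliver_async(
  { event: "page_viewed", user_id: 42, path: "/checkout" }.to_json,
  topic: "user_event"
)
```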
08:01
You can let it just assign your event randomly to a partition, or you can actually give it a specific key. And why would you wanna give a partition key? It's so that specific kinds of events go inside a single partition. Because Kafka only guarantees that events are delivered in order, like in our application log,
08:21
inside a partition, not inside a whole topic. So for example, if you wanna make sure that all the user events related to a specific user go to the same partition, you pass in a user ID there. And that way you can make sure that every event related to a user shows up in order, so it doesn't look like they're clicking all around on the website out of order
08:41
because it's going to different partitions. And under the hood, Ruby Kafka is using a hashing function. It's gonna convert this into an integer and do a mod by the number of partitions. So as long as your partition count stays the same, you can guarantee that it always goes to the same place. It gets a little tricky if you start to increase your partitions.
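As a rough illustration of that key-to-partition mapping (as far as I know, ruby-kafka's default partitioner is a CRC32 hash of the key mod the partition count — treat the exact hash as an implementation detail), and of how you'd pass the key with Delivery Boy:

```ruby
require "zlib"

# Hypothetical helper mirroring the "hash the key, mod the partition count" idea.
def partition_for(partition_key, partition_count)
  Zlib.crc32(partition_key) % partition_count
end

partition_for("user-42", 2)  # => same partition every time, while there are 2 partitions
partition_for("user-42", 3)  # => may land somewhere else once the partition count changes

# With Delivery Boy, the key rides along with the event:
DeliveryBoy.deliver_async(
  { event: "button_clicked", user_id: 42 }.to_json,
  topic: "user_event",
  partition_key: "user-42"
)
```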
09:00
So now we're writing events to our user topic. That's really the only thing we need to do if you have a Kafka cluster running. You can also use another gem called Racecar. Same thing, a wrapper around Ruby Kafka, maintained by Zendesk. And again, very little upfront configuration, just this generator.
09:22
This config file will look familiar because it's pretty much the same. But this will also create a folder called consumers in your application too. And an event consumer subscribes to our user event topic with that subscribes_to method at the top. And this is just gonna print out any data that it returns.
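For reference, a consumer class in Racecar looks roughly like this (the class and topic names follow the example in the talk; check the Racecar README for the exact generator output):

```ruby
# app/consumers/user_event_consumer.rb
# Run it in its own process with: bundle exec racecar UserEventConsumer
class UserEventConsumer < Racecar::Consumer
  # Subscribe this consumer to the user_event topic.
  subscribes_to "user_event"

  # Called once per message; here we just print the raw value.
  def process(message)
    puts message.value
  end
end
```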
09:41
So we're gonna run that consumer code inside of its own process. So it's gonna be running separately from the process that runs our web application. And in order to consume events, Racecar is gonna create a group of consumers. It's gonna be a collection of one or more Kafka consumers. And it's gonna read off of that user event topic. And each consumer inside that consumer group
10:02
is gonna be assigned one or more partitions. And each of these is gonna keep track of where it is in these partitions using an offset, which is like a bookmark, a digital bookmark. And the best part is that when one consumer goes away, if it fails, its partitions are gonna get reassigned to other consumers.
10:20
So as long as you've got at least one consumer process running, you've got a good availability story. So we talked a little bit about what is Kafka under the hood. And let's talk more about the technical features that make it valuable. So one is that Kafka can handle extremely high throughput.
10:41
One of the key performance characteristics that I thought was pretty interesting is that Kafka does not use the message broker, the Kafka cluster itself, to track where all the consumers are. A traditional enterprise queuing system like AMQP is going to actually have
11:03
the event infrastructure itself keep track of all those consumers. And so as your number of consumers scales up, your event infrastructure is actually going to have more load on it, because it's gonna have to track larger and larger state. And another thing is that consumer acknowledgment is actually not trivial.
11:21
It's not easy for the broker itself to know where the consumers are. Because do you mark that message as processed as soon as it gets sent over the network? Well, if you do and the consumer is down and it can't process it, how does the consumer say, hey event infrastructure, actually can you resend that? You could try a multi-acknowledgement process like with TCP,
11:42
but it's gonna add performance overhead. So how does Kafka get around this? It just pushes all of that work out to the consumer itself. The consumer service is in charge of telling like where am I, what bookmark am I at in that ordered commit log. And so this is kind of cool because it means that reading and writing event data
12:02
is constant time, O of one. So the consumer either knows that it has an offset, it says give me this specific place in the log or I wanna start at the very beginning and read all the events or I wanna start at the very end. And so there's no scanning over large sets of data to figure out where it needs to be.
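With ruby-kafka's lower-level API, that choice of "start from my offset, the beginning, or the end" is explicit when you consume. A hedged sketch, assuming a local broker and the topic from earlier:

```ruby
require "kafka"

kafka = Kafka.new(["localhost:9092"], client_id: "my-consumer")

# Read the topic from the very beginning; pass start_from_beginning: false to start
# at the end and only see new events. A consumer group would instead resume from
# its own stored offset — its bookmark.
kafka.each_message(topic: "user_event", start_from_beginning: true) do |message|
  puts message.offset
  puts message.value
end
```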
12:21
And so the more data that exists, it doesn't matter. It doesn't change the amount of time to look up. And so Kafka is gonna perform similarly whether you have a very small amount of data in your topics or a large amount of data. And it also runs as a cluster on one or more servers. It could be scaled out horizontally by adding more machines, extremely reliable and data is written to disk
12:41
and then replicated across multiple brokers. So it has this like scalable and fault tolerant profile. So if you'd like to know more about like actual data distribution and replication side of Kafka, this is a pretty good blog post. And by the way, the slides will be on the website and I'll tweet them out at the end. So don't bother like trying to write down these links
13:02
cause that would be a little complicated. And so to give you a more concrete number around like what scalable looks like, Netflix, LinkedIn and Microsoft are literally sending over a trillion messages per day through their Kafka clusters. So Kafka is cool, it's awesome.
13:21
It can get data from one place to another, but we're not in a data science track, we're in a track about services. So why do you care about this technology in the context of services? So some of the properties that make Kafka valuable for event pipeline systems are also gonna make it a pretty interesting fault tolerant replacement for RPC between services.
13:44
If you aren't familiar with the term RPC, there's a lot of different like arguments about what it means but it means remote procedure call and TLDR, it's a shorthand for one service talking to another service using a request like an API call. So what does this mean in practice?
14:02
Let's go back to our example of the e-commerce site. User is gonna place an order. It's gonna hit this create API in our order service and when that happens, you wanna create an order record, charge the credit card, send out a confirmation email. And so in a monolithic system, this is gonna like usually be like this method call.
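Roughly what that monolithic create path might look like (the service and mailer names here are hypothetical, not from the talk):

```ruby
class OrdersController < ApplicationController
  def create
    # Everything happens inline, in one blocking request cycle.
    order = Order.create!(order_params)
    PaymentService.charge!(order)                          # charge the credit card
    ConfirmationMailer.order_created(order).deliver_later  # confirmation email, maybe via Sidekiq
    render json: order, status: :created
  end
end
```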
14:20
It's gonna have blocking execution; you might be using Sidekiq or something to handle sending the email. But as your system grows more and more complex, you might start extracting these out to their own services, and you can use RPC, remote procedure calls, to talk between these services. So what are some of the challenges that you might find with using RPC?
14:40
And so one is that the upstream service is responsible for the downstream service's availability. If the email system is having a really bad day, the upstream service is responsible for knowing whether or not that email service is available and, if not, retrying any failing requests or failing altogether.
15:01
So how might an event stream help in this situation? So in this event oriented world, an upstream service, our orders API, is gonna write an event to Kafka saying that an order was created. Because Kafka has this at least once guarantee, it means that that event is gonna be written to Kafka at least once and will be available
15:21
for a downstream consumer to consume. So with our email services down, that event is still there, that request is still there. And when the downstream consumer comes back online, it can pick back up, it can use its bookmark and say, I need to pick back up from XYZ and continue to process those events in order.
15:42
So another challenge that larger and larger organizations find is coordination. In increasingly complex systems, integrating a new downstream service means a change to an upstream service. So if you'd like to integrate a new fulfillment provider, for example, it's gonna kick off a fulfillment process when an order is created.
16:01
In an RPC world, you need to change that upstream service to make an API call out to your new fulfillment service. In an event oriented world, you would just add a new consumer inside the fulfillment service that consumes the order created event topic. So there's some upsides.
16:20
What's the big downside? In our first example, the dependency between those services is super clear. Like you look at that method and you see, oh, this, this, this, and this. The upstream service knows exactly which downstream services depend on knowing that the order was created. By abstracting away that connection, you gain speed, you gain independence,
16:40
but you do sacrifice clarity. So you got a Kafka, woo. You're comfortable with the trade-offs, you understand what you're getting into. You'd like to start incorporating events into your service oriented architecture. So what might this look like from an architectural perspective? Martin Fowler discusses how, when people
17:01
talk about event-driven applications, they can actually be talking about incredibly different kinds of applications. And I found this personally to be true even inside of Heroku. And it can be a big pain point when you're having a discussion about challenges, about trade-offs in these kind of systems. So he's trying to bring a shared understanding
17:22
to what an event-driven system is. And he started outlining a few architectural patterns that he sees most frequently. So I'm gonna kind of zoom through this to cover these. But you can learn more on his website or, even better, he gave a keynote at GOTO Chicago that really covers this in depth.
17:40
So the first pattern he talks about is an event notification pattern. This is the absolute bare minimum, simplest event-driven architecture. One service simply notifies the downstream services that an event happened. The event has very little information. It's just saying an order was created at this time.
18:01
If the downstream service needs more information about what happens, it's gonna need to make a network call back up to the order service to fill that information out. The second pattern he talks about is event-carried state transfer. And in this pattern, the upstream is actually gonna augment that event with additional information
18:23
so that your downstream consumer can keep a local copy of that data and not have to do that network call. And it can be actually super straightforward when everything that the downstream services needs is encapsulated inside that event generated by the upstream service, which is awesome. Except one of the challenges here
18:41
is that you might need data from multiple systems. So for example, our order service is gonna create an order, write the event. Fulfillment service is gonna consume that event. And the fulfillment service is gonna need certain details about the order, which is totally fine because we have that order information, and it's gonna be passed along inside the event.
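To make the two patterns concrete, here's roughly what the difference in payloads might look like (field names are illustrative, not from the talk; you'd serialize either hash to JSON before delivering it):

```ruby
# Event notification: bare minimum — downstream services call back upstream for details.
notification_event = {
  event:      "order_created",
  order_id:   123,
  created_at: "2018-04-17T10:00:00Z"
}

# Event-carried state transfer: the event carries the state downstream services need,
# so they can keep a local copy instead of calling back upstream.
state_transfer_event = {
  event:      "order_created",
  order_id:   123,
  created_at: "2018-04-17T10:00:00Z",
  line_items: [{ sku: "ABC-1", quantity: 2, price_cents: 1999 }],
  totals:     { subtotal_cents: 3998, tax_cents: 320 }
}
```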
19:00
But if it also needs to talk to a customer service, for example, to know who to send a package to, it's gonna either need to make a network call to actually retrieve that information, like we saw earlier, or it needs to find a way to persist a local copy of the data that it needs. And the idea is that you can also consider building
19:23
that local copy off of events that the customer service is gonna write. And so the fulfillment service is gonna consume a separate set of events from your customer service that it can then persist locally and join it inside of its own database.
19:42
A third pattern that Fowler talks about is event sourced architecture. This takes the idea of an event-driven system like even further. And it's saying that not only is each piece of communication between your services kicked off by an event, but it says that by storing every single event
20:02
or storing a representation of every single event and replaying all these events, you could drop all your databases, completely rebuild the state of your application as it exists in this moment just by replaying that event stream. So Fowler talks about an audit log being a good use for this scenario.
20:21
He also talks about this really interesting high-performance trading system, which is not really something that's relevant to my interests but might be relevant to yours. But having an audit log of everything that happens and actually relying on it to rebuild your application state are two very different levels of technical commitment,
20:42
like seriously different. There are gonna be additional challenges that come into play when you're looking to be able to recapture this state of the world. So one is code changes. So I worked on a payment system in the past and a big challenge that we had was if you tried to recalculate prior financial statements,
21:02
they depended on business logic that lived inside the code, hard-coded in the past, and that had changed. And so if you tried to recalculate all that money that you needed to send out to users based on the state of the world that day, you might actually get different values, and that's a problem. And that was not an event-driven system, so this is an issue even in regular systems,
21:24
but it's something that you really need to understand when you're relying on the ability to rebuild your state of the world from events. And in a similar fashion, another challenge is that that state of the world might not be the same for third-party integrations. Like that might have changed,
21:41
and an API callout to a third-party provider might return a different set of data, or even an internal provider. Fowler talks about some more strategies for handling these issues. It's not like an insurmountable task, but it is a big deal. And the final pattern that Fowler talks about is command query responsibility segregation,
22:01
which if you've ever talked to anybody about event-driven architectures, you might, like I would say 50% of the time, they think you're talking about this, and you may or may not be talking about this. So you might be familiar with CRUD, and that's a way of encapsulating logic for creating, reading, updating,
22:20
deleting a record in a database inside your service. And it's at the heart of Rails controllers, like this is very Rails-y. But the idea of CQRS is that instead of thinking about all those things as existing in the same domain, that you can actually split them out. So the service that writes to your system,
22:41
and the service that reads from your system, are actually split. And at its simplest, one service is responsible for writing orders, so any method or API call that feels like an order.create, an order.update taxes, like whatever, that's gonna go into your order updater service.
23:00
It's gonna go one direction. And reading methods, or API calls like getOrderByID, those are gonna live in another service. And it's not just an event-oriented architecture thing. You can do this with API calls. But you'll often see event-oriented architectures. You'll see event systems nestled in the diagrams,
23:23
usually at the place where commands are actually written. The writer service, this command handler, is gonna read off of the event stream, process these commands, store them to a write database, and then any queries are gonna happen against a read-only database. So when would you wanna use it? CRUD is much simpler.
23:42
Separating out read and write logic into two different services really does add a lot of overhead. The biggest reason I think people use it is performance. So if you're in a system where you have a large difference between reads and writes, or a system where your writes are super, super slow compared to your reads, you can optimize performance
24:02
separately for those systems. But it's an increase in complexity, so you really need to understand the trade-offs. I found this blog post to be pretty good, succinct, like the title says, when to use and not use CQRS. And I would also, if you're thinking about implementing any of these patterns in your system,
24:20
take a look at Fowler's blog post and watch his keynote to get a really deep dive into the trade-offs and considerations for these. So we talked about what Kafka is, talked about how this technology can change the way your services communicate with each other. So next let's talk about two practical considerations
24:40
that you might run into when integrating events into your Rails applications. It's definitely not a comprehensive list, but it was two things that really surprised me when I got started. So the first thing to consider are slow consumers. The most important thing to keep in mind in an event-driven system is that your service needs to be able to process events as quickly
25:02
as the upstream service produces them. Otherwise you're gonna drift further and further behind, and you may not realize it because you're not seeing timeouts, you're not seeing API calls fail, you're just slower, you're just at a different state of the world than everybody upstream.
25:21
And also you might start to see timeouts; one place where you will see timeouts is gonna be on your socket connection with the Kafka brokers. If you're not processing events fast enough and completing that round trip, your socket connection can time out, and then reestablishing it actually has a time cost, it's expensive to create those sockets,
25:41
so that can add latency to an already slower system making things worse. So if your consumer's slow, how do you speed it up? So there's gonna be a lot of talks at RailsConf today and RailsConf prior about performance in Rails. But a more Kafka-specific example is that you can increase the number of consumers
26:01
in your consumer group so that you can process more events in parallel. So what does that look like in practice? So you remember in Racecar you run a consumer process by passing in a class name like user event consumer. If you're using a Procfile for something like Foreman or on Heroku, you can actually start Racecar with the same class multiple times,
26:21
and each of those processes is gonna automatically be an individual consumer that joins the same consumer group. And so you want at least two of these consumer processes running, so if one goes down, its partitions are gonna be reassigned to the other. But it also means that you can parallelize work across as many consumers as you have topic partitions.
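For example, a Procfile that runs two copies of the same consumer — each process joins the same consumer group and gets its own share of the topic's partitions (the process names are arbitrary):

```
user_event_consumer_1: bundle exec racecar UserEventConsumer
user_event_consumer_2: bundle exec racecar UserEventConsumer
```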
26:44
So of course with any scaling issue you can't just add consumers forever. Eventually you're gonna hit scaling limits on shared resources like databases. So you just can't scale forever, but it will give you a little bit of wiggle room. And if you don't actually care about the strict ordering inside those topic partitions, you can use a queuing system like Sidekiq to help manage work in parallel and take a little bit of pressure off of the system.
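A sketch of that hand-off, assuming Sidekiq and a hypothetical job class — the consumer just enqueues and moves on, and you give up strict per-partition ordering:

```ruby
class UserEventConsumer < Racecar::Consumer
  subscribes_to "user_event"

  def process(message)
    # Enqueue the slow work; the consumer loop stays fast and keeps its socket alive.
    UserEventJob.perform_async(message.value)
  end
end

class UserEventJob
  include Sidekiq::Worker

  def perform(raw_event)
    event = JSON.parse(raw_event)
    # ... the actual slow processing happens here, in parallel across Sidekiq workers ...
  end
end
```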
27:01
And it's extremely valuable in this case to have metrics and alerting, like paging, around how far behind you are
27:20
from when an event was added to the queue. So Ruby Kafka is instrumented with ActiveSupport notifications, but it also has StatsD and Datadog reporters that are automatically included. And you wanna know if you're drifting a certain amount behind when the events are added. And the Ruby Kafka documentation,
27:40
which is also super good, has a lot of suggestions for what you should be monitoring in your real systems. And going one step further, on my team before we put a new service into production, because we use Kafka for a lot of our services, I will run game day failure scenarios on staging but consuming off the production stream where we'll move our offset back by a certain amount
28:02
to pretend like our service has been down for a day or two and see how long it takes us to catch up. Because if you have a serious outage on your downstream service, you're gonna wanna know how much wiggle room you have and talk about what failure modes are acceptable for your service. Because if you're running an order service, for example,
28:20
you might wanna optimize for finishing those orders no matter what. It might take you four hours, it might take you a day, but you want it to be done. But if you're running a chatbot service, users are gonna be extremely confused if four hours later, they start seeing things appear inside of their chat. So you'll wanna talk about what that means to process data at a delay.
28:42
And also, a big thing in Kafka to talk about is this exactly once versus at least once delivery. You can actually design a Kafka system to have an exactly once guarantee with newer versions of Kafka. It means that consumers can assume that a message in a Kafka queue has only been sent once.
29:00
There's never a chance that a message will not be sent at all, or that it'll be sent more than once. And Confluent has a blog that talks a little bit more about this, but the reality is with Ruby Kafka, you should assume that your messages will be delivered at least once. So it's gonna be there, but you might see it more than once. So that's really important.
29:21
And one thing that this means is that you need to design consumers to expect duplicated events. So you can either rely on your database and use like upsert for idempotency or... Yes, thank you so much. Got the dance in too.
29:41
All right, so design your consumers for failure. Rely on upsert to lean on your database, or you can include a unique identifier in each event and just skip events that you've already seen before. Either of those things is gonna make your application more resilient in the face of failure, because you can move back an offset and replay events and not be worried that you're gonna duplicate the same work twice.
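A hedged sketch of both approaches in a Racecar consumer — the upsert variant assumes ActiveRecord's upsert (Rails 6+) or an equivalent ON CONFLICT insert, and all the model and field names are hypothetical:

```ruby
class OrderCreatedConsumer < Racecar::Consumer
  subscribes_to "order_created"

  def process(message)
    event = JSON.parse(message.value)

    # Option 1: skip events we've already seen, using a unique id carried in each event.
    return if ProcessedEvent.exists?(event_id: event["event_id"])

    # Option 2: lean on the database — an upsert keyed on a unique column is safe
    # to replay any number of times. (Either option on its own would do.)
    Order.upsert(
      { external_id: event["order_id"], total_cents: event["total_cents"] },
      unique_by: :external_id
    )

    ProcessedEvent.create!(event_id: event["event_id"])
  end
end
```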
30:01
And the second thing that was really surprising to me was Kafka's very permissive attitude towards data. You can send anything in bytes
30:22
and it's gonna send it back out. It's not gonna do any verification. This feature makes it extremely flexible because you don't need to adopt a specific format, like a serialization format to get started. Like we talked about before, freedom isn't free, freedom is a blessing and a curse. What happens when a service upstream decides to change an event that it produces?
30:42
If you just change that event payload, there's a really good chance that one of your downstream consumers is gonna break and your colleagues are gonna be really upset. And they're gonna start seeing exceptions, could take down their service, but you have no idea because you didn't really know that anybody was using that data. So before you start using events in your architecture,
31:02
choose a data format, preferably one, not multiple ones, and evaluate how that data format can help you register schemas and evolve them over time. It's an issue in RPC systems as well, but the explicitness of those calls means that somebody changing an upstream service
31:21
has a built-in method for being like, oh wait, somebody might actually be using this information, especially when that's like an outgoing event. Schema registries can provide a centralized view of schemas that are in use in your system. This is also, it's just like much easier
31:40
to think about validation, about evolution of schemas before you actually do all of this event-driven architecture than afterwards. And there are a lot of different data formats, but I'm gonna talk about two, JSON, which most people in this room are probably familiar with, it's the format of the web,
32:00
and Avro, which might be new to some of you, and I know there are probably at least two Australians in the audience. There's like a 75% chance I'm gonna slip up on the next two slides and call it Arvo, but I do not mean afternoon, I mean Avro, so just ignore me. So JSON's a pretty common data format. Pros, human readable, super nice.
32:21
As a human, I really appreciate that. And there's also JSON support in basically every language. But there are a couple cons. The payloads, like the actual size of JSON payloads can be really large. It requires you to send keys along with the values, which is nice because it's flexible, so it doesn't matter what the order is in. But it's also the same for every payload.
32:43
So when you send them over, it can mean that those payloads are larger than other formats, so if size is a big concern for you, JSON could be a bad fit. And there's also no built-in documentation inside that payload, so you just see a value, but you don't actually know what it means.
33:00
And schema evolution is challenging. There's no built-in support for aliasing one key to another if you wanna rename a field, or sending a default key value so that you can evolve over time. So Confluent, they're the team that built Apache Kafka, and they now have a hosted Kafka product. They recommend using Avro.
33:22
Like many other protocol formats, it's not human-readable over the wire; you have to serialize and deserialize it, because it's sent in binary. But the upside is that there is super robust schema support. A full Avro object is sent over, and it includes this idea of the schema and the data,
33:41
so it has support for types, it's got primitive ones like ints, complex ones like dates, and it includes embedded documentation, so you can actually say what each of those fields actually do or mean in your system. And it has some built-in tools that help you evolve your schemas in backwards compatible ways over time, so that's pretty cool.
34:02
Avro Builder is a gem created by Salsify, and that actually creates a very Ruby-ish DSL that helps you create these schemas. So if you're curious to learn more about Avro, the Salsify engineering blog here has a really great write-up on Avro and Avro Builder.
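For a taste of what that DSL looks like, here's a rough sketch based on avro-builder's documented style (the record and field names are made up; double-check the exact API against the Salsify docs):

```ruby
require "avro/builder"

# Builds the JSON for an Avro schema, with types and embedded documentation per field.
schema_json = Avro::Builder.build do
  record :user_event, namespace: "com.example" do
    required :user_id, :long,   doc: "ID of the user who triggered the event"
    required :action,  :string, doc: "What the user did, e.g. page_view"
    optional :path,    :string, doc: "Path visited, when applicable"
  end
end
```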
34:21
So that is it for my talk today. I have a couple more slides though, so don't start leaving, because I always feel really weird when people are like, and thank you, and then they have a couple more slides. So if you're curious to learn more actually about Heroku and how we run our hosted Kafka product, or how we use it internally,
34:40
we do have two talks that my coworkers, Jeff and Paul, have given that you can watch. And if you're interested in Postgres, related to Kafka but different, my coworker Gabe is giving a talk, like literally in the next talk spot in room 306, about Postgres 10 performance and you. And then finally, I will be at the Heroku booth today
35:01
after this, and like during lunch, if you wanna come by and say hi, you can ask me questions, you can get swag, we've got t-shirts and socks, and we also have some jobs, you know, the ush. So yeah, thank you so much.