Kafka Monitoring: What Matters!
Formal Metadata

Title: Kafka Monitoring: What Matters!
Number of Parts: 56
License: CC Attribution 3.0 Unported: You may use, adapt, copy, distribute, and transmit the work or content in changed or unchanged form for any legal purpose, as long as the author/rights holder is attributed in the manner specified by them.
Identifier: 10.5446/67178 (DOI)
Series: Berlin Buzzwords 2022, Part 26 / 56
Transcript: English (automatically generated)
00:08
Hi, good afternoon, everyone. It's good to be back at Buzzwords in the city of Berlin in person. I extend my welcome to everyone here this week. My name is Amrit Sarkar, and in this session, we will try to form a holistic view of monitoring
00:22
an Apache Kafka cluster, running on multiple servers, taking care of real-time messaging use cases. I work as a software engineer in India, and this is my first year looking at Kafka through an observability lens, rather than building or improving use cases around search, which was my main focus for the last seven-plus years.
00:45
So we start our discussion by spending the first few minutes understanding how Kafka works internally. It is important to understand the core fundamentals of the tools being used, to know where to look when something goes wrong. We will understand the need to go beyond monitoring options and adopt an observability mindset.
01:03
Further, we focus on the main performance areas around Kafka, and classify and dissect the metrics available. We learn how to interpret these metrics to observe our cluster much better. In the later part, we focus on drawing inferences from consumer lag, a very vital performance gate for evaluating overall Kafka performance in itself.
01:23
And we end this talk by looking at what trends we need to be aware of to take functional and technical decisions for Apache Kafka. All the learnings discussed come from observing Kafka in our serving services over the last year, and I welcome experienced Kafka practitioners to share their insights at the end of the session.
01:45
So a messaging system is responsible for transferring data from one application to another, so the applications can focus on the data and not worry about how to share it. Apache Kafka is a distributed publish-subscribe messaging system, a robust queue
02:01
that can handle a high volume of data and enables you to pass messages from one endpoint to another. In a publish-subscribe system, messages are persisted such that anyone can subscribe to consume and process them based on their use case. Kafka is suitable for both offline and online message consumption.
02:21
It supports low-latency message delivery, is highly available, and gives fault-tolerance guarantees in the presence of machine failures. Now Kafka can be used for a number of use cases, like monitoring data, aggregating statistics from distributed applications to produce centralized feeds. It can be used across an organization
02:41
to collect logs from multiple services. Kafka's strong durability is very useful in the context of stream processing, where popular frameworks like Spark and Storm use Kafka as a middle layer for processing incoming data. Kafka in itself is a unified platform for handling all real-time data feeds.
03:01
Now producers are responsible for writing data, in the form of messages, to the Kafka cluster. A Kafka cluster comprises a number of Kafka servers called brokers, where messages are persisted in a distributed and replicated manner. We will go into this in detail later. Producers write to unique logical queues called Kafka topics on the brokers.
03:24
Consumers subscribe to these topics to read the incoming messages. Until the Kafka 3.x versions, brokers used ZooKeeper as a coordination tool to manage and share cluster state. In the newer versions, ZooKeeper is deprecated and will be removed in the future.
03:48
So Kafka topics, as mentioned, are logical queues which can further be partitioned to achieve concurrency. These partitions are then consumed by one or more consumer groups. Each consumer group comprises multiple consumers
04:03
responsible for reading records from one or more partitions in the topic. We call each message entity in the partition a record. In this case, we have three partitions for a topic and, respectively, a consumer group with three consumers taking care of one partition each.
04:21
The goal for a consumer group is to read all records of all partitions for this particular topic. Now a unique integer offset value is associated with each record at each partition here, zero, one, two, three and so on. Consumers keep track of this offset in a persistent manner to read the next record.
04:42
When a message is written to the topic, based on internal criteria, basically a hashing algorithm, the new record or message is written to one of these partitions. Producers can also write a message or record to a specific partition.
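The hashing step described above can be sketched in a few lines. This is an illustrative stand-in, not Kafka's actual partitioner: the Java client's default partitioner uses murmur2 on the serialized key bytes, and md5 is substituted here purely as a deterministic placeholder.

```python
# Illustrative sketch of key-based partition selection.
# Kafka's default partitioner uses murmur2 on the serialized key;
# md5 is substituted here purely as a deterministic stand-in.
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition deterministically."""
    digest = hashlib.md5(key).digest()
    value = int.from_bytes(digest[:4], "big")  # first 4 bytes as an unsigned int
    return value % num_partitions

# The same key always lands on the same partition, which is what
# preserves per-key ordering inside a partition.
assert choose_partition(b"user-42", 3) == choose_partition(b"user-42", 3)
```

Records without a key are instead spread across partitions by the client (round-robin or, in newer clients, sticky batching).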
05:01
Now what happens if a consumer in the group fails or is removed by the user? In this case, the consumer for partition two has been removed from the group. Internally, one of the brokers acts as group coordinator and exchanges heartbeats with each of these consumers to know their active status.
05:24
Now here, one of the active, existing consumers will take responsibility for consuming the remaining records from partition two. Since the offsets per partition are tracked at the group level, no loss of data is encountered. And a group consumer leader, elected by the group coordinator, is responsible for this
05:41
when consumers are added or removed. Now, I understand I'm name-dropping a bunch of terminology, and some of us may not be very familiar with Kafka, but getting to know these nuances will really help us understand what to monitor and what the metrics essentially represent. So in a Kafka cluster, there are multiple brokers
06:02
running on servers providing durability along with high availability to the topic. In this case, we have two brokers, broker one and broker two, and topics are divided and replicated in the following manner. We have topic T1, which has two partitions, P0 and P1.
06:20
Each partition has two replicas, R1 and R2. Broker one contains one set of topic, while the broker two contains the other set. For each partitions, there will be a leader replica which will be responsible for both read and write. Producers write messages to topic T1.
06:41
For partition zero, they will be first written to replica R1 on the top and also consumed from R1 replica only. Internally, broker will further propagate these messages to follower replicas for a partition. For partition P1, this role will be played by a replica R2 in broker two.
07:00
Using a single replica for both reads and writes ensures consistency at all levels. Taking another example, for topic T2 we have a single partition with two replicas, and the leader resides on broker one. With similar configuration for topic T3, the leader resides on broker two.
07:23
Now with so many dynamic pieces, one of the brokers takes up the role of the controller who performs various administrative tasks. All the create, update, delete topic requests are processed by the controller broker. If a broker which is not a controller fails,
07:42
a few of the leader partition replicas may become unavailable. The controller will reassign this leadership role to other in-sync follower replicas, and it keeps constant track of the replica state for further decisions. The controller acts as the brain of the cluster for partition leadership management.
08:01
I have abstracted away a great deal of internal and design details, which make Kafka probably one of the best maintained and improved open source technologies. There is a lot of work being done lately with the KRaft protocol implemented in brokers, phasing out ZooKeeper's role in cluster management and leader election. Please check out Confluent or Kafka Summit talks
08:21
to know them better in detail. Okay, so now, before moving on to the nitty-gritty of the metrics available, we need to first summarize the performance areas one can improve while running multiple Kafka pipelines. Throughput and latency go side by side in distributed systems, and so it is with Kafka.
08:41
There needs to be a consistent focus on improving or optimizing the producer and consumer rates with finite resources. Having clarity on the maximum possible throughput helps determine batching while streaming data into the cluster, while consumers on the right-hand side must be able to read and process without being compromised or falling too far behind.
09:03
A consumer being behind the producer is denoted by consumer lag, and the lower the absolute value of the lag, the better. Another principle we need to ensure is data integrity. Kafka provides options to enable durability on writes and reads, which makes it really convenient
09:21
to fit the desired throughput, although there is some latency involved with that durability. Producers can push messages to a broker and either wait for a single acknowledgement from the leader replica, or an acknowledgement from all the replicas, or not wait for an acknowledgement at all. The more acknowledgements needed, the higher the write latencies will be,
09:41
but it ensures the best durability too. On the contrary, the production rate or throughput will be high if messages are pushed to brokers and we move on to the next batch. One should have correct expectations from the cluster based on the level of durability adopted. Now, proper distribution of brokers across servers,
10:01
racks, and regions with replication ensures no business impact on failures. Careful consideration needs to be taken to formulate the cluster state configuration while setting up a Kafka cluster for a testing or production environment. And finally, we look at the raw resource usage at the server, storage, and network level.
10:22
Both CPU's and memory work best in terms of performance and cost when they are neither overutilized or underutilized. Determining that sweet spot for a cluster can take a while, but performing load testing really helps us to get there. In terms of storage or network bandwidth, adopting best in class components
10:40
helps avoid unnecessary hygiene-based problems. Now, before moving on to classifying the performance areas around the Kafka components, let's try to understand why we need to go beyond bare monitoring solutions and build frameworks to observe our system.
11:02
Monitoring typically provides a limited view of system data, focused on individual metrics. This approach is sufficient when we know where the system fails. And monitoring tends to focus on indicators such as throughput or utilization rates, which indicate overall system performance.
11:21
As applications become much more complex, so do their failure modes. And it is often not possible to predict how distributed applications will fail. By making a system observable, you can understand the internal state of the system and from that you can determine what is working and what is not and why it is not working.
11:41
Apache Kafka in itself does not report problems; it only reports metrics. One of the elements we discussed in the basics was the controller broker, and there is a metric called active controller count, which must be one at all times. If there are more controllers, it will lead to a split-brain problem. Alerting, which is part of monitoring,
12:02
can tell us there is an issue, but good observability can help us determine the root cause, like servers restarting, ZooKeeper failing, network packets dropping, et cetera, which can potentially lead to the multiple-controllers problem.
12:22
Now each component in the Kafka cluster from left to right act as a performance gate in the system. We start with producers, throughput of pushing messages must be tracked, abnormal spikes must be investigated. It can be uptrend or a downtrend.
12:40
When it comes to the internal components within Kafka, keeping tabs on the topics held within the brokers is desired. If one broker or a set of brokers goes away, having oversight of which topics become unavailable for pushing messages is important. As multiple brokers are part of a cluster hosting tens of thousands of partitions,
13:00
stable distribution of these partitions across servers based on server resources is recommended. So keeping track of number of partitions we've hosted along with bytes in or out of each broker helps us observe load distribution. In terms of topic, going one level further within the brokers,
13:21
visualizing each individual partition health in terms of availability via replication, leader partition replicas bytes in and out becomes a potential performance gate. Though by default messages are equally distributed across partitions, users can direct messages to specific partitions via custom routing and hence unstable partition load distribution can be formed
13:43
and thus needs to be in check. Consumer rate similar to producers must be tracked along with how much a consumer is lagging behind. Lag in absolute and relative terms is the single most important metric to be tracked to determine Kafka cluster performance as a messaging system.
14:02
And finally, we look at the elements binding the systems together. Network latencies between servers on one or more regions, CPU, memory, input, output, spikes contributing to distributed system performance as a whole.
14:20
Now there are a bunch of ways to get Kafka and underlying infrastructure metrics into a time series format to visualize and take actions on top of them. For this talk, we have taken the popular route of exposing the JMX metrics of each broker to a Prometheus server and visualizing them in Grafana. The ZooKeeper ensemble, if still being used, can also export its metrics to Prometheus via a built-in exporter.
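As a sketch of that setup, a minimal Prometheus scrape configuration might look like the following. The job names, hostnames, and ports are illustrative assumptions, not values from the talk; the JMX exporter Java agent is commonly run on a port such as 7071, and ZooKeeper 3.6+ ships a built-in Prometheus metrics provider, by default on port 7000.

```yaml
# Hypothetical scrape jobs -- adjust targets and ports to your deployment.
scrape_configs:
  - job_name: kafka-brokers
    static_configs:
      - targets: ['broker-1:7071', 'broker-2:7071']   # JMX exporter javaagent port
  - job_name: zookeeper
    static_configs:
      - targets: ['zk-1:7000', 'zk-2:7000', 'zk-3:7000']  # ZooKeeper's built-in Prometheus provider
```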
14:44
Now, within Kafka's rich, vibrant open source ecosystem, administrative monitoring, alerting, and observability frameworks, both paid and free, are available with detailed documentation. Some popular ones are Confluent Control Center, Kafdrop, Yahoo Kafka Manager, Cruise Control from LinkedIn, and a bunch of other tools mentioned here.
15:02
Please feel free to add more good ones recently being introduced. Okay, so we start with the producer rate. Now there are a number of ways you can push data to Kafka against a topic. One such function is send function, which is available in a Kafka Java client.
15:21
When successfully processed, it returns a record metadata object. This object has an offset long value representing records position per topic partition combination. It's an absolute position in the partition itself. If you push this offset value through instrumentation to Prometheus and run one minute rate on top of it and visualize,
15:43
we can start seeing patterns on the time schedules. In this case, we are seeing a decent spike in the incoming traffic around midnight, around 00.00. There are other ways to determine the latest current offset value at the broker level. Broker API itself also exposes this offset value
16:00
for each topic-partition combination. Now, there are four key metrics in the Kafka system for which we can expect an alert as soon as something goes wrong. Starting with ACC, the active controller count, which, if more than one, can result in executing
16:21
cluster administrative commands without synchronization and consistency, and ultimately your entire Kafka cluster ends up in a total failure state. There is a dedicated JMX metric for it, and as we learned together, the value of this metric must be one, or else an alert must be raised. We further move on to making sure all the partitions
16:44
for all the topics in the cluster are available and not offline. There is a dedicated metric which represents the partitions that are offline, irrespective of topic. And this singular metric value must be zero at all times, or we should be alerted. These offline partitions are neither available
17:01
for read nor write, and are generally caused by cascading failures on the servers. The JMX metric name is offline partition count. Now, when we establish replication for partitions, we also have to set a minimum value for the ISR, the in-sync replicas for the partition.
17:23
Kafka brokers return an acknowledgement for a successful write of a batch only if the data is replicated across a minimum number of replicas. If at any point in time these in-sync replicas number fewer than what is configured, an alert must be raised,
17:41
since producer writes will be halted until resolution. The JMX metric available in Kafka is under-min-ISR partition count, and as the name suggests, it denotes the number of partitions unavailable in the discussed situation. In ideal cases, the value should always be zero for under-replicated partitions as well.
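These alert conditions can be expressed as Prometheus alerting rules. This is a hedged sketch: the metric names below depend entirely on how the JMX exporter is configured to relabel Kafka's MBeans, so treat them as placeholders to be matched against your own exporter output.

```yaml
# Illustrative alerting rules -- metric names are placeholders that must
# match your JMX-exporter relabeling configuration.
groups:
  - name: kafka-core-health
    rules:
      # Exactly one active controller cluster-wide.
      - alert: KafkaAbnormalControllerCount
        expr: sum(kafka_controller_activecontrollercount) != 1
        for: 1m
      # No partition may be offline.
      - alert: KafkaOfflinePartitions
        expr: sum(kafka_controller_offlinepartitionscount) > 0
        for: 1m
      # No partition may fall below its configured min.insync.replicas.
      - alert: KafkaUnderMinIsrPartitions
        expr: sum(kafka_cluster_partition_underminisr) > 0
        for: 1m
      # Under-replicated partitions signal vulnerability to data loss.
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_underreplicatedpartitions) > 0
        for: 5m
```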
18:02
If we have a nonzero value for under-replicated partitions, that means we are more vulnerable to data loss if there is a leader failure. Now, we discussed some very definite metrics here, for which an alert is raised if the expected constant value is not fetched.
18:20
Further on an analysis front, keeping track of consumer lag per partition, per topic provides us understanding of how the streaming in Kafka is performing. We will discuss the consumer lag in detail with observability eyes later. Now, static values on Grafana
18:41
are best denoted as counter metrics, with a potential alert manager sending the necessary notifications for both good and bad states. Metrics are pushed with a bunch of labels and thus can be used to plot dynamic counter widgets at the topic level for verbose visualization. We have the ACC, active controller count, at one
19:01
and zero offline partition counts for both the brokers. There are no minimum under in sync replica partitions across 10 plus topics hosted on this cluster. This is typically a very happy cluster and something goes wrong, the green will become red.
19:23
Okay, moving on, digging further into the brokers' health metrics. We start with the load distribution among brokers hosting partitions of different topics. In most of the use cases around a Kafka cluster, the distribution of partitions of all types of topics,
19:42
batch or streaming, is done almost equally among the brokers. There can be specific business cases where data needs to be written to a specific partition based on proximity, regions, design decisions, et cetera. Keeping a check on the number of partitions per broker helps to avoid heavy load at the broker level.
20:00
Though it is very subjective to the actual bytes in and out of a broker. A JMX metric named partition count is available for this. With actual bytes in and out, having a holistic look at network behavior, successes and errors, helps indicate the true load skewness if it exists,
20:20
Along with foreseeable issues with general data center setup, including hardware. Network requests per second and error rate per second are tracked in a time series format. Now each broker has a log file per partition per topic
20:41
where new data gets appended on disk. These cache-based writes are flushed to disk in a certain format based on Kafka's internal factors, and they are performed asynchronously so as to optimize performance and durability. The longer it takes to flush that log to disk, the pipeline, which is producer to consumer,
21:01
backs up more, and the latency can get worse. When the latency goes up, even a small increment of 10 milliseconds can balloon and lead to the under-replicated partitions we discussed on the previous slide. If the latency is going high, it points towards adopting better hardware or scaling.
21:22
We have a dedicated JMX metric to see the log flush latency. Now, fetcher lag: the lag in messages per follower replica is aggregated per topic partition at the broker level. This allows monitoring the broker's capability to keep replicas in sync within the partition.
21:43
The expected value for this metric should be zero or very low, indicating the broker is able to keep up with the replication. If the metric is increasing, it indicates a problem. It probably can be a high producer rate or a fundamental server problem.
22:03
Now, visualizing these metrics together gives a much better idea. Here, clearly broker zero, which is hosting 68 partitions in total, is doing much more work than broker one, holding only 15. The load skewness is evident in the network requests per second in the top left corner,
22:23
and it is also evident in the log flush rate, since it is taking care of more partitions. Even though broker zero is hosting four times more partitions than broker one, the log flush time, in the bottom left corner, is not affected, and both brokers are on par with each other.
22:41
This is another happy case where the fetch lag is absolute zero with no follower partitions being behind. Note all the partitions are at the broker level and this is also another happy case where even though there is a load skewness, that particular broker is able to keep up with the incoming traffic.
23:04
Okay, now we move on to the consumers, the last piece of the pipeline. Now there are two types of offsets which are kept at a partition level: a current offset and a committed offset. The current offset is the position from which the next record will be fetched when it is available.
23:20
It is a simple integer or long number that is used by Kafka to maintain the current position of the consumer. Then there is the committed offset: the position the consumer has confirmed processing up to. The consumer does not need to maintain this committed offset anywhere itself. Committed offsets are used to avoid sending
23:41
same records to a new consumer in the event of a partition rebalance. Now we have two consumer groups here, consumer A and consumer B, and for both the current offsets, current offsets are four. All the messages have been propagated to their respective groups and being processed at their own speed.
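The two offsets can be sketched with a toy bookkeeping class mirroring the example above. The names and structure are illustrative, not Kafka's internal implementation.

```python
# Toy sketch of current vs. committed offsets for one partition,
# tracked per consumer group (illustrative, not Kafka internals).
class GroupOffsets:
    def __init__(self) -> None:
        self.current = 0    # position of the next record to fetch
        self.committed = 0  # last position the group confirmed as processed

    def fetch(self) -> int:
        """Fetch the record at the current offset and advance it."""
        offset = self.current
        self.current += 1
        return offset

    def commit(self) -> None:
        """Confirm processing up to the current position."""
        self.committed = self.current

group_a, group_b = GroupOffsets(), GroupOffsets()
for _ in range(4):          # both groups fetch four records
    group_a.fetch()
    group_b.fetch()
group_a.commit()            # group A confirms all four
group_b.committed = 2       # group B has only confirmed two so far
```

On a rebalance, a new consumer would resume from `committed`, not `current`, which is exactly why the committed records are never re-sent.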
24:01
Group A is at four, hence the committed offset is also four for it, while for consumer B the committed offset is two. These committed offset values of the respective groups are then written internally by the brokers to an independent internal Kafka topic, __consumer_offsets,
24:22
to keep track of them in the future. We can also push these offset values to some other custom topic based on our requirements. Now, this particular topic, __consumer_offsets, can further be consumed by the same Kafka cluster
24:41
with dedicated consumer, and we can write these long values to our time series database, like Prometheus, and visualize on Grafana like this. This offset long value can be emitted for every consumer for a topic down to its individual partition. There are a number of tools available, both open source and paid, which does all of this for us,
25:00
and we can visualize the true consumer offset and consumer rate on a time series basis. We are going to discuss one on the next slide. Now Burrow, which I personally like very much, is LinkedIn's open source monitoring companion for Apache Kafka that provides consumer lag checking as a service, without the need for specifying thresholds.
25:23
It exposes consumer offset lag as a gauge metric to Prometheus in a consumer partition combination for a topic, and it makes it very convenient to visualize near real time. We will understand what lag specifically denotes in the next slide. Now Burrow automatically monitors
25:42
all consumers using Kafka's committed offsets. It registers a consumer group, reads committed offsets, and makes metrics available for evaluation in a sliding-window manner. Burrow is written in Go; it has excellent concurrency features and is compatible with Kubernetes.
26:00
An HTTP endpoint is provided to request status for a consumer on demand. There are also configurable notifiers that can send status out via email or HTTP calls to another service like Slack or PagerDuty. Now lag in Kafka terms is essentially head offset of a broker,
26:21
the last offset at the broker level for a partition of a topic, minus the consumer's offset for that particular partition of the topic. It is again a typical long value. Burrow internally takes care of both offsets, calculates the lag, and stores this data locally in a time series format.
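The lag arithmetic, plus a much-simplified version of the sliding-window evaluation, can be sketched as follows. Burrow's real evaluation rules are more nuanced (it also detects stalled consumers from committed offsets that stop advancing), so treat this as an approximation only.

```python
# Simplified sketch of Burrow-style lag evaluation (illustrative only).

def consumer_lag(head_offset: int, consumer_offset: int) -> int:
    """Lag = broker head offset minus the consumer's offset for a partition."""
    return head_offset - consumer_offset

def evaluate(lag_window: list) -> str:
    """Classify a window of lag samples as OK or WARN (consistently rising)."""
    if all(lag == 0 for lag in lag_window):
        return "OK"
    # Strictly increasing lag across the whole window suggests the
    # consumer cannot keep up with the producers.
    rising = all(a < b for a, b in zip(lag_window, lag_window[1:]))
    return "WARN" if rising else "OK"

assert consumer_lag(head_offset=1000, consumer_offset=950) == 50
assert evaluate([0, 0, 0, 0]) == "OK"     # fully caught up
assert evaluate([10, 25, 60, 140]) == "WARN"  # lag steadily climbing
```

A spiky-but-bounded lag series evaluates to OK here, matching the talk's point that absolute lag values alone produce false alarms.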
26:40
And on a sliding window interval, one hour, two hours, four hours, these trends are evaluated on top of the time series values, and the respective alerts are fired. Now, as you can see in this example, the consumer offset on the left-hand side is on the rise for the last 60 minutes, depicting that consumption of messages
27:01
is happening regularly, while the lag graph has no specific uptrend or downtrend, suggesting the consumer is able to catch up with the production rate now and then. We as Kafka administrators need not be worried here. The status of the consumer will be evaluated as OK by Burrow and similar lag-evaluating tools.
27:23
Now taking another case, this is a typical trend where the consumer lag is increasing. It is a clear uptrend on the left panel along with consumer offsets, which is also increasing. It's pretty evident that consumer is not able to process data at the speed of being written by the producers.
27:40
Now, if we plot an hourly rate graph for a four-hour window, it is evident that the consumer rate is lingering around two million messages per hour, the blue line at the bottom, while the producer rate has suddenly increased well beyond two million, and some work is needed at the consumer level to match the incoming rate.
28:00
The status for this consumer will be in a warning state. If the production rate decreases back to two million per hour, the consumer can eventually catch up. But if this uptrend continues, it can become a problematic situation, since each message at the broker has a time to live, a TTL, and there is a great risk of losing data
28:20
due to those approaching TTLs. So some thought and adjustment is needed here. This is a single example, but it is an actual pipeline working in one of our services. The next steps advised here are probably adding more partitions to the topic, adding more consumers to the group, or scaling the servers hosting the consumers, et cetera.
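As a hedged back-of-the-envelope sketch (not from the talk's actual pipeline, and with purely illustrative numbers), the "add more consumers" decision can be framed as simple arithmetic on the observed rates:

```python
import math

# If each consumer instance sustains roughly the observed per-consumer
# throughput, how many instances are needed to match the producer rate?
def consumers_needed(producer_rate: float, per_consumer_rate: float) -> int:
    return math.ceil(producer_rate / per_consumer_rate)

# Producer suddenly at ~3M msgs/hour, one consumer handling ~2M/hour.
needed = consumers_needed(producer_rate=3_000_000,
                          per_consumer_rate=2_000_000)
```

Remember that consumers in a group beyond the partition count sit idle, so scaling the group out only helps if the topic also has at least that many partitions.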
28:42
The big four metrics, along with the broker dashboard, help us take that decision. Consumers can also stop processing incoming data entirely and become stalled, with lag increasing at the same rate as the producer rate. Plotting an hourly chart confirms the behavior: the consumer is evaluated as stalled,
29:02
and an error message will be sent across. Burrow does all of this for us. Consumers can stall for many reasons related to the space we are in: servers crashing, network packets dropping, the disk on which consumers are hosted filling up, and so on. Another benefit of keeping track of stalled consumers
29:21
is that it gives us a way to determine which consumers are no longer in use. Now let's try to understand whether we can draw more insights from these available offset time series panels. The two hourly broker and consumer offset line graphs
29:41
look almost identical, denoting that there must not have been an issue in the given time period, which is four days. But when we look at the consumer lag line graph, there are spikes occurring at various points in the period. Burrow or any other tool will detect the spikes and send alerts
30:00
to the administrator. One of the fundamental issues we face while analyzing absolute values is that alerts formed on a trend without a relative base end up being false alarms. There are at least seven to eight false alarms here in this four-day period. Now, to convert this consumer lag
30:21
into relative terms, let's take this scenario of a producer and consumer in Kafka. The producer was at offset value 134 at midnight, 144 ten minutes later, and 154 twenty minutes later. We can derive that the producer rate in this case is one message per minute, while the consumer is running behind and is still at offset value 134
30:42
while the producer is at 154. With this data, we can derive a lag which is time-based: take the difference of the latest producer offset and the latest consumer offset at a given point in time and divide it by the producer rate, which here is one, to determine the lag in time units.
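Using the talk's own numbers (offsets 134, 144, 154 at ten-minute intervals), this time-based lag works out as follows; the helper names are mine, not from any monitoring tool:

```python
# Time-based lag: (producer_offset - consumer_offset) / producer_rate.
def producer_rate_per_min(samples: list) -> float:
    """Rate from (minute, offset) samples, using first and last points."""
    (t0, o0), (t1, o1) = samples[0], samples[-1]
    return (o1 - o0) / (t1 - t0)

def time_lag_minutes(producer_offset: int, consumer_offset: int,
                     rate: float) -> float:
    """Minutes the consumer needs to catch up if the producer stops now."""
    return (producer_offset - consumer_offset) / rate

# Producer at offset 134 at minute 0, 144 at minute 10, 154 at minute 20.
rate = producer_rate_per_min([(0, 134), (10, 144), (20, 154)])
lag_min = time_lag_minutes(producer_offset=154, consumer_offset=134,
                           rate=rate)
```

With a rate of one message per minute, the 20-message offset gap becomes a 20-minute time lag, matching the 12:20 example in the talk.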
31:01
So here at 12:20 the lag is 20 time units, minutes in this case, where 20 signifies that if the producer stops sending more messages to the topic, it will take 20 minutes for the consumer to catch up. Now, with our absolute
31:20
consumer offset lag graph, we plot the time-based one side by side. In the time-based lag graph, the consumer is mostly 10 minutes behind the producer, and the unit here is minutes itself. There are no big peaks in the graph apart from six or seven odd long bars which have no continuity.
31:40
The chart at the bottom does not suggest any minor or major issue, and we can very well call this a normal trend for a consumer, just like we guessed when we were looking at the producer and consumer line graphs a couple of slides back. Converting the lag into time units provides a much better understanding of how far the consumer is actually behind and whether it is satisfying our SLAs or not.
32:01
Different topics and configurations, partitions, replicas, can be load tested to establish these time units as a base. Moving away from absolute offset values to time units brings all these Kafka pipelines together under the same umbrella, and you can compare pipeline versus pipeline, not only configuration versus configuration.
32:22
What we are looking at here is nothing new. It is a pure statistics function, and it has been discussed by other Kafka users on various user forums. Now, one of the most important activities we do with time series data is performing analysis on trends, that is, understanding historical behavior to predict the future.
32:41
How much data came in last week versus the previous one, the expected traffic for a big sale day, et cetera. Analyzing trends provides clarity. Let's start with the rate of topic growth.
33:01
Growth here suggests the producer rate moving up as more customers come onto our platform. Is the lag well under control, or do we need more partitions? You can make an educated guess by analyzing previous intervals or other topics' metrics data. One of the fundamental issues we have seen
33:21
is abnormal spikes at producers without the service provider being informed. In most cases we don't control the incoming data rate, but the right expectations need to be set so that your cluster is not overwhelmed entirely. Similarly, a downward spike at the consumers needs to be alerted on, as it indicates potential infrastructure failures.
33:45
Every message written to a partition on a broker is kept only for a while: it has a time to live, a retention span, within which a consumer can read and process it. If the time lag goes beyond the set TTL, it is guaranteed that some messages will be lost at the broker level.
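A minimal sketch of this TTL check, assuming we already have the time-based lag from the previous slides; the thresholds and status names here are illustrative, not Burrow's actual states:

```python
# Flag a consumer whose time lag approaches the topic's retention (TTL).
# Beyond retention, unread messages are deleted at the broker, i.e. lost.
def retention_risk(time_lag_min: float, retention_min: float,
                   warn_ratio: float = 0.8) -> str:
    if time_lag_min >= retention_min:
        return "data-loss"   # lag is already past retention
    if time_lag_min >= warn_ratio * retention_min:
        return "warning"     # approaching the TTL, act now
    return "ok"

# Seven-day retention (in minutes) with a consumer roughly six days behind.
status = retention_risk(time_lag_min=6 * 24 * 60,
                        retention_min=7 * 24 * 60)
```

Expressing both lag and retention in the same time unit is what makes this comparison possible; with absolute offsets there is no direct way to relate lag to the TTL.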
34:02
Looking at the time lag as a historical trend is better than looking at an absolute value at a given point in time. Now, we did not discuss the infrastructure or process-level metrics for Kafka in this talk, but servers, network layers, disks, the cloud provider
34:21
if applicable, et cetera, are equally important as the functional metrics. Having proper visualization of CPU, memory, and storage consumption helps us avoid cluster-killing scenarios. If you are still running a ZooKeeper ensemble for your Kafka brokers, taking a holistic view of its health is desirable as well. Also, Kafka, depending on the hardware it is hosted on,
34:44
has a soft upper limit on how many partitions it can hold. Currently it is advised to have a maximum of 4,000 partitions per broker. This is partly due to the overhead of performing leader elections for all of them on ZooKeeper. Now, in this talk we may have overlooked some very important metrics
35:01
or trends which experienced Kafka practitioners have known for years. I'm really looking forward to getting feedback and learning from the community to build this observability framework and pipeline even better. These are the references this talk has taken inspiration from in one way or another, starting with the fundamentals from the official Kafka documentation.
35:21
There are a bunch of articles, posts, and YouTube videos available on Kafka architecture and various monitoring solutions; I have listed some of them. I highly recommend Pepperdata's Kafka monitoring talk on YouTube. Please refer to the Burrow, Prometheus, and Grafana official websites to learn more about them and understand their integration better. And for good study material on observability,
35:40
the Google Cloud team has published an excellent blog on measuring improvements in a distributed system. With that, I'm open to questions or anecdotes from your experience. I really hope this was useful in some way. I'm available on all social media platforms at Sarkar Amrit too. I thank you again for being here at this conference and this talk. Thank you.