Apache Kafka simply explained
Formal Metadata
Number of Parts: 56
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/67148 (DOI)
Transcript: English(auto-generated)
00:07
Hello everyone. Before we begin, I wanted to apologize. If I cough a bit, I will try to do it like this. Because I had the flu a couple of weeks ago, but this stupid cough still remains. So, welcome. Today we are going to talk about an open source platform that you have undoubtedly heard of.
00:28
Unless, of course, you came here to learn more about Franz Kafka. Then I must disappoint you. Because today we are going to talk about a distributed event streaming platform that has existed for over 10 years and has become the de facto standard for data streaming.
00:45
It is widely used in the industry, but still somehow can be challenging to understand. That's why our goal for today will be to look at this technology and its fundamental building blocks in simple terms. Doesn't mean we will not dive deep, but we'll try to keep it simple.
01:04
So, what is our plan? We will start with a bit of a background story: where the need for Apache Kafka is coming from. Then we will look at the theoretical concepts and the basic building blocks of Apache Kafka. And we will alternate those theoretical parts with some practical exercises and demos.
01:23
And finally, I will leave you with some ideas about what you can do if you plan to start using Apache Kafka in your projects. I also prepared a GitHub repository with the examples I will be showing. So, if you are really good at doing several things in parallel, you can even try to follow me here.
01:42
Or you can keep it for later to experiment with. So, before giving the definition of Apache Kafka, what I wanted to do is give you an example of a project where Apache Kafka makes a significant difference to the users and to the developers of the system.
02:01
And my ingenious project idea is based on an animation movie you might have seen, Zootopia. If you haven't seen it, no worries. But if you have, you will recognize some of our characters. Because today, you and me, we are going to build the first e-commerce product of Zootopia and we'll call it Shopitopia.
02:21
Like in any e-commerce project, we would like to have a set of products we sell. And some interface, maybe a web interface to start with, where our lovely customers can select what they want, search for products, place an order, pay, and wait for delivery.
02:42
And at the start, when working with the system, we might decide to keep all our modules next to each other inside a single monolith, where our frontend will communicate directly with the ordering backend module, the module for payments, and the one for deliveries.
03:04
We probably have some data store next to those modules, and it is all closely interconnected. And this might work at the start, when we still have a small number of customers and a limited amount of functionality.
03:22
However, when our system becomes more popular and we have more customers coming, we decide, for example, to replace our frontend with separate applications for the web, Android and iOS. For user convenience, we would also like to add a notification service, maybe some recommendation engine to suggest some other things users can buy.
03:44
And also a delivery tracking service which will help them to follow the delivery path to their homes. And if we continue adding these modules into our monolith, very soon the architecture and the communication flow of the system will become a mess.
04:09
A mess that is difficult to support and difficult to expand. And with our development team growing, no single individual will be able to keep up with the data flow of our system.
04:22
It will become as painful as untangling your headphone cable after a day in your pocket. And at this point in time, we will need to have a tough conversation about how to divide our monolith into a set of independent microservices with clear communication interfaces.
04:41
What is even more important, our new architecture should allow us to be as close to real-time communication as possible. To rely on the real-time events so that our customers don't have to wait until tomorrow to get meaningful recommendations based on their activity from yesterday.
05:01
Nowadays, everyone, including us in this room, expects immediate feedback based on our actions. We don't really want to wait. And as developers, we want to get the maximum value out of real-time data in the most efficient and hopefully painless way.
05:22
It would also be cool if our new system gave us some tools to handle real-time monitoring, reporting and data processing. And admittedly, this is quite a lot to ask. Introducing such activity tracking is an immensely high-volume operation.
05:40
And we need to be very careful to remain resilient, not to lose any data. Luckily, Apache Kafka can help us exactly in this scenario. Apache Kafka is really helpful at untangling data flows and simplifying the way how we handle real-time data.
06:02
So now I think we are ready to come back to the definition of Apache Kafka. And I know definitions are boring, but it's also so important for us to be on the same page. So what is Apache Kafka? Apache Kafka is an event streaming platform that is distributed, scalable, high throughput, low latency and has a very wide ecosystem.
06:25
Or, simply put, it is a platform to handle the transportation of messages across your multiple systems, your multiple microservices. These can be, like in our example, front-end and back-end modules, but they can also be a set of IoT devices.
06:44
Maybe it's a teapot in your kitchen sharing information about the water temperature with your mobile phone. Apache Kafka is distributed, meaning that it relies on multiple servers with data replicated over multiple locations. So that if any of those servers goes down, we are still fine; our product can still function normally.
07:10
Also Apache Kafka is scalable, meaning that we can have as many of those servers as needed. And they can handle millions of items going through them on a daily basis with petabytes of data stored persistently on those disks.
07:33
And what is amazing about Apache Kafka is its community and ecosystem, including the client libraries.
07:42
And we will see one for Java today, but also there are plenty of others. And a set of connectors which you can use. So that if you decide to use Apache Kafka, you don't have to reinvent the wheel, but you can rely on the work of other amazing developers who already solved the issues and shared their solution as an open source product.
08:09
So now, to understand how Apache Kafka works and how we can work with Apache Kafka, we need to talk about Kafka's way of thinking about data.
08:21
The approach which Kafka takes is quite simple. Instead of thinking of data in terms of static objects or final facts, it looks at data and describes entities by continuously incoming events. So in our case, for example, we have a list of products we sell.
08:41
And the information, the characteristics of those products, can be described in a table in a database. And this gives us some valuable information, some valuable insights. However, if we come up with new questions, for example, if we want to get more data about the sales trends, about the peak
09:01
search times, or demand, it will be quite tricky to deduce this information from this table in the database unless we planned for it in advance. Because anything which is stored in a table in a database is a compressed snapshot, a one-dimensional view, or you can think of it as a single point on an infinite timeline.
09:25
So what if, instead, we will look at this data as a flow of events? For example, a customer ordered a tie. Another customer searched for a doughnut. And then we dispatched the tie to the first customer and the second one decided to buy a doughnut.
09:44
And so on, we have events coming into our system. And this shows us a whole cycle of a product purchase. Instead of seeing a single dot, we observe the change of state. We can replay the events. If we want, we can move to the start, we can move to some certain point of time.
10:08
We cannot change the events. They already happened. But we can kind of relive them again and again, calculating different metrics and answering a variety of questions we might have even later.
10:20
So this approach gives us a 360-degree view on our data and plenty of possibilities. And it has a name which you're probably familiar with. It is an event-driven architecture. So now, let's see how Apache Kafka works with an event-driven architecture. I will put the Apache Kafka cluster in the middle, and on the left
10:44
and on the right, you will see how other applications interact with the Apache Kafka cluster. The Apache Kafka cluster coordinates data movements and handles the transportation of incoming messages. It uses a push-pull model to work with the messages,
11:03
meaning that on one side we have structures which will create the message and send it to the cluster. These structures are called producers. Producers are applications you write and control. And on the other side, we have consumers.
11:22
Consumers will pull the data from the cluster, read it, and do whatever you need to do with that information. Consumers, again, are applications you write and control. We can have as many producers and as many consumers as we need. So, for example, in our system, producers can be connected to the front-end applications,
11:45
observing user action, and then sending the records with those events to the cluster. And the consumers can be connected to the back-end modules for notifications, for recommendation engine, for delivery system.
12:05
Producers and consumers can be written in different languages, without knowing anything about each other. In fact, it can happen that we want to shut down one of the producers and replace it with another one.
12:22
And our consumers will not even realize that there is a difference. And it is possible that something happens to one of the consumers and it needs to restart. And the producers will continue sending data into the cluster, and the data will be persistently stored.
12:42
And when the consumer recovers, it can start where it left off. So, in this way, no synchronization is expected between the producers and the consumers. They work at their own pace, at their own convenience. And this is how Apache Kafka helps us to decouple our system.
13:07
So, now that you know who sends the messages and who receives them, let's look inside the cluster at the records we have there. So, the events which are sent one by one into the cluster, we call a topic.
13:25
A topic actually is an abstract term. So, in fact, a topic is not how we store the data on the disk, but rather how we think of it, just to make our life a bit easier. So, we can have as many topics as you need, and they can be compared to tables in the database.
13:47
For example, we can have a topic related to product ordering lifecycle. We can also have a topic related to the user registration and conversion,
14:00
or maybe a topic that contains the application events, any warnings, any errors, and just the health state messages coming from our multiple systems. The data is continuously flowing. There are no pauses, no breaks. If our application continues working, we have customers who are registering,
14:25
we have customers who are purchasing the products. The messages are ordered. So, each message has a sequence number, also known as an offset, and this offset uniquely identifies the position of the record.
14:44
The messages also are immutable, so you cannot really change them later, and this makes total sense. If some event happened, someone bought a doughnut. It's not like you can go back in time and change that fact, unless, of course, you are Michael J. Fox with DeLorean,
15:01
but otherwise, if you don't like your doughnut, you can throw it away. But technically, this is a new event. So, you can look at the topics and you can say that they remind you of a queue. But here's a twist. In Apache Kafka, the consumed messages are not removed from the queue and not destroyed.
15:24
In fact, it's quite common that multiple consuming applications will be reading from the topic and using that information from different angles, because, as we said, when working with events, we can approach our data from 360 degrees.
15:43
And I think now that you know a bit about the topic and a bit about the producer and consumer, it is time to pause and create our own producer and consumer. In the GitHub repository, you will find more detailed steps on how to run these examples.
16:03
This is a Java project, and you will find the list of dependencies you will need to have. Also, you will find the JSON file, where you need to put the information about your cluster, your Apache Kafka cluster, to be able to connect to that cluster.
16:21
And yes, to run these examples, you need to have your Apache Kafka cluster. Apache Kafka is an open-source project, so you can use its source code to set it up locally or somewhere in the cloud. You can also use the help of Brew or Docker, or you can move one step further and use one of the available managed versions of Apache Kafka.
16:47
I personally used Aiven for Apache Kafka, and knowing how much love my team puts into Aiven for Apache Kafka, I can only recommend it, and you can use it as well. There is a free trial which you can use,
17:02
which will be totally sufficient for all the examples I'm showing here. So here, we are going to look at a single topic, and we are going to send records with the help of a very simple producer, and then we will read the data with the help of a very simple consumer.
17:21
And before we start writing the code to produce and consume messages, we need to configure our producer and our consumer, so that they know how to connect to the cluster and how to send the data. So we need to specify the address of the cluster, as well as the serialization and deserialization mechanisms,
17:43
so in which way the data will be transmitted. Then, we need to make sure that our cluster will be able to trust our producer and consumer, and for this, I'm using SSL protocol, and the properties which you can see here are exactly those which we have in the JSON configuration file.
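To make this concrete, here is a minimal sketch of what such a configuration could look like with the Java client. It is only an illustration: the host name, file paths and passwords are placeholders, not the actual values from the talk's JSON configuration file.

```java
import java.util.Properties;

// Shared connection settings used by the producer and consumer sketches below.
// All values here are placeholders.
public class KafkaConfig {
    public static Properties baseProperties() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "my-kafka-host:12345");          // address of the cluster
        props.put("security.protocol", "SSL");                          // mutual trust over SSL
        props.put("ssl.truststore.location", "client.truststore.jks");  // to trust the cluster's certificate
        props.put("ssl.truststore.password", "truststore-password");
        props.put("ssl.keystore.type", "PKCS12");
        props.put("ssl.keystore.location", "client.keystore.p12");      // client certificate the cluster can trust
        props.put("ssl.keystore.password", "keystore-password");
        return props;
    }
}
```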
18:09
And now that we are done with this, we can move on and write the code of the producer. I'm relying here on the Apache Kafka client library for Java,
18:23
and I'm creating a producer object. Also, we need to decide on the name of the topic that we are going to use. Then, we will create a message that we want to send into the cluster. I'm using a JSON object here. It doesn't have to be JSON; it can be a string or something else.
18:42
And then we package the JSON object inside the record. And finally, we send the record into the cluster. And this will send a single item into the cluster. But for our example, I wanted to have a continuously flowing stream of data, so that it flows like water in a river.
19:02
That's why I used our good old friend, the while(true) loop, and added a second of delay. And you probably wonder what I have in those messages and what we have in the generateMessage method. It's quite simple. I actually create here an object with three properties.
19:23
Which customer does which operation to which product. And I have predefined arrays of values from which I randomly select.
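For reference, a minimal producer along the lines just described could look roughly like this. It is a sketch, not the exact code from the repository: the topic name, the customer, operation and product values, and the KafkaConfig helper from the earlier configuration sketch are all placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;
import java.util.Random;

public class SimpleProducer {
    // Placeholder values that are randomly combined into messages.
    private static final String[] CUSTOMERS = {"Judy Hopps", "Nick Wilde", "Mr. Big"};
    private static final String[] OPERATIONS = {"searched", "ordered", "paid"};
    private static final String[] PRODUCTS = {"tie", "doughnut", "carrot"};

    public static void main(String[] args) throws InterruptedException {
        Properties props = KafkaConfig.baseProperties();
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Random random = new Random();
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                // Package the JSON payload inside a record and send it to the topic.
                producer.send(new ProducerRecord<>("customer-activity", generateMessage(random)));
                Thread.sleep(1000); // one record per second, a steadily flowing stream
            }
        }
    }

    // Builds a small JSON payload: which customer does which operation to which product.
    private static String generateMessage(Random random) {
        return String.format("{\"customer\":\"%s\",\"operation\":\"%s\",\"product\":\"%s\"}",
                CUSTOMERS[random.nextInt(CUSTOMERS.length)],
                OPERATIONS[random.nextInt(OPERATIONS.length)],
                PRODUCTS[random.nextInt(PRODUCTS.length)]);
    }
}
```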
19:40
So now, if we run our simple producer, we will see that after connecting to the cluster, we start sending data, those small JSON objects, into the cluster. But of course, you don't have to trust me. To make sure that the records indeed land in the cluster, let's create our consumer and bring the data from the cluster.
20:03
To create the consumer, again, we are relying on the client library and creating a consumer object. And we are subscribing to the topic, the same topic we were sending data to just a second ago. And then, at a regular interval, we go and check,
20:21
do we have any new records there or not? And if we do, we print out the information about the records. So, the simple consumer is located next to the simple producer. And if you run it, you can also observe that first we connect to the cluster
20:41
and we are using the SSL mechanism to establish all the necessary trust parameters. And then we start pulling the data and print out the JSON objects which we sent there before.
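A matching minimal consumer sketch, again assuming the Java client and the placeholder KafkaConfig helper from before; the topic and group names are made up for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = KafkaConfig.baseProperties();
        props.put("group.id", "simple-consumer");      // required when subscribing to a topic
        props.put("auto.offset.reset", "earliest");    // start from the beginning of the topic
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("customer-activity"));
            while (true) {
                // Poll at a regular interval and print whatever new records arrived.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
            }
        }
    }
}
```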
21:03
So, this is how we can work with a topic with the help of a producer and a consumer. Let's add two more concepts into our story: brokers and partitions. I mentioned already before that Apache Kafka relies on multiple servers.
21:21
And in the Apache Kafka world, we call those servers brokers. And topics need to be stored somehow over those brokers, over those servers. And you can imagine that topics can grow to be quite long. If we continuously add more and more information over time,
21:42
we get a very big topic. So, we need to find the way how to store it efficiently. And this is where I wanted to come back to what I mentioned before, that a topic is actually an abstract term. The topic itself is not a physical, tangible thing stored as a whole on a single server.
22:01
It's probably neither reasonable nor feasible to keep a topic as a single piece of data on a single machine. It is very probable that one day the size of the topic will outgrow the server's memory. So, we need to find a way to scale horizontally and not vertically.
22:21
And for this, we are going to split our topic into multiple chunks stored across multiple machines. And these chunks are called partitions. A partition is technically a log of messages.
22:41
And if you are not familiar with the concept of a log, it's a pretty simple data structure: an append-only sequence of records ordered by time. And those logs, which we will call partitions, we store across the brokers. The partitions are independent, self-sufficient entities.
23:03
And each partition will be responsible for maintaining the sequence numbers, those offsets, for its own values. So, sadly, there is no beautiful continuous enumeration like the one on the slide right now. The offsets are local to the partitions, so let me fix it.
23:23
So, when I say a record with offset number 2, I need to specify which partition it is from. And behind the scenes, our producers and consumers know quite well how to work with multiple partitions. However, there is still one challenge which we will need to solve.
23:46
And a spoiler alert, the challenge is related to the ordering of the messages. Because in many cases, we want the messages to be read by the consumer in the same order as they were sent by the producer.
24:01
And let me explain what I mean. We have a flow of data, so in this example, we have new records coming from some kind of source, and we would like to send them into a topic. And our topic consists of three partitions. If we don't do anything extra, the records will be divided across partitions
24:24
in some way which Apache Kafka will find the most efficient. So, here Apache Kafka tries to help us. But the tricky part comes when we start consuming those messages.
24:40
Because our consumers also want to be fast and efficient. And it is very possible that the consumer will connect to partition number 2 first, read messages from there, then move to partition number 1, then move to partition number 3. So, the ordering of the partitions, how the data will be read, we cannot guarantee.
25:03
What we can guarantee is only the order within a single partition. So, the order of the messages within a single partition. And we need to use this to our advantage in those cases where the order of the messages is important.
25:21
And for us, for example, it's quite important that the messages related to the activity of every single customer are kept in the correct order. So, the customer should first select the product, add it into the basket, then pay for that product, and then we send it for delivery.
25:41
If the order is messed up, this makes our system a bit, actually, completely useless. So, to guarantee that the ordering for every single customer remains correct, we will split our messages across partitions in a way that data related to each customer
26:02
will get exactly into the same partition. And there are different ways we can do it, but we will see how this can be done with the help of a key. Before, we were sending our data into the cluster with just the message value,
26:20
but every message can be accompanied with a key. And the key plays a dramatic role in how messages are distributed across the partitions, because all messages with the same key will be getting into exactly the same partition. So, here we have a topic with product purchase lifecycle, and we have some customers,
26:45
so we have a fox who is doing some shopping, we also have some other customers, and when we start sending the data, and here we are going to use the customer ID as our key, so we send a message related to the activity of the fox, and it lands in partition number one.
27:04
This indicates to Apache Kafka that all following messages with the fox's key should land exactly in that partition, and the same happens to all of our other records. So, in this way, we distribute the records, but we control which records get into which partition.
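Conceptually, this works because the partition is derived from the key, so the same key always maps to the same partition. A simplified sketch of the idea is below; note this is only an illustration, since Kafka's default partitioner actually uses a murmur2 hash of the serialized key rather than Java's hashCode.

```java
// Simplified illustration of key-based partitioning: hash the key and take the
// result modulo the number of partitions, so equal keys always land together.
public class KeyToPartition {
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions; // non-negative hash, then modulo
    }

    public static void main(String[] args) {
        // The same customer always maps to the same partition out of three.
        System.out.println(partitionFor("Judy Hopps", 3));
        System.out.println(partitionFor("Judy Hopps", 3)); // same number as above
        System.out.println(partitionFor("Nick Wilde", 3));
    }
}
```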
27:27
So now, when we start assembling the data with the help of a consumer, we will have no risk that the ordering of the data will be incorrect. And we can see this also inside the demo.
27:44
So, we can improve our consumer a bit and add information about the partition and offset of every message that we read. And we can also improve our producer by introducing the key. So now, when we're packaging our message, we need to add the value, so the body of the message,
28:06
also the key, and here I'm using the name of the customer, I mean it should be an ID, but let's say my small system has only a small number of customers, so that's okay, and we are sending that information to the cluster.
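Relative to the earlier producer and consumer sketches, these are roughly the lines that change; the topic name and customer name are placeholders, and in a real system the key would be a customer ID.

```java
// Producer side: attach the customer as the record key, so all of this
// customer's events land in the same partition and keep their order.
String key = "Judy Hopps"; // placeholder; ideally a customer ID
String value = "{\"operation\":\"ordered\",\"product\":\"carrot\"}";
producer.send(new ProducerRecord<>("customer-activity", key, value));

// Consumer side: also print partition and offset to make the per-customer
// partition assignment visible.
System.out.printf("partition=%d offset=%d key=%s value=%s%n",
        record.partition(), record.offset(), record.key(), record.value());
```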
28:21
So now, if we run our improved producer and improved consumer, we can see that when we are reading the data, all messages related to a single customer get into a dedicated partition, and that partition doesn't change. So, all messages related to the activity of the fox get into partition 0,
28:44
those of the lion into partition 1, and those of Mr. Big into partition 2, and so on. I also have a filtered consumer, which you can run, where I filter data only for one of our customers,
29:01
so that you can see that indeed, for example, all data for Judy Hopps gets into partition 1. And this is the role of brokers and the role of partitions, but our story wouldn't be complete if we didn't talk about replication.
29:24
Replication is quite important. As we said, we want to have multiple copies of our data stored across various locations, so that if any of the servers goes down or something else happens, we still have enough copies. We also want to account for maintenance windows and other situations.
29:44
So, that's why we need to have extra copies. Let's come back to this slide, where we were talking about how we divide a topic into a set of partitions. And dividing the topic into these digestible chunks also gives us kind of a straightforward
30:06
or maybe even an elegant possibility to replicate the data. And to achieve this, every broker will hold not only one partition; instead, we will have several copies of every partition balanced across the brokers.
30:27
So, in this way now, if you look at this, we have two copies of every partition across our cluster. The fact that it's two is also called a replication factor of two.
30:42
To account for maintenance and other things that might happen to the data, it's usually recommended to use a replication factor of three, and you can see how this would look with three copies of the data.
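The replication factor is chosen when a topic is created. The talk does not show this step, but as an illustration, with the Java AdminClient a topic with three partitions and a replication factor of three could be created roughly like this (the topic name and the KafkaConfig helper are placeholders):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.concurrent.ExecutionException;

public class CreateTopic {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        try (AdminClient admin = AdminClient.create(KafkaConfig.baseProperties())) {
            // 3 partitions spread over the brokers, each kept in 3 copies.
            NewTopic topic = new NewTopic("customer-activity", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```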
31:04
And, to be honest, I see we still have about 30 minutes, and I actually think I spoke quite fast, so we are slowly wrapping up earlier than I expected; I thought we would run out of time by this point. We covered plenty of things, and I would like to leave you with some ideas for the future
31:25
if you want to start using Apache Kafka in your projects: what to pay attention to and what you can look at next. So, first of all, I would recommend practicing the fundamentals. We created a producer and a consumer.
31:42
You can use other languages. If you are not really using Java on a daily basis, the code for JavaScript or Python or Go would actually look quite similar. Maybe the security connection setup will look a bit different. Also, if you don't want to use code to send the messages to a topic
32:01
and read the messages from the topic, you can use the kcat utility, which is very helpful because you can work from a terminal. You again need the security credentials stored there, but then you can send the items into the cluster through the terminal, line by line,
32:21
and you can read the items back through this utility. Also, check out the schema registry. Schema registry is amazing. We were sending the data as JSON objects. But, of course, when your system grows, you want to have better control over what format of data is flowing
32:41
and what is exchanged between your microservices. You probably also want to have a versioning of that data structure. So schema registry is super helpful for that. Then there are connectors. Connectors are just magical pieces of software which will help you to connect your existing systems
33:02
to the Apache Kafka cluster, almost with no code, just a configuration file. Or, of course, if you want to create a new connector, you will need to use some code. And there are plenty of connectors that already exist as open source, which you can use. And in the GitHub repository, which I shared,
33:22
there is an example of how you can connect your Apache Kafka cluster to an OpenSearch cluster, how you can bring the data there, and how to visualize it with the help of OpenSearch Dashboards. Then, of course, for the real-time processing,
33:40
you need some tools, and Kafka Streams is great at helping there. I left all these links and some more stuff. There are some examples which you can also follow in the GitHub repository. So check it out. I hope you will learn something new from there.
34:01
And finally, if you plan to work with Apache Kafka or are looking for a place where you can evaluate whether Apache Kafka is indeed the solution you need, check out aiven.io. We have a managed version of Apache Kafka, and there is a free trial to try it out. So with this, thank you so much for being at this session,
34:23
and I'm all ears to your questions. So who wants to ask a question?
34:42
No questions. I'm going to... Yeah. What's your favorite thing about Apache Kafka? I think it's how we can connect and transform the data from one system to another. So often we want to bring the data
35:01
from one data source to another data source. For example, we have something in MySQL, and we want to bring it into an OpenSearch cluster. And especially if we want to modify it on the fly, we can use Apache Kafka, and we can use Streams and also the connectors,
35:20
and especially for the popular databases such as PostgreSQL, MySQL, or other Apache Kafka clusters, there are already the connectors available. So it's quite easy to... Okay, I would not say the word easy, but you don't need to reinvent this wheel to connect those pieces,
35:41
and you can use it like Lego to bring your data from one system to another, and then, for example, use OpenSearch to visualize it how you need it. And also the fact that Apache Kafka has existed for over 10 years means that there are a lot of tools which are quite useful for different scenarios,
36:01
because a lot of people are using it and running into different challenges and solving those challenges. So I think as engineers, if you work with Apache Kafka, we build on top of it, and it's kind of this shared solution.
36:20
And... Oh, I see another question. Hello. Yeah, I would just like to ask how fast is Apache Kafka? Can we use it for low-latency applications? Could you repeat which applications? How fast is Apache Kafka? Can we use it for low-latency applications?
36:43
So Apache Kafka is fast. However, I am not sure about the second part of the question, about low-latency applications, so I may just not be familiar with those. But generally, Apache Kafka is very fast, and you can really bring, for example,
37:01
millions of records on a daily basis, and they will be stored on the servers. So, yeah. Just a practical question. If you have multiple consumers for a stream and it's really fast, doesn't the data have to persist on Kafka?
37:22
And how do you manage the persistence? How long is it there? Actually, okay. Maybe I was not entirely clear, because indeed we persist the data on the Apache Kafka servers. So we bring the data in and we don't remove it; the data keeps flowing and is stored on the servers.
37:42
Doesn't that become a bottleneck then, or does that affect performance after a while? I would say, of course, if you didn't need to store it, everything would be faster, but then so many things would not be possible to do. So that is just one side of it. There are other solutions which don't store the data,
38:01
but I would say it depends on the scenario. Often we really want to store the data persistently. For example, if any of our modules goes down, then when it recovers, it wants to continue reading that information. We don't want to have synchronization between the modules.
38:21
We want them to work at their own pace and consume the data; they only need to communicate with the cluster and take data from there, and the data needs to be stored there persistently. Of course, if we didn't have to do that, it would be even faster.
38:44
Any other questions? If not, I will still be here today and tomorrow, and if you see me, don't hesitate to come over and just talk, maybe about Apache Kafka or other open source applications. Oh, I see a question.
39:05
Hello. Is it recommended to have a schema for every message that the producer produces? I hear you talking about Apache Avro. Is it recommended to have a schema for every message?
39:22
I can't say much about Apache Avro, but yes, of course, in big projects we need one. We want to align on and have a schema, and that schema can evolve with time, so it is important that we have a schema for the message, and Apache Kafka allows that. So you can send the schema with the message,
39:43
but I think with time it will get really out of hand, because your messages increase in size if you attach the schema. So, of course, it's probably easier to have separate storage for the schema, and then you can also go further and have versioning of the schema there. Yeah, but it's very, very helpful to have a schema.
40:07
So talk to me later if you have more questions or just want to talk. Thank you so much for being at the session.