
Kafka on Kubernetes


Formal Metadata

Title: Kafka on Kubernetes
Number of Parts: 94
License: CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract:
In this talk we provide a short introduction to Apache Kafka and walk you through the steps to deploy Apache Kafka with Strimzi Kafka Operator on Kubernetes and show you how you can manage it using native Kubernetes tools. Apache Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, and can be used for Stream Processing, as a Storage or Messaging System and more. Running and operating Stateful apps on Kubernetes is not easy, at least if you’re going to deal with replication and have to take care of syncing and re-balancing your streaming data on different nodes and / or different clusters in different regions. Kubernetes is about Resiliency and Scale, Kafka too! Kafka is Stateful, Kubernetes' support for Statefulsets has reached a mature state!
Transcript: English (auto-generated)
So hi, I'm Anatoli, and you joined me today to explore the world of Kafka and Kubernetes. But first: who am I? I'm an adventurer and a computer scientist, and so the obvious job choice for me is to be an IT adventurer. I currently give Kafka trainings and do some Kafka and Kubernetes consulting as a freelancer. But today we want to talk about Kafka and Kubernetes.
Who of you has heard something about Kafka? Kubernetes? Who has Kubernetes running in production? Okay, a few. Who of you knows something about Kafka? Quite a bit, good. Who of you runs Kafka in production? Wow! Four people. More than expected. But nevertheless, let's first talk about: what is Kubernetes? I think most of you know what it is, but let's start with a very rough and short overview. For me, Kubernetes is a declarative container orchestration tool. I'm a computer scientist, and so the most interesting thing for me is the declarative part. But let's first look at the words. What is a container? Who does not know what a container is? Good, okay. So all of you know that guy. And for us it's enough to say that a container is a mini VM with one application per container. So we can have Postgres running in a container, GitLab, or Nginx. So far so good, I think you all know that. So what is a container orchestration tool then?
So the question is basically: okay, I have a runtime and I have a container, and how can I put the container in the runtime? For one runtime and one container, that's easy. You just let the container run on the one runtime you have. What if you have one runtime and N containers? Again, containers are very tiny VMs, and it's not a problem to run multiple containers on one runtime. Now the question arises: what if we have N runtimes and N containers? How can we spread the containers across the runtimes, given some heuristics or the way we want to have it? So maybe this could be one way to do that, but I'm not sure this one is the best way. Maybe there's a better one. Or what do you do if the services want to talk to each other? What do you do if you want to set up the communication between the containers? And what do you do when one of your servers is on fire, or somebody is in your network, or the network is broken again on AWS? So maybe you should think about network, security,
fault tolerance, and so on. And this is something, as most of you know, that Kubernetes helps us a lot with. And finally there's the word "declarative" in this phrase. Declarative, for me, means: I describe how the world should look and how I want to have it. So I declare which applications should run, which names the services have, which security policies I want, and then Kubernetes does all the plumbing and fulfills my wishes. I like to say to the computer: please, I want to have that world, and the computer will do the rest for me.
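Such a declarative wish, written down, might look like this minimal Kubernetes Deployment (a sketch, not from the talk; the name and image are illustrative):

```yaml
# A declarative description: "I want three replicas of my app running."
# Kubernetes does the plumbing to make this state true.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # illustrative name
spec:
  replicas: 3             # desired state, not a command
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: web
          image: nginx:1.25
```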
I'm a lazy guy, so I don't want to do it myself. So I think that's enough about Kubernetes. Of course, this was not a deep dive, obviously. Now let's talk about Kafka. For me, again, as a theoretical computer scientist, Kafka is a distributed log.
What is a distributed log? Let's talk about a log first. A log is an append-only list. Here we have six elements, for example, a few characters and a white space, and we can add more data to the end of the list. And what can we do with that thing, with that list, with that log? We can append data, and we can read the full log starting at a specific position. Appending data is pretty easy. We have a function, let's call it push, and let's say it takes a list of a's, where a is some type, it doesn't matter which one, plus another element of that same type, and gives us back a new list containing all the elements. So for us, it's just appending new elements. For example, we just appended a y, and
when we append elements, we get a new position, which we call an offset. So the list gets longer and we can address the elements in the list by their offset. We can add another element, and so on and so on. And then we can read the full log starting at a specific position. So we have a function called readFromOffset. It takes again a list of a's, and we don't care what a is, plus an integer, some number, and it returns the log starting at that position. For example, if we say that our current offset is zero and we call the function with that list and zero, we get back the whole log. Let's append a new element to the log (we have to remember our current offset somewhere) and call the function again with the new offset, and we get only a u back, one element, or rather a list containing the u. So what does that mean for us? In a log, we don't have any deletion, and we don't have any random access. We could somehow tweak it to get random access, but a log is not intended to be used that way.
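The two operations can be sketched in a few lines of Python (not from the talk; the example data and offsets are illustrative, with the function names following the slides):

```python
# A minimal sketch of the two log operations described above:
# "push" appends an element and returns the new log,
# "read_from_offset" returns everything from a given position onward.

def push(log, element):
    """Append an element to the log, returning a new log."""
    return log + [element]

def read_from_offset(log, offset):
    """Read the full log starting at the given offset."""
    return log[offset:]

log = ["s", "o", "m", "e", " ", "d"]   # six elements, incl. a white space
log = push(log, "y")                   # the new element gets offset 6
print(read_from_offset(log, 0))        # the whole log
print(read_from_offset(log, 6))        # just ["y"]
```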
And there's no magic in that. We can describe a log just by the data structure and these two functions. That's it. No magic. The magic comes when we move away from the theoretical part and go to practice. And if you're interested in more about logs, there's a good article from LinkedIn, "The Log: What every software engineer should know about real-time data's unifying abstraction." So, distributed. Kafka distributes this log. And how does Kafka do this distributed thing? What is rule number one of distributed systems?
Any ideas? Don't. But still, we need them somehow. We have a lot of distributed systems right now; maybe you have heard of this thing called microservices. I think in every IT or computer science program there is a lecture on distributed systems. So: distributed systems, no, we don't want to do that, but still we need to. The two reasons for them are: we have too much data for one computer, and computers like to crash. I hope mine won't crash today. So we have too much data, and errors occur, bam. How can we solve that? Kafka's approach for too much data is partitioning. And for computer crashes, Kafka uses replication to cope. So again, we are not trying to build better and better hardware that doesn't crash,
but we say: okay, this is a problem we have, theoretically we are not able to solve it, so how can we cope with it? So what is partitioning? A Kafka cluster, logically, is just a set of computers, and we store multiple topics on a cluster, just like on a database cluster or database instance you usually have not one table but many. And each topic (a topic is not a log, a topic is basically a set of logs) is partitioned. For example, this way: topic A has three partitions, and if we read partition 1 of topic A, it says "s e a 1 4". What does that mean? Maybe it means something, maybe not. But we just partitioned the data. For example, the input for that was some data, "12345", and the partitioning scheme was simple round-robin: we put the s in the first partition, the o in the second partition, the m in the third partition, and then start again. Okay, that's partitioning. So how do we do that?
Basically, we have a producer which writes something, and the producer itself decides which partition to write to. That's important: in a traditional message queue, the producer writes to something and the system itself decides where to put it. In Kafka, you have to decide yourself where to put it. There are different strategies for how you can do that. You can write all data to one partition, but then probably bad things will happen. You can randomly assign data to partitions, and there are multiple other ways. We will have some time in the demo to explore that.
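A round-robin partitioning strategy like the one on the slide can be sketched like this (illustrative only; it is not Kafka's actual partitioner):

```python
# Spread a stream of records across partitions in round-robin fashion,
# as in the "some data 12345" example from the slides.

def partition_round_robin(records, num_partitions):
    """Return a list of partitions, assigning records in turn."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions

parts = partition_round_robin(list("somedata12345"), 3)
print(parts[0])  # first partition: every third character
```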
And when we have multiple producers, the picture stays simple, but a bit chaotic. Every producer just writes to the partitions it needs to write to. So writing is quite straightforward. Reading is a bit more complicated. If we have just one consumer, and we want to guarantee that each message is read once, ideally once and only once, then the consumer fetches all data from all partitions. That's nice, but what if your consumer crashes? Or if the one consumer is not fast enough? Then you can just add more consumers, and Kafka will spread the partitions across the consumers. Well, not Kafka itself; the consumer libraries will do that for you.
Okay, we have two consumers and three partitions, so we send the first two partitions to consumer one and the third partition to consumer two. What do we do when we have three consumers? We assign just one partition to each consumer, and we are happy. Hmm, could we scale further? Yes? No? Does it make sense to have a fourth consumer? If we have a fourth consumer, the fourth consumer won't get any data, because we cannot split this further if we want to guarantee that each message is read by exactly one consumer. But it can still be reasonable: for example, if one consumer crashes, another consumer can take over the work and continue. Especially if you're running your favorite Java virtual machine, which needs some time to warm up, it may be very useful to have a warm standby replica, and in many cases that's recommended. Okay, that's partitioning. What about replication?
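Before moving on to replication: the partition-to-consumer assignment described above can be sketched roughly like this (illustrative; Kafka's real partition assignors are more sophisticated):

```python
# Spread P partitions over C consumers in one consumer group,
# similar in spirit to what the Kafka client libraries do.

def assign_partitions(num_partitions, num_consumers):
    """Map each consumer index to the list of partitions it reads."""
    assignment = [[] for _ in range(num_consumers)]
    for p in range(num_partitions):
        assignment[p % num_consumers].append(p)
    return assignment

print(assign_partitions(3, 2))  # two consumers share three partitions
print(assign_partitions(3, 4))  # the fourth consumer gets nothing: [[0], [1], [2], []]
```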
So we talked already about clusters. If we have a cluster, we probably have multiple brokers; in this example, three. Now, remember, we have a topic with multiple partitions, and we don't put all partitions on one broker, because why would we have multiple brokers then? Instead we put partition 1, for example, on broker 1, partition 2 on broker 2, partition 3 on broker 3, and so forth. Partition 4 then goes on broker 1 again, because we don't have a fourth broker to put it on. Okay. Now about replication. We have just spread our partitions across the brokers, but if broker 1, for example, fails,
we lose partition 1 and partition 4. How can we cope with that? We replicate the data. We put another replica of partition 1 on broker 2, of partition 2 on broker 3, and so forth. And if we lose broker 1, partition 1 is still available: it's on broker 2, partition 4 is also on broker 2, and partition 3 is on broker 3. So we survive the crash of a broker. Okay, let's say
this was a very short introduction to Kafka. So what about Kafka on Kubernetes? Why would we want to do that? Is it because of the buzzword bingo you have heard, or is it so-called conference-driven development? Or is there something behind it that may be useful or make sense? We have heard a bit about what Kubernetes is and what Kafka is, so what could Kafka on Kubernetes be? I see chances in it. What I like about Kubernetes is this declarative thing. I can just say: I want an application running, an Nginx web server, and I want three replicas of my application. As long as they don't have any state, everything is fine. I like that Kubernetes simplifies automation quite a lot, because it can help us with failover and so forth. And if all your applications are already running on Kubernetes, then why not put everything on Kubernetes? Maybe it makes sense, maybe not; let's see.
The challenge about Kafka on Kubernetes is that Kafka itself is very stateful. That's obvious: Kafka is a kind of database, and Kubernetes, at least until the last few versions, was not very happy about stateful applications; it was quite challenging to deploy stateful applications the correct way on Kubernetes. Kafka itself, if you give it RAM or memory, will take it all if it can. The recommendation for a small Kafka cluster is: have three machines with 30 gigs of RAM each for the beginning, and if you need more, get more. Kafka also dislikes sharing CPU and so on, while Kubernetes is all about resource sharing: you often try to put as many containers on one machine as possible to save compute costs in your cloud environment.
And most importantly, Kafka requires very careful consideration of where you deploy it. Kafka is very fast, but this comes at a cost: you should be careful and you should know what you're doing. Kafka itself does not commit data to disk immediately. It waits until the operating system, at some point, decides to flush the data to disk. So if you run multiple Kafka brokers on one machine and that machine dies, you will probably lose data. You shouldn't put two Kafka brokers on one machine; and if they run as two virtual machines that are on the same physical machine, you will have issues too. You also shouldn't put a whole Kafka cluster in one rack: if the top-of-rack switch dies, or you lose electricity or network connection, bad things will happen. So you should be very careful where you deploy Kafka. It doesn't matter whether it's on bare metal, virtual machines, or Kubernetes; you should always be careful. Kubernetes can help us with that, but we should be aware of it.
So I would say: no matter what the infrastructure below is, or what platform you use, Kafka requires a very good infrastructure and a team that can handle it. If you lack one of those, you should not run Kafka yourself. So, my wish list for Kafka on Kubernetes: we want StatefulSets for Kafka and ZooKeeper. StatefulSets are one of the ways to manage stateful applications on Kubernetes. You really want fast storage, and you really want a fast network. If you don't have fast storage or a fast network, you will get into trouble, or at least Kafka will be very slow. Kafka is very happy if it has fast storage, and you can get good performance out of it, but if your storage sucks, Kafka will suck for you too. For Kafka on Kubernetes, I also wish for services for finding clusters and brokers. And then, of course, nice to have would be declarative resources for everything.
If I have multiple brokers, I don't want to define all of them myself. I want to say: hey, I would like a Kafka cluster with three brokers, four brokers, five brokers, however many I want. Who of you likes configuring SSL keys? Not me. So I would like to just say: I want TLS encryption between all my services, do it. I don't want to do that myself. I want to declaratively define which users I want to have, which authentication method I want to use, and which users are authorized to do what. And finally, we have multiple topics in Kafka, and it would be great to have a declarative way to define them too. Because, hmm, if I create a new topic, and then another one, and then another one, at some point I lose my overview, and you know what happens: yeah, we have this topic, I don't remember what the name was or who is using it, but somewhere we have the information about it. So a declarative way for all of that would be great to have. It's about
infrastructure as code, the buzzword here, and Kubernetes provides it to us more or less for free, so why shouldn't we use it? And these are things that a Kubernetes operator provides us. So what is a Kubernetes operator? A Kubernetes operator is very similar to a Kubernetes controller, and a controller is the thing that looks at what you have defined and at the current state of the Kubernetes cluster, and makes things happen so that the cluster becomes the world you would like to have. The same goes for an operator. I sometimes wonder why they called them operators and not just Kubernetes controllers, but that's another topic. The basic idea of a Kubernetes operator is that it codifies knowledge, the knowledge of a human operator; that's where the name comes from. And most often, especially for Kafka, operators come with their own custom resources, so you can extend Kubernetes with custom resources where you can say: I would like a Kafka cluster, I would like a Postgres database with this configuration. Please do that for me, I don't want to do it myself.
At the current time, there are three Kafka operators out there. One is Strimzi. Strimzi is developed by Red Hat and is the oldest of the three. A recent, very interesting operator is the Banzai Cloud Kafka operator. It's relatively new; it was introduced just a few weeks ago. So if you start a new project, you should definitely look at the Banzai Cloud Kafka operator; what they do is very clever. And if you have a lot of money and want to spend it, and you want really good enterprise support, you will probably use the Confluent Operator, especially if you have Confluent Enterprise licenses or use their other services. We will talk today about Strimzi. Strimzi you can find in the CNCF Sandbox.
So Strimzi is developed by Red Hat, and Strimzi is the open-source version of that operator; if you want to give Red Hat a lot of money, you can buy AMQ Streams, which is the same product with enterprise support. Strimzi consists basically of three operators: the cluster operator, which manages all the cluster things, the topic operator, and the user operator. I didn't tell you the whole story: Kafka does not consist only of Kafka, it also has the so-called ZooKeeper. Who knows what ZooKeeper is? Some of you, okay. ZooKeeper is basically a helper tool for Kafka that does cluster coordination and stores some metadata. Luckily it will go away sooner or later, because nobody in the Kafka community likes ZooKeeper. ZooKeeper limits your performance, it limits the number of topics you can have, and so on, so they want to get rid of it. It will happen, soon, hopefully. So the cluster operator does the provisioning of the ZooKeeper ensemble and the provisioning of the Kafka
cluster, and also of the topic operator and the user operator. ZooKeeper itself comes without any transport encryption. Not nice. But you can use a TLS proxy, and the cluster operator configures everything for us; this way ZooKeeper talks through the TLS proxy, and we have encrypted traffic in our network. I like that. And obviously, you also do the global configuration for Kafka and ZooKeeper with the cluster operator. The topic operator, as the name says, does the topic management.
We can create new topics, delete topics, and adapt their configuration the way we need, and the interesting thing is that it is a two-way sync. What does that mean? If we create a Kubernetes resource for a topic, the topic operator will create a Kafka topic for us. But if we have an application that creates its own topics in Kafka directly, then the topic operator will also create a Kubernetes resource for us. This way we always have an up-to-date state of the topics in our Kubernetes cluster.
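Such a declarative topic, managed by the topic operator, might look roughly like this KafkaTopic resource (a sketch; the names and settings are illustrative, and the exact API version depends on the Strimzi release):

```yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaTopic
metadata:
  name: my-topic                    # illustrative name
  labels:
    strimzi.io/cluster: my-cluster  # which Kafka cluster the topic belongs to
spec:
  partitions: 3
  replicas: 3
  config:
    retention.ms: 604800000         # keep messages for 7 days
```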
We don't need to think about whether we created the topic in Kafka or in Kubernetes; we just have a consistent view. Similarly, the user operator does the user management: it handles authentication and authorization, and it stores the keys and passwords in Kubernetes Secrets.
This is not the most recommended way, because Kubernetes Secrets are not really secret; they are stored in plain text, so you should think about a key management system. But if you have a good Kubernetes setup, you probably already have a key management system, and then you just say: hey, Strimzi, please use that key management system, and it works. Okay, that's it for this part. Time for a demo. Let's see if that works. Okay, so I have here a Kubernetes cluster.
It's on Google Kubernetes Engine. You can't see anything right now, but give me a second. So you can say: kubectl get namespaces. Kubernetes consists of namespaces where you can logically group stuff, and it's relatively empty; if you create a new Kubernetes cluster on Google Cloud Platform, this is what you get. So we want to create our own namespace. I will just say kubectl create ns kafka, and switch to that one. Okay, it says we are now in the namespace kafka. Let's install. First, we have to install Strimzi, and one of the easiest ways to do that is using Helm. You just say helm install, give it a name, and say you want to install the Strimzi Kafka operator chart. It takes some time, and then it says: yes, we are ready. Then we can do a kubectl get pods, and we see that we have one
pod that is running, and the readiness state is zero of one. So we have probably to wait a second. And I'd like, I have a command, which is just, who of you knows about the watch command on Linux?
Awesome tool. So watch is doing, basically, you say watch, please execute the following command every second, and please show me a diff, a graphical diff for that. And then you will see, oh, look, we just execute kubectl get pods every second, and we see here the
diff: we see that the age of the container changes every second. Awesome. Okay, now we have the cluster operator running. What should we do next? Let's create a Kafka cluster. For that — whoops, that's the wrong one — you configure
Kubernetes using YAML files, and Strimzi installs its own custom resource definitions. This way we can say: I would like to have something with the API version kafka.strimzi.io, so Kubernetes knows that Strimzi is responsible for it.
And please do something with this kind, Kafka. Then we give it a name, and then we specify: I would like to have Kafka in version 2.3.0, I would like to have three replicas,
I would like to have a plaintext listener — we could also ask for a TLS listener, but this is just a demo environment, so it doesn't matter for us. For storage, I would like ephemeral storage. You should not do that in production, obviously: ephemeral storage means there is no persistent storage, and if we kill the pods, the data is lost.
For our environment here, that's fine. I would like to have a ZooKeeper ensemble with three replicas, again with ephemeral storage. And then I say: yes, I would like to have the topic operator and the user operator. Then I just do a kubectl apply -f kafka, and we can have a look at our watch to see what will happen.
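The resource applied here, reconstructed as a sketch — field names follow Strimzi's v1beta1 schema of that era, and the cluster name is an assumption:

```yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: my-cluster              # assumed name
spec:
  kafka:
    version: 2.3.0
    replicas: 3
    listeners:
      plain: {}                 # plaintext listener; a `tls: {}` entry would add TLS
    storage:
      type: ephemeral           # demo only -- data is lost when the pods restart
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
  entityOperator:
    topicOperator: {}
    userOperator: {}
```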
Oh, something changed — we got three ZooKeepers. Oh, they are fast today, they are already running, but we see the readiness state is zero of two, so we just wait a few seconds until they are ready.
Oh, they are already one of two, that's nice — two of two, and then we have a ZooKeeper ensemble running. And now the cluster operator sees: oh, ZooKeeper is up,
now I can start Kafka. It doesn't make any sense to start Kafka without a running ZooKeeper ensemble, because Kafka depends on it: when Kafka boots, it looks for its ZooKeeper,
and if ZooKeeper is not alive, it will crash. And when Kafka is running, the cluster operator creates a new entity operator, and the entity operator is basically one pod containing both the topic operator and the user operator. Good. So now everything is running, and I would like to talk to Kafka and do some things.
So there are commands called kafka-console-consumer, kafka-console-producer, and so on,
and the annoying thing about them is that you have to provide a bootstrap server, and the flag is inconsistent: some commands call it --bootstrap-server, some call it --broker-list,
and some of the old commands need ZooKeeper instead, and so on and so on. So I created a so-called toolbox, where I configured all of that information using environment variables,
and then you do not need to remember the parameters. For most of you who haven't seen something like that before: this is basically the definition of a Kubernetes pod — a pod is the smallest deployable entity in Kubernetes —
and I say I would like to have a pod with the name kafka-toolbox, with the following containers inside. The containers are just normal Docker containers. So I would like a container called kafka-toolbox, with that image, and with the following environment variables.
And the command the container should execute is just sleep infinity, because it shouldn't do anything by itself — I just want to log in to it and work from there. And I can create that pod by using kubectl apply -f kafka-toolbox.
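A minimal sketch of such a toolbox pod — the actual image and environment variable names were not readable in the talk, so both are placeholders here:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kafka-toolbox
spec:
  containers:
    - name: kafka-toolbox
      image: strimzi/kafka:0.13.0-kafka-2.3.0    # placeholder image with the Kafka CLI tools
      command: ["sleep", "infinity"]             # keep the pod alive for interactive use
      env:
        - name: KAFKA_BOOTSTRAP_SERVERS          # hypothetical variable name
          value: my-cluster-kafka-bootstrap:9092
```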
It's created, and now we can have a look at our watch — it's already running. Now I can connect to it: kubectl exec -it kafka-toolbox -- /bin/bash, where -i means interactive and -t says please give me a TTY.
And now I'm in that pod. We can say whoami — okay, I'm kafka — and now we can do funny things here. For example, we can say kafka-topics --list; Java prints some things,
and it gives us an empty list, because obviously we haven't created anything yet. So let's create a topic. Creating a topic is relatively straightforward. Again, this is just Kubernetes YAML: the API version is Strimzi's, the kind is KafkaTopic,
and in the metadata we name it my-topic and assign it to our cluster; I would like to have three partitions and one replica. One replica is okay for us — it means that if the broker holding that replica goes down, we lose that partition.
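That topic resource, as a sketch (apiVersion and the cluster label value depend on the Strimzi installation):

```yaml
apiVersion: kafka.strimzi.io/v1beta1   # depends on the Strimzi release
kind: KafkaTopic
metadata:
  name: my-topic
  labels:
    strimzi.io/cluster: my-cluster     # assumed cluster name
spec:
  partitions: 3
  replicas: 1
```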
But that's okay for our showcase. So I just say kubectl apply -f topic, and then I can say kubectl get kafkatopic,
and I will get my topic, named my-topic, with three partitions and a replication factor of one. If I do the same in my kafka-toolbox pod — I just say kafka-topics --list — it will say: hey, I know about that topic, my-topic.
You can also say kafka-topics --describe --topic my-topic, and we will get the whole configuration for it: a partition count of three, obviously a replication factor of one, and then we see which partition is on which replicas, on which brokers.
For example, partition zero is on broker number two, partition one on broker zero, and partition two on broker one. What else can we do? Let's write some stuff to it. For that, we use kafka-console-producer --topic
my-topic, and now I can write whatever we want. Let's also read from it; I will just open another terminal side by side.
Again I log into my kafka-toolbox, so I'm in, and then I say kafka-console-consumer --topic my-topic.
We can't see anything — what's happening here? By default the consumer only reads new messages, so I can say --from-beginning, and if we are lucky —
yes, we see the stuff I have written. Okay. And now I can write more, and some time later — I don't know why it's so slow here, but it doesn't matter too much — look: test, and so on. We remember we have three partitions, yes?
And maybe you noticed: look, bar was sent after this string and after foo, and we received it first. That's weird. Why is that? Any ideas — why is bar before the others? Yes — as he said: if we have multiple
partitions, we do not have any ordering guarantee between the partitions. Kafka guarantees ordering only inside one partition, not across partitions.
So if we send messages one and two to the same partition, we will get them in the same order; if we send them to different partitions, nobody guarantees the order. That is what we are experiencing here. So to see that, we can tell the console consumer:
I would like to have partition zero. Yeah. And then we see something: partition zero has that long string and baz in it. Let's have a look —
let's connect to another one, and say kafka-console-consumer --topic my-topic
for partition one: we see bar and test are inside. And let's have a look now at
partition number two: we see that foo and bloop are inside.
So let's clear everything. Okay, now we can produce some more data. For example, you write run and it appears in partition number zero; let's write one again — oh,
why is it in partition two? Let's do it again — now it's in yet another partition. Hmm. Any ideas what is happening? We're just doing round robin: first to this partition, then to that one, then to that one, and so on. That's the partitioning scheme when no key is given,
so we are just sending data, and it appears somewhere. Let's do it with some strategy instead. I need to copy that command because it's a bit longer — it enables keys. So we now use the console producer with keys. Again, I haven't told you the whole story:
a Kafka partition is not simply a log, it's a log of key-value pairs, and one of the reasons we have keys and values is to have a way to decide which
log to write to. We can say: let's write all messages with key one to partition one, and so on, and this way we can give guarantees. For example, if you have a user database and a topic containing those users, and you say, hmm, I have a lot of users,
but I would like to guarantee that the data for each user is in one partition, then the order for that user is always preserved — because if you spread the data for one user across multiple partitions, there is no guarantee that you will get the correct order back, and then bad things can happen. So we just do that.
We just ask for a way to insert key-value pairs, and I type one, one — the key is one and the value is also one — and then I write it.
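The "longer command" being copied here is not visible in the recording; the standard way to feed key-value pairs to the stock console producer looks roughly like this (the flag names are from the standard Kafka tooling of that era, and the separator character is my choice):

```shell
kafka-console-producer.sh \
  --broker-list "$KAFKA_BOOTSTRAP_SERVERS" \
  --topic my-topic \
  --property parse.key=true \
  --property key.separator=,
# then type lines such as:  1,1
```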
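The key-to-partition mapping the demo relies on can be sketched like this. Kafka's real default partitioner uses a Murmur2 hash of the key bytes; this simplified Python stand-in substitutes MD5 and only illustrates the guarantee that equal keys always land on the same partition:

```python
import hashlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    """Deterministically map a message key to a partition.

    Kafka's default partitioner actually computes murmur2(key), masks it
    to a non-negative int, and takes it modulo the partition count; MD5
    is used here purely to illustrate the behavior, not the algorithm.
    """
    h = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return h % num_partitions

# Equal keys always map to the same partition; different keys may collide.
print(pick_partition(b"1", 3) == pick_partition(b"1", 3))  # True
```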
Ah, maybe that's a bad example. Let's clear that. Now, every time I use the key one, the message will appear in partition number zero. The value doesn't matter — and you see that,
by default, the consumer doesn't show us the key; most of the time it doesn't matter too much. Some other value — and it's always sent to partition zero. If we use key two
and put some data in, it appears over there. You see? And that's guaranteed. The console producer in this case uses the default partitioner, and the default partitioner uses a hash algorithm to decide which
partition to send the data to. Then let's try three, and three again: key three appears in partition number two again, not in partition one, because that's how the hash algorithm works out. Let's try where key four will appear — and now we know that
key one goes to partition zero, key two to partition two, key three to partition two, and key four to partition one. What else can I show you? Ah, yes. And now,
let's exit that. If you write to a topic that doesn't exist, then in the default configuration Kafka creates the topic automatically. So I clear this and say kafka-console-producer --topic new-topic —
a topic which doesn't exist —
and, yeah, that's it. And now we get an error: leader is not available. That's weird — well, actually it's relatively obvious: we are sending a request for a topic that
doesn't exist yet, and that's why Kafka is creating the topic for us right now. We can have a look, and if we continue writing something to it,
we won't get any errors anymore. And if we now use kafka-console-consumer --topic new-topic --from-beginning, we should get the
values — yes, there is our stuff. And now the interesting thing: remember that the topic operator works in both directions. So if we do kubectl get kafkatopic, we now get the new topic too, with a default configuration of one partition
and a replication factor of one. In a production environment you should probably disable implicit, automatic topic creation — that's not a good practice. And we also see another topic called __consumer_offsets,
which is an internal topic where Kafka stores the consumer offsets. This is how consumers know which messages they have already read and which they still have to read: they go to that topic, say please give me my offsets, and then
they can decide where to start or continue reading. Okay. So that's it for the demo. I hope it brings back the correct screen — yes! Okay, so what have we explored today?
We have explored a bit of Kubernetes — Kubernetes for me is a declarative container orchestration tool — and Kafka, which is a distributed log. If somebody asks what Kafka is: Kafka is a distributed log, and then you can walk away, because the rest is obvious. Not really. There are also other ways to run Kafka on Kubernetes, but this is the one I have explored;
the others I haven't tried so far. So thank you for being here, and if you want to learn more about Kafka, I will give my next Kafka training in September
in the Linuxhotel in Essen. It's only in German, but if you want to join, ask me — or if you have more questions, just send me a message in whatever way you like. Thank you. So we have some time for questions — any questions so far?
You showed only ephemeral storage for Kafka on Kubernetes — how would persistent storage work, with pods breaking down and so on? The persistent storage works as usual in Kubernetes —
how fluent are you in Kubernetes? Not bad. So Kubernetes has a concept called persistent volume claims, where you can say: I would like to have a persistent volume with these properties. For example, in Google Cloud —
it depends on your environment — if you are on a cloud like Google, you can say kubectl get storageclasses (short form: sc), and there is a standard storage class,
and you can also create your own, and so on. You would probably want fast SSD storage for your Kafka cluster, and you can just declare: I would like to have SSD with this much storage — and that's it. It's relatively straightforward.
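In Strimzi terms that would presumably be a persistent-claim storage section in the Kafka resource; the storage class name here is made up:

```yaml
spec:
  kafka:
    storage:
      type: persistent-claim
      size: 100Gi
      class: fast-ssd        # hypothetical StorageClass name
      deleteClaim: false     # keep the volume when the cluster is deleted
```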
Again, you should be relatively careful about how you choose that. And as long as I have the mic: can you talk a little more about how all the components
find each other? Like the toolbox you had — how did it find the Kafka brokers you previously set up? Okay. The way to find things in Kubernetes is services. A service creates a DNS name for your thing, so you can define, for example,
a Kafka service or a my-hello-world service, and then you have a name for it. If you say kubectl get services, we see that we have four services here: a Kafka bootstrap
service, a Kafka brokers service, a ZooKeeper client service, and a ZooKeeper nodes service. And you see that the bootstrap service has a cluster IP — you can either use that IP to contact the bootstrap service, or you can use the DNS name.
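Strimzi's bootstrap service follows the usual Kubernetes DNS pattern (service-name.namespace.svc), so from inside the cluster a client could point at something like the following — assuming, as above, that the cluster is called my-cluster and lives in the kafka namespace:

```shell
# 9092 is the plain listener port; the exact service name depends on the cluster name
kafka-topics.sh \
  --bootstrap-server my-cluster-kafka-bootstrap.kafka.svc:9092 \
  --list
```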
And we can have a closer look with kubectl describe service. Then you see that we have multiple endpoints behind it —
this is what we care about, and these are basically the Kafka brokers. Kubernetes creates iptables rules for this: the cluster IP is a virtual IP address, and when you contact it, or resolve the DNS name, you get routed more or less randomly to one of the
brokers. And we defined in our toolbox that the Kafka bootstrap server value is exactly that service, so we can just use the nice DNS name. Normally you would enter a list of bootstrap servers — all your brokers — but we don't need
to do that in Kubernetes, because we have the service: it routes you to some broker, and that broker then gives you back the addresses of all the brokers.
And if there are no further questions — then thank you. Thank you for your attention. Thank you.