Advanced Infrastructure Management in Kubernetes using Python
Formal Metadata

Title: Advanced Infrastructure Management in Kubernetes using Python
Title of Series: EuroPython 2020 (talk 88 of 130)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/49915 (DOI)
Transcript: English (auto-generated)
00:06
Thank you. Hello, welcome everyone to EuroPython. Good morning, good afternoon, or good evening, from whichever time zone you're joining. So this talk, as evident from the title, is going to be around managing complex applications on Kubernetes while staying in
00:27
the Python ecosystem. So Kubernetes is mostly written in Golang, and its most active community is in Golang, but I wanted to introduce some of the frameworks, tools, and software
00:40
patterns that let us code all of that automation in Python, while staying in the Python ecosystem. For this particular talk, it will be best if you have some basic understanding of Kubernetes concepts like pods, deployments, or services. You'll be able to get the most out of this talk if you have knowledge around some of
01:03
the basic Kubernetes concepts. A little bit about myself: I'm Gautam, and I'm a software engineer at Grofers India, one of the largest online grocery shopping platforms in the country. We run a fleet of more than 20 microservices on a Kubernetes cluster,
01:23
which, at the extreme, serves up to a million daily active users. Those microservices are written in Python, with Flask and Django. I completed my bachelor's in software engineering from Delhi Technological University, graduating back in 2018. I did GSoC with
01:41
LibreOffice. I love open source. I have contributed to Mozilla Firefox for Android, the OpenMRS medical record system, FOSSASIA projects, and others in the past. And as Sangarshan said, this is going to be my first talk at any conference, for that matter, so please bear with me. Okay, so I've divided this talk into four phases.
02:08
In phase one, we are going to do introductions. We are going to introduce and discuss some problem scenarios that come from running applications in Kubernetes: problems with configuration management, setting up a database
02:22
cluster. And then we are going to introduce the focus problem for this talk, which is around running a Celery cluster in production. Since Celery is a very popular distributed task queue system written in Python, I chose it for this talk. In phase two, we are going to generalize the learnings: all the manual steps that we do, and what the different pain
02:46
points are of managing stateful applications in general in Kubernetes. We are going to discuss the goal for the Celery automation that we are going to do, solving each of the manual steps incrementally. And then we are going to discuss the extension capabilities
03:01
in Kubernetes that are going to help us achieve that automation. In phase three, we are going to build that solution incrementally; at each step, whatever manual steps we discussed, we're going to automate them and see them in action. We're going to see the custom Celery resource and then the operator. An operator is
03:21
something that I'll discuss later in the talk, reacting to events. We will also see the auto-scaling and downscaling of Celery workers based on queue depth, which is not really provided out of the box by Kubernetes. And in phase four, we'll conclude: we're going to see what the world is doing with operators, existing frameworks, SDKs, and other use cases.
03:43
And then we'll proceed to Q&A. Okay, let's start with the first problem. I'm going to discuss three real-world scenarios as problems, or rather, call them opportunities where we can automate stuff. This first one is a very common problem with configuration management in Kubernetes. Kubernetes provides the ConfigMap and Secret objects
04:03
to manage your configuration on the cluster. And there is a very common problem with that: whenever you change a value in a ConfigMap, you need to go and restart the corresponding deployment for the change to take effect. This is a very burning issue.
04:21
You can see that from the number of reactions on this issue on the Kubernetes open source project. One way of solving this would be to imagine a watcher pod that manages those ConfigMap and Deployment objects: as soon as you change the ConfigMap values, it automatically restarts the relevant deployment.
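A minimal sketch of that watcher idea, using the Kopf framework we'll meet later in the talk (the label convention and config loading here are my assumptions, not code from the talk):

```python
import datetime
import kopf
import kubernetes

kubernetes.config.load_incluster_config()  # assumes the watcher runs in-cluster

@kopf.on.update('', 'v1', 'configmaps')
def restart_dependents(name, namespace, **_):
    # Hypothetical convention: deployments opt in by labelling themselves
    # with the name of the ConfigMap they depend on.
    apps = kubernetes.client.AppsV1Api()
    deployments = apps.list_namespaced_deployment(
        namespace, label_selector=f'config-dependency={name}')
    for deployment in deployments.items:
        # Touching a pod-template annotation triggers a rolling restart.
        patch = {'spec': {'template': {'metadata': {'annotations': {
            'config-reloaded-at': datetime.datetime.utcnow().isoformat()}}}}}
        apps.patch_namespaced_deployment(
            deployment.metadata.name, namespace, patch)
```

This is one of the opportunities that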
04:41
has the potential for automation. Coming to a slightly more complex example: setting up a database cluster, say Postgres or MongoDB. Running a database is actually easy. You just need to write a deployment spec,
05:00
the declarative spec that you're going to provide. You define a persistent volume, then you claim that volume in the deployment. Running it is as simple as that. However, managing that cluster over time is difficult. You need to set up connection pooling. You need to manage resizes or upgrades in case space runs out or you're running on an older
05:21
version. And you need to take care of reconfiguration, which requires operational expertise: if I'm working with Postgres, I need to know its internals, its configuration templating, and so on. There are various other problems like backups and recovery as well, which need an infrastructure operator to do all these things manually.
05:45
Now, coming to something that is very popular in the Python ecosystem called Celery, which is also going to be the focus of this talk. Let's start with the basics: what is Celery? Celery is a popular distributed task queue system. I work for an e-commerce
06:02
company, so there are typical use cases where we use Celery for our asynchronous workloads: sending emails and SMSes, or doing anything after an order has been placed, like triggering cashbacks, promotions, or rewards to users based on order status changes. This is what we use Celery for. And on the bottom left, you see a very basic
06:27
Celery Flask application architecture. There is a Flask application that pushes messages to a broker, which could be Redis or RabbitMQ or any other broker. And then there are Celery workers,
06:41
which pick the tasks from that broker, process them, and send the result back to the broker or wherever you have configured. And this is what a very simple Flask Celery application looks like: you define a Flask application, you define the broker URL, which is a Redis master in this case, and you define the result backend. And there is a very
07:01
simple task defined, which adds two numbers and returns the result; this happens asynchronously later on. And there is the command that needs to be run to start a Celery worker: you provide the path to your Celery application, you pass the worker argument, and then you have tons of configuration options
07:23
like concurrency, log level, and all the other things that Celery provides. So this example we saw was a very basic Flask Celery application that you can set up locally and try out.
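A minimal sketch of such an application (the Redis URL and the HTTP route are illustrative, not the exact code from the talk):

```python
# app.py -- minimal Flask + Celery sketch, assuming a Redis master
# is reachable at redis-master:6379.
from celery import Celery
from flask import Flask, jsonify

flask_app = Flask(__name__)
celery_app = Celery(
    'app',
    broker='redis://redis-master:6379/0',    # where tasks are queued
    backend='redis://redis-master:6379/0',   # where results are stored
)

@celery_app.task
def add(x, y):
    # Runs asynchronously on a Celery worker, not in the web process.
    return x + y

@flask_app.route('/add/<int:x>/<int:y>')
def enqueue_add(x, y):
    result = add.delay(x, y)  # push the task to the broker
    return jsonify({'task_id': result.id})
```

A worker for this app would then be started with something like: celery -A app.celery_app worker --loglevel=info --concurrency=2. Now, when you have to deploy this Flask Celery example on production,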
07:45
on Kubernetes specifically, then you need to have a worker deployment YAML that looks somewhat like this: the kind Deployment, how many replicas you want, how many workers you want. Then you specify the containers, like the container named celery,
08:03
and then the image which you're going to pull, and the command that is going to run inside the containers. Then there are different arguments like queue names, log level, concurrency, and so on, plus the resource constraints that you can specify. So this is one manual step: you need to write a worker deployment YAML.
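A hedged reconstruction of what such a worker deployment spec might look like (the image name and queue are placeholders, not the exact YAML from the talk):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: celery-worker
  template:
    metadata:
      labels:
        app: celery-worker
    spec:
      containers:
        - name: celery
          image: example/flask-celery:latest   # placeholder image
          command: ["celery", "-A", "app.celery_app", "worker",
                    "--queues=default", "--loglevel=info", "--concurrency=2"]
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
```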
08:21
Then, when you're running in production, you need to set up monitoring as well. You need to make sure that your Celery workers are running all the time, that your broker is healthy, and that your messages are actually being processed. The de facto standard for that is Flower, or however you'd like to pronounce it. So the de facto standard to monitor Celery is
08:40
Flower, and then you need to also write a Flower deployment spec and expose that deployment as a Service, so that people outside the cluster can access it and actually see whether your cluster is working fine or not. And then you also have to manage auto-scaling when you're running in production. You never know when there is going to be a high
09:01
workload or a low workload. So you need to set up some kind of auto-scaling, where you might want to scale the workers on a resource constraint: if my CPU or memory usage has increased beyond a certain limit, then I need to scale. Or, something very specific to Celery: if the number of messages in your Redis queue is increasing,
09:26
you can simply scale the number of workers to maintain an average number of messages to be processed by each worker. But that is not really supported in Kubernetes directly.
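For the resource-based case, a standard HorizontalPodAutoscaler would look roughly like this sketch (thresholds are illustrative); the queue-based case is what we will build ourselves later:

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: celery-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: celery-worker      # the worker Deployment sketched above
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```

And summarizing these problems of running a Celery cluster in production,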
09:44
here is what all you need to do; there is a block diagram on the right. There is this worker deployment, which we discussed, which is going to manage the Celery worker pods. There is the Flower deployment, which is going to manage Flower pods. There is the Flower service, which is going to send request traffic to the Flower pods and show the results back to the user.
10:03
And then there's the simple Flask Celery example that we saw: the Flask application pushes messages to the broker, and Celery worker pods pick the messages and keep processing them. This is a typical Celery cluster in production. Now, coming to the problems when you manage this cluster in production: it's not easy to get a new setup, right? We saw all the
10:25
manual steps that we needed to go through. And there is no way to set up multiple clusters in a consistent way: when you are working across different teams, with more than 100 engineers using Celery for different use cases, there is no way to set
10:42
up every cluster consistently; everyone configures it their own way. There are a lot of possibilities for misconfiguration, because Celery and Flower both provide tons of configuration options. You might misconfigure concurrency or the logging level, and anything that can go wrong will go wrong in production. Later on, it also creates problems with
11:03
infrastructure audits. Nobody knows how many resources are being used by a cluster, or whether it actually requires them or not. All these things are problems when it comes to running Celery in production. Generalizing these learnings, these problems are opportunities.
11:22
We can simply say that managing stateless applications on Kubernetes is easy, but stateful applications like databases, caching systems, and message queuing systems need specific domain logic for how they are to be set up in production, and then scaled, upgraded, or recovered in case any disaster
11:41
happens in production for a typical business use case. Kubernetes is designed for automation, and it is possible to extend its behavior to manage all these complex applications while staying in the Python ecosystem. There is also one more problem: you need to bridge the gap between
12:01
application engineers and the infrastructure operators who actually manage these services. Next, we are going to discuss the goals for this problem. As mentioned here, deploying and managing stateful software like Celery should be made easy for everyone.
12:20
Kubernetes has seen wide adoption because of its declarative way of specifying configuration. And if I could specify my Celery deployment something like this, with a kind Celery and a common spec where I can provide my app
12:41
name, the path to my Celery app, and the image that I'm going to run, and then a worker spec with the number of workers. I've limited it to very simple configuration options out of what Celery provides, for now. Similarly, there is a Flower spec, resource constraints, and all these things that you can configure. If I, as an application developer, were able to
13:04
specify a YAML like this and do nothing more than kubectl apply -f spec.yaml, it should set up all the worker deployments, their monitoring, and their scaling automatically, in the best way possible. This is the goal for this talk; this is what we
13:22
are going to achieve in the end. Kubernetes should be able to understand this spec and take actions accordingly on the different events that happen. So, as I discussed, there is this kind Celery specified here. Now, Kubernetes does not know out of the box what Celery is, or what your Postgres or
13:45
any other database is. It knows what a Deployment is, what a Pod is, what a Service is. But it is possible to extend that behavior using a concept called CRDs, custom resource definitions. I can define my custom resource in Kubernetes and extend the
14:02
Kubernetes APIs to understand that custom resource named Celery. A CRD also lets you provide a structured schema, the worker spec and Flower spec that we saw; it lets you define the structural schema of that custom object. And that helps in standardizing the specification across the Kubernetes cluster that you are running, for the multiple Celery
14:25
applications. So this is the block diagram we saw. There were the native Kubernetes objects, the worker deployment, the Flower deployment and service that we saw earlier in the talk. And then there is the CRD: I'm going to define the CRD and then the custom resource.
14:43
And then that Celery resource, which we saw, will have some sort of status, and it is going to pass through some logic that we are going to discuss next. Kubernetes should be able to understand all of this.
15:03
That's the whole aim. So what will this custom resource definition for Celery look like? Somewhat like this: I define a CustomResourceDefinition, I give the metadata, the kind Celery and its short names, and I
15:26
specify an OpenAPI v3 schema for this object. I'm going to show you this in full; just a second. Okay, yeah. So this is what my custom resource definition for
15:48
Celery looks like. This is a very proof-of-concept version right now, not a fully production version, to keep the talk simple. And this is the spec that I'm expecting.
16:00
There is the common spec, the common configuration parameters that I can pass in. Then there's this worker spec, which can accept a number of properties, like the number of workers you want, queues, log level, concurrency, and then there's the Flower spec. Towards the end, we have the auto-scaling targets as well. So this is how this
16:21
custom Celery resource will look. With the help of this, the Kubernetes cluster will be able to understand my custom Celery resource that looks like this: kind Celery, the common parameters, the worker spec, the Flower spec, and all the scaling targets that I've specified.
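A hedged sketch of such a resource (the API group, version, and field names approximate the talk's slides, not the exact YAML):

```yaml
apiVersion: celeryproject.org/v1alpha1   # illustrative group/version
kind: Celery
metadata:
  name: celery-crd-example
spec:
  common:
    appName: demo-app
    celeryApp: app.celery_app            # path to the Celery app object
    image: example/flask-celery:latest   # placeholder image
  workerSpec:
    numOfWorkers: 2
    queues: default
    logLevel: info
    concurrency: 2
  flowerSpec:
    replicas: 1
  scaleTargets:
    minReplicas: 2
    maxReplicas: 5
```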
16:40
Okay. So a simple way of creating this custom resource definition is kubectl apply -f deploy/crd.yaml. If you get the CRDs, you'll see that the Celery CRD has been created. And when I create my custom resource,
17:01
which is deploy/cr.yaml, I can also get the Celery applications that are currently running on my cluster. Now, right now, nothing more will happen. Kubernetes is just able to recognize that some Celery resource has come in, and it has
17:21
to accept it and store it in the database. That's it. Now, coming to the thing that is going to react, that is going to make all that automation happen: controllers. Controllers in Kubernetes are at the core of its self-healing capabilities, and they continuously execute control loops for all the API objects they're
17:44
watching. On the right side, there is a very simple example: a ReplicaSet controller. You specify that you need three pods. Then this ReplicaSet controller constantly runs a control loop that makes sure this number is always
18:06
there in the system. It continuously checks the observed state; depending on whether the number of pods is more or less than three, it creates more pods or deletes extra pods accordingly, and it eventually makes sure that your desired state
18:20
is reached. Now, Kubernetes also provides the flexibility to write custom controllers to manage your custom resources, like the Celery resource I created: I can write a custom controller that watches my Celery resource and takes appropriate actions.
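A minimal pseudo-Python sketch of that control-loop idea (the helpers are hypothetical; this is just the reconcile pattern, not real controller code):

```python
import time

def reconcile_forever(desired_replicas: int) -> None:
    # Level-triggered reconciliation: keep comparing the observed state
    # with the desired state and converge, no matter how the drift arose.
    while True:
        observed = count_running_pods()                  # hypothetical helper
        if observed < desired_replicas:
            create_pods(desired_replicas - observed)     # hypothetical helper
        elif observed > desired_replicas:
            delete_pods(observed - desired_replicas)     # hypothetical helper
        time.sleep(5)  # real controllers block on watch events instead
```

So, coming to this reconciliation loop: Kubernetes works on the concept of level triggered versus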
18:48
edge triggered, which some of you might have studied in electronics. So what happens in level triggered is that when a signal goes from zero to one, there is a loop that continuously
19:03
executes at that level until the signal comes back down to zero. In the edge triggered concept, your code is executed only at the moment the state changes from zero to one. But in level triggered, it continuously keeps acting
19:23
for as long as the signal is at that level. So, okay, coming to the next part: when you combine CRDs with the custom controllers that you define, you build this thing called the operator pattern.
19:46
The operator pattern is a software pattern, a design pattern, which lets you manage complex applications in Kubernetes. It takes care of creation, scaling, upgrades, recovery, and more. Later in the talk, we are going to actually code that custom controller which we saw,
20:04
and operators are simply software that extends the native Kubernetes abilities to reliably manage all these complex applications. The concept was introduced by CoreOS, which has now been acquired by Red Hat. You can simply call an operator a Kubernetes-native app,
20:23
similar to what you have with Android apps: Android exposes APIs on which you can build apps, and similarly, Kubernetes exposes APIs to build apps for itself. The operator pattern is one of the design patterns you can follow to build a Kubernetes-native app. Again, all operators are controllers, but not every controller is an operator.
20:43
There's a very important distinction: a controller is what we saw earlier, possibly a very generic one that just runs a reconciliation loop, like the ReplicaSet controller or the Deployment controller. Operators are custom controllers that have operational knowledge baked into their code. Now, coming to the implementation: operators can be written
21:05
in any language runtime that can interact with the Kubernetes API. This talk specifically encourages writing operators and supporting frameworks in the Python ecosystem. Right now, Golang is the popular choice because the whole of Kubernetes is written in Golang.
21:23
So, this talk is about what you can achieve with Kubernetes while staying in the Python ecosystem. There are a lot of existing operators out there: there is a Prometheus operator, an etcd operator, a MongoDB operator. It's as
21:43
simple as installing these operators, and they're going to take care of managing the whole cluster for you. Coming to the controller part: now we are going to implement our custom controller
22:01
for Celery. Let's start with creation. Whenever I create a new Celery resource, there should be something that reacts and brings up the worker deployments, the Flower deployment, and the Flower service, all the manual steps that we did for running a
22:22
Flask Celery setup in production. I have used a popular framework called Kopf, the Kubernetes Operator Pythonic Framework, open-sourced by Zalando, a Germany-based e-commerce company. The general idea of that framework is that it takes care of interacting with the
22:45
Kubernetes API automatically, and it exposes handlers that you can code. You just need the domain expertise to code in Python, and everything else is taken care of; you don't need to know Kubernetes internals to write a controller.
23:03
So, this is a simple watch that I'm doing on my Celery resource. This handler is fired when I create my Celery resource. Number one, it validates the spec, whether the incoming spec you have specified is valid or not, and then
23:26
it instantiates the Kubernetes API clients and deploys the workers, Flower, and the service. These are simple utility functions that just hit the Kubernetes API using a YAML. And
23:40
the children that it has created are returned; they go into the response of this create function. So, as you can see in the block diagram, this creation handler is watching the custom resource and it sends the status,
24:00
with all the children that it has created, back to the resource status.
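A minimal Kopf sketch of such a creation handler (the handler body and helper names are illustrative, not the exact code from the talk):

```python
import kopf

@kopf.on.create('celeryproject.org', 'v1alpha1', 'celeries')
def create_fn(spec, name, namespace, **kwargs):
    # 1. Validate the incoming spec.
    validate_spec(spec)                                  # hypothetical helper

    # 2. Create the children: worker Deployment, Flower Deployment, Service.
    children = [
        deploy_celery_workers(spec, name, namespace),    # hypothetical helpers
        deploy_flower(spec, name, namespace),
        expose_flower_service(spec, name, namespace),
    ]

    # 3. Kopf stores the returned dict on the resource status
    #    (under status.create_fn), which is the status we query later.
    return {'children': children}
```

Now, I've talked enough; let's see something in action. I've made a demo video. All right. First of all, we're going to see the CRD and CR creation. I talked about how you create a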
24:25
CRD, and here is the crd.yaml we apply now. So now the Kubernetes cluster will be able to recognize the Celery resources I create. I'm now going to create my custom Celery resource as well. I recorded a video; I wanted to do this live, but my system is kind of low on RAM when
24:46
doing screen sharing and all those things. Okay, now I did that; I created the custom resource. Now I'm going to deploy my handler, my operator. Just a second. If you see
25:08
on the right, I have created a watch on the pods. As soon as the operator comes in, it's going
25:20
to identify that I created a custom Celery resource, and it's going to execute all those things: it creates the deployment for Celery workers, the Flower service, and all those things automatically. Now we're going to see if our cluster is in a healthy state by checking Flower. So, this is my service
25:48
that exposes Flower. Okay, yeah. So, this is the Flower UI: I have two Celery workers
26:11
that are currently online. I have not started pushing in tasks yet. I can simply monitor the application like this. Okay, so that was the creation handler in action.
26:27
Similarly, I'm going to check the status of the creation handler, which gets stored in this custom object that I created. And if you see this creation function,
26:47
sorry, that's the problem with a recorded video, you cannot keep pace with what you're saying. So, this is the creation function: it stores all the children it created, the deployment, how many worker replicas it made, the services, and the configuration for that service.
27:08
Okay. So, moving back to the PPT: what we just saw was the creation handler. Now we might want to edit our cluster configuration while it's running
27:24
in production, so you need some sort of update capability as well. For the update, there is this update function where I get the diff of what was updated, so I know which parts of the spec changed. If I
27:44
modified the common spec, then I need to update all the deployments for the Celery cluster. If I just modified the worker spec, I need to modify only the worker deployments, and similarly for Flower. The update handler, again, returns the
28:01
result back to the resource status, which we saw.
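A rough Kopf sketch of that update handler (again with illustrative helper and field names):

```python
import kopf

@kopf.on.update('celeryproject.org', 'v1alpha1', 'celeries')
def update_fn(spec, diff, name, namespace, **kwargs):
    # Each diff entry is (op, field_path, old, new); field_path is a tuple
    # like ('spec', 'workerSpec', 'numOfWorkers').
    touched = {entry[1][:2] for entry in diff if len(entry[1]) >= 2}

    if ('spec', 'common') in touched:
        # Common config affects every child: patch all the deployments.
        update_all_deployments(spec, name, namespace)    # hypothetical helpers
    else:
        if ('spec', 'workerSpec') in touched:
            update_worker_deployment(spec, name, namespace)
        if ('spec', 'flowerSpec') in touched:
            update_flower_deployment(spec, name, namespace)

    # As with creation, the return value lands on the resource status.
    return {'updatedFields': [list(path) for path in touched]}
```

Okay, so let's see the same in action. I'm going to edit my Celery resource. I'm just going to leave the common spec as it is.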
28:42
I'm going to set my Flower replicas to two. I'm going to change my worker spec: let's say I want my concurrency back to one, the log level to info, and the number of workers to be four instead of two. You can see the current state of the cluster, all the pods that are
29:06
running in the cluster. As soon as I edited the resource, my operator invoked the update handler, and it created more workers for me. This is eventually going to reach that state of four
29:24
workers. Right now, the deployment strategy is set to rolling update, so it's going to do that rolling update with the max surge possible. Perfect. Now, moving on to
29:44
the auto-scaling part, which is the coolest part of having this operator. How do you actually handle that? You need to auto-scale your workers based on queue length. So, there is a timer that runs every ten
30:01
seconds, and it hits the Flower service to know the current status and the current queue length in the broker, and it publishes that back to the status. This is as simple as just publishing the message queue length taken from the Flower service to the resource status. And coming to the scaling itself,
30:27
there is a handler that is watching the published queue length, and as soon as that changes, it is triggered. It takes in the number of current replicas, what the scaling targets are, and the min and max replicas.
30:46
Through a simple algorithm, it makes sure the updated number of replicas keeps the average number of messages per worker in check, and it patches the worker deployment to make that happen.
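A hedged Kopf sketch of the timer and the scaling handler (the status field path, the target of ten messages per worker, and the helpers are illustrative):

```python
import math
import kopf

@kopf.timer('celeryproject.org', 'v1alpha1', 'celeries', interval=10)
def message_queue_length(spec, **kwargs):
    # Ask Flower for the current queue length in the broker; Kopf publishes
    # the returned dict onto status.message_queue_length.
    return {'queueLength': get_queue_length_from_flower(spec)}  # hypothetical

@kopf.on.field('celeryproject.org', 'v1alpha1', 'celeries',
               field='status.message_queue_length.queueLength')
def autoscale(spec, new, name, namespace, **kwargs):
    if new is None:
        return
    targets = spec['scaleTargets']
    per_worker = 10  # illustrative: desired average messages per worker
    desired = math.ceil(new / per_worker)
    desired = max(targets['minReplicas'], min(targets['maxReplicas'], desired))
    # Patch the worker Deployment's replica count to converge on `desired`.
    patch_worker_replicas(name, namespace, desired)      # hypothetical helper
```

So, this is the block diagram: it simply watches the queue.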
31:03
It triggers scaling of the worker deployments based on whatever queue length it is getting. So now, the auto-scale handler demo. If you look at the bottom, there's a message queue length timer that is being invoked every 10 seconds.
31:22
Right now, I'm not pushing anything into the Redis broker. I just created a Flask example that is going to bombard our Redis queue with a number of tasks. Now the changes will start to happen. As soon as the status changes, like the
31:45
number of messages increased, the auto-scale handler was triggered, and it increased the number of workers to handle that increased load. As you can see, it updated the number of replicas to five, which was our limit, as we are currently bombarding the task queue.
32:05
And as you can see, all the workers are online and they're processing continuously. Now I'm going to delete my Flask application to see whether downscaling happens as the number of messages goes down.
32:26
Let's wait for a bit so that all these Celery messages being processed go down in number. All right, yeah. The changes have started happening. As soon as the application that was pushing went down, the operator starts terminating all the extra
32:45
pods and takes care of managing this whole cluster automatically. So you can see this is really cool stuff, right? I just needed to deploy that operator
33:04
once, and I just needed to specify my custom resource as a declarative spec, and that is it. It is going to take care of everything from setup to updates to actually scaling the resources
33:20
automatically. Okay. So, merging all those diagrams that we saw previously during the talk: we started from here, with a very basic Flask Celery example. The Redis master is there, and the queue is being consumed by Celery worker pods. Then there's the creation
33:40
handler that we saw, which creates all these objects, and the update handler, which updates all these deployments. And this is a flowchart of the whole operator that I built for this talk. It's at a proof-of-concept stage; it still has a way to go for production. But yeah, this is something interesting that I wanted
34:06
to share with the community. Okay, so we're moving towards the end of the talk. Among all the things we talked about, we saw the problems and opportunities of running
34:21
applications on Kubernetes. We saw the manual steps we need to take to launch Celery on a production Kubernetes cluster. Then we saw the goals: what I want to have as an application developer or an infrastructure engineer.
34:43
Then we discussed the extension capabilities: controllers, the operator pattern, and CRDs. And we saw the creation, update, and auto-scaling implementations in action. Okay, so, next steps for this project. I wanted to learn about operators in general, so I created this project,
35:04
and it is live and open source on my GitHub. There is still some way to go to make it production-ready. If you're running Celery in production, if you're running Kubernetes in production, you're more than welcome to tell me what else we can improve
35:24
in this operator. I'm going to keep committing weekly based on the feedback that I get from this conference and the other ones. The north star aim for this operator could be to include it with the Celery 5 release milestone. I'm here to discuss it with
35:43
the Celery maintainers. There is an ongoing discussion on the Celery enhancement proposals repo around the same: they wish to have a Helm chart or an operator for Celery. So, this is going to be exciting. Okay. So, what are different people doing with
36:02
operators? It's a relatively new concept; it was introduced back in 2016. There is a repository called Awesome Operators, where you can see all the awesome operators that people have built using Golang, Python, and other languages. There is a registry of operators
36:20
as well, OperatorHub.io. There are operators for all these applications that are very famous and used in production clusters: Prometheus, Airflow, Couchbase, MongoDB, Consul, and so on. And there's an idea that I wanted to share: when you're running a fleet of more than 100 or 200 microservices, when you're running
36:42
at the scale of, let's say, companies like Pinterest or Instagram, there could be an operator that lets you set up a new microservice, and it'll inject the standard pieces like containers, volumes, logging, and monitoring, and create a Grafana dashboard automatically for you. These are
37:03
rather manual tasks in which your infrastructure engineer is usually involved. There are different frameworks and resources to build operators, like Kopf, which I discussed, open-sourced by Zalando SE; it is a Python framework.
37:21
There's the Operator SDK, which is in Golang; Golang has multiple resources, as the SDK was officially launched for Golang. And then there is Metacontroller, which is a Kubernetes plugin that makes it easier to write custom controllers in any language. As I said before, my aim for this talk was to introduce operators to the Python community
37:49
and what we can do to make this community more mature at building Kubernetes-native apps. All right, I have spoken enough, so.
38:07
Okay, awesome. It's time for Q&A. So here's the first question: do you have any front end to see the task progress across the scaling activity? Mm-hmm. No, I don't have one. I haven't worked with the Minikube dashboard,
38:28
but I'm sure the Minikube dashboard must provide this. I don't have it right now; the demo I showed you is all there is. And
38:42
I'm going to try it out, and we can take this discussion to our breakout room. Okay. So an anonymous attendee would like you to share your slides and the example code on GitHub. Yeah, I think I've shared the link
39:02
to the repository; I'm going to post it in my breakout room. You're more than welcome to take a look at the code. It's not really production-ready, so please bear with that. There's still some way to go, and I'll be working actively to improve it.
39:24
Okay. So here's the third question: are there any problems with failed tasks during downscaling? With failed tasks? No. The downscaling is automatic, and the deployment as a whole manages it automatically. The Celery workers are
39:48
only going to pick up tasks when they are ready, and when they are killed, they stop taking tasks. In this case, I think it was fine: we saw in the Flower monitoring that none of the tasks actually failed, and none of the workers were killed
40:04
while they were processing tasks. So I think Celery and Flower automatically take care of that: whenever a worker is killed, it's not killed amidst processing a message. Okay, awesome. So actually, I have a question. Could you maybe tell us
40:27
the difference between having a Helm chart and having a custom operator for the same use case? Why would you go for a Helm chart, and why would you have an operator? Yeah, all right. That's a great question that I was expecting from people as well. So, a Helm chart,
40:42
I would say these two concepts are complementary. The operator pattern is a software design pattern that is about actually coding the operational expertise you have. A Helm chart is more like Homebrew for macOS; it's like a package manager for
41:03
Kubernetes. It's going to help you install all those things. A Helm chart was originally designed just for package management for Kubernetes, but if you push it to its limits, you can
41:24
actually just install an operator by using a Helm chart as well. So these concepts go hand in hand; it's not a versus, not a showdown between Helm charts and operators. Yeah. Yeah, it does. Thank you. So yeah, thank you for that talk, Gautam. I think you
41:44
just demystified Kubernetes and operators for a lot of people, including me. So I think we should have a big round of applause for you. Let's do that.