Best Practices for Operators Monitoring and Observability in Operator SDK
Formal Metadata
Number of Parts: 542
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/61918 (DOI)
FOSDEM 2023 (417 / 542)
Transcript: English (auto-generated)
00:05
Hi everyone, and welcome to our talk about operator monitoring and how to do it correctly. My name is Shirley, and I work at Red Hat. I'm Jean Villasa, and I also work at Red Hat, for about a year and a half now.
00:25
So today we're going to talk about observability for Kubernetes operators. We're going to talk about when to start, the maturity levels of metrics,
00:42
why we want to monitor, what we want to monitor, and the best practices and code examples that we created for it. So, when should we start to think about observability for operators?
01:06
You can see here in the chart the lifecycle of creating an operator, which starts at basic installation; the most mature step is autopilot. So when do you think we should start thinking about observability for a new operator?
01:27
Anyone? When? From the start. That's correct. Usually it's the Deep Insights level that talks about metrics and alerts,
01:45
which is being able to monitor your operator fully, and people think maybe that's when we should start thinking about it. Maybe that's the case, but you should pretty much start at the beginning,
02:02
because the metrics that you are adding first are usually not the metrics that are for your users. They are internal. There are a few steps for the maturity of metrics. The first step is initial. You start with your operator. You want to understand how it works, if it works correctly,
02:24
so the developers start to add ad hoc metrics. I've been working for a few years on an operator at Red Hat called KubeVirt, and when I joined the project, it was already at the Full Lifecycle phase,
02:47
and when I joined, already a lot of metrics were implemented in this operator. The problem was that the developers that added the metrics didn't follow best practices,
03:02
and for a lot of the metrics, it was hard to understand which ones were ours. It's important to understand that your operator is not the only one inside the Kubernetes cluster, so when a user or even other developers want to understand which metrics your operator is exposing,
03:25
it should be easy for them to identify your metrics. So the first step, as I said, is initial. The second step is basic monitoring. You start adding your monitoring, and you're starting to think about your users,
03:41
what they want to understand about your operator. The third step is having a process for implementing new metrics, with a focus on the health and performance of your operator. And the last step is autopilot,
04:01
taking those metrics and doing smart actions with them in order to do things like auto-healing and auto-scaling for your operator, and this is the phase that we are actually at in our operator. So, as Shirley said, when we first start, we look very much at the internal metrics for the operators themselves,
04:28
so at this point, we might start, for example, looking at the health of the operator. For example, can it connect to the Kubernetes API, or if it's using external resources, can it connect to those providers' API?
04:42
Is it experiencing any errors? We can also start by looking at, for example, its behavior. How often is the operator reconciling? What actions is the operator performing? This is the kind of stuff that, as we are developing, we are very interested in.
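As a rough sketch of what one of these early internal metrics could look like in a Go operator built with controller-runtime (the metric name and labels are illustrative, not taken from the talk):

```go
package monitoring

import (
	"github.com/prometheus/client_golang/prometheus"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// reconcileActions counts reconcile outcomes, so we can see how often the
// operator reconciles and whether it is hitting errors.
var reconcileActions = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myoperator_reconcile_actions_total", // hypothetical metric name
		Help: "Number of reconcile loops executed, partitioned by outcome.",
	},
	[]string{"outcome"}, // e.g. "success" or "error"
)

func init() {
	// controller-runtime serves this registry on the manager's /metrics endpoint.
	crmetrics.Registry.MustRegister(reconcileActions)
}

// IncReconcile is meant to be called at the end of the Reconcile function.
func IncReconcile(outcome string) {
	reconcileActions.WithLabelValues(outcome).Inc()
}
```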
05:08
But we should start, as Shirley said, thinking ahead about having good standards, because later we will not only be tracking these; there could also be resource metrics, for example. And then, why operator observability, and what are the steps that we'll be taking?
05:25
So starting from the performance and health, here we want to detect the issues that come up early, try to obviously reduce both operator and application downtime, and try to detect some regressions that might happen.
05:42
Also, we can start looking at, for example, planning and billing: to improve planning, to improve profitability, or to bill users. At this point, we start looking more at infrastructure metrics as well. For example, we want to track resource utilization.
06:03
This might be CPU, memory, or disk, and we can also start looking at the health of the infrastructure itself, maybe hardware failures, or trying to detect some network issues. Then we also start using these metrics to create alerts,
06:22
to send notifications about the problems that come up as early as possible, because we obviously want to take appropriate action and not let them go unnoticed. And after this, we go into more detail about metrics. Maybe we start looking at application metrics. What's the availability of our application?
06:42
What's the uptime? What are the error rates? And also its behavior: what types of requests is the application receiving? What types of responses is it sending? It's important to monitor all of these things. And when we start building up all this information, then at a certain point in time, as Shirley said,
07:02
we'll be able to give new life to the operator with autopilot capabilities, such as auto-scaling and auto-healing, because at this point, if we did everything correctly, we'll be able to know almost every state that we are in.
07:24
And we also start looking at functionality metrics. Are we providing the expected functionality to users? For example, checking that application features are working correctly. We want to see if there are any performance or reliability issues by checking service levels
07:43
and that everything is working in the expected way, by checking response errors and the data that is returned. Thank you. Okay, so I hope you are convinced that observability is important.
08:00
If you are in this room, I guess you are. And for the past two years, we've been working on observability on our operator. What's important to understand is that our operator is considered complex. It has a few sub-operators that it's managing, and each sub-operator has its own dedicated team that is maintaining it.
08:28
And watching those teams implement observability, each team separately, gave us a higher-level view of the pitfalls
08:44
that they all share when implementing monitoring. So we decided to contribute our knowledge of how to do this correctly, so that others do not fall into the same pitfalls as us.
09:03
So we decided to create best practices and to share with the community our findings. We hope to shorten the onboarding time for others and to create better documentation and to create reusable code for others to be able to use and save time and money, of course.
09:28
So we reached out to the operator framework SDK team to collaborate with them and to publish there our best practices. As you can see here, this is the operator observability best practices.
09:47
The operator SDK itself is the first step when someone wants to create a new operator. It gives them tools to create it easily, to build, test, and package it, and it provides best practices for all steps of the operator life cycle.
10:05
So we found that this was the best place for others to also go for monitoring. And in these best practices, I will now share with you a few examples. It may sound simple, but simple things have a big impact
10:22
both on the users that are using the system and on the developers that are trying to work with the metrics. So, for example, a naming convention for metrics. One of the things that is mentioned in the document is having a name prefix for your metrics.
10:43
This is a very simple action that will help the developers and the users identify that the metrics are coming from a specific operator or company. In this case, you can see that all of the metrics here have a kubevirt prefix.
11:01
KubeVirt, as I said, has sub-operators. So under this prefix, we also have a sub-prefix for each individual operator: CDI, network, and so on. And this is another example which does not have this prefix.
11:24
We can see here a container_cpu prefix, for example, but we can't understand where it's coming from. In this case, it's cAdvisor. But if you're a user and you're trying to understand where this metric came from, it's very hard. And also, you cannot search in Grafana, for example,
11:43
for all of the cAdvisor metrics together. So that's a problem. Another thing that is mentioned in the best practices is help text. Each metric has a dedicated place to add the help text for this metric.
12:04
And as you can see in Grafana and in other visualization tools, the user will be able to see when hovering on the metrics the description of it. It's very important because if not, you need to go somewhere else to search for it. Also, this gives you the ability to create auto-generated documentation
12:24
for all of your metrics on your site. Another example is base units. Prometheus recommends using base units for metrics. For example, you can see here: for time, use seconds, not milliseconds.
12:47
For temperature, Celsius, not Fahrenheit. This gives the users a fluent experience: when they are using the metrics, they don't need to do conversions or transformations of the data.
13:02
And Prometheus says that if you need millisecond precision, use a floating-point number of seconds. This removes the concern about the magnitude of the number; Grafana can handle it and will still show it with the same precision, and the consistency in the UI and in how the metrics are used stays the same.
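Putting these conventions together, a metric definition might look roughly like this in Go with client_golang; the operator prefix and metric name here are invented for illustration:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// The "myoperator_" prefix makes it obvious which operator exposes the metric,
// the Help text is what Grafana shows on hover and what documentation can be
// generated from, and the value is reported in seconds, the Prometheus base
// unit, rather than milliseconds.
var migrationDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "myoperator_migration_duration_seconds", // hypothetical metric name
	Help:    "Time taken for a virtual machine migration to complete, in seconds.",
	Buckets: prometheus.ExponentialBuckets(1, 2, 10), // 1s up to ~512s
})
```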
13:26
Here you can see an example of metrics that are using seconds, and here we see that etcd is not using them. So this is not as recommended, and we would actually recommend switching it,
13:43
but they started with milliseconds, and making the change now would cause issues with the UI that is based on them. So it's a problem to change the names of metrics once they are created. When I joined the operator, we didn't have name prefixes.
14:04
I tried to understand which metrics were ours and which were not, and it was very hard. So we needed to go and make breaking changes to the metrics, add those prefixes, and change the units, and this is the duplicated work that we want others to be able to avoid.
14:24
Additional information in the best practices is about alerts. This is an example of an alert. You can see here that we have the alert name. We have an expression which is based on a metric, and once the expression is met, the alert either starts firing
14:44
or is in a pending state until the evaluation time passes. There is a description, and there is also the possibility to add a summary. This is the evaluation time. It has a severity and a link to a runbook URL.
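As a sketch of how such a rule could be defined from Go using the prometheus-operator API types (the alert name, expression, and runbook URL are invented, and the exact type of the For field varies between prometheus-operator versions):

```go
package alerts

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// operatorDownRule builds a single alerting rule that would be placed inside a
// PrometheusRule object created by the operator.
func operatorDownRule() monitoringv1.Rule {
	// The "for" duration: the alert stays pending until this much time has passed.
	// Note: in older prometheus-operator releases this field is a plain string.
	evaluationTime := monitoringv1.Duration("5m")

	return monitoringv1.Rule{
		Alert: "MyOperatorDown", // hypothetical alert name
		Expr:  intstr.FromString(`up{job="my-operator"} == 0`),
		For:   &evaluationTime,
		Labels: map[string]string{
			"severity": "critical", // only critical, warning, or info
		},
		Annotations: map[string]string{
			"summary":     "The operator is not running.",
			"description": "No my-operator pod has reported metrics for 5 minutes.",
			"runbook_url": "https://example.com/runbooks/MyOperatorDown", // hypothetical URL
		},
	}
}
```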
15:01
There could be other information that you can add to it, but this is the basic structure. And what we're saying in the best practices is, for example, that for the severity label there should only be three valid values: critical, warning, and info.
15:23
If you're using something else, it would be problematic. You can see here in this example the alerts in a cluster: we have info, warning, and critical, and we have one alert without a severity, which is the Watchdog.
15:41
It's part of the Prometheus alerts; it just makes sure that the alerting pipeline is working as expected. That count should always stay at one: there should never be other alerts that don't have a severity. And this is a bad example of using a severity label. In this case, they are using major instead of critical.
16:02
The impact of that is that if someone sets up Alertmanager to notify the support team that something critical happened to the system, and they want to get notified by Slack or by a pager, they will miss out on this alert
16:21
because it doesn't match the convention of valid severity values. So, what we have at the moment in the best practices: we have a metrics naming convention, we have how to create documentation for metrics,
16:41
alerts, information about alert labels, runbooks. By the way, runbooks are a way to provide more information about the alert. You have a link in the alert where you can send the user to go and find more details. What's it about? What's the impact?
17:01
How to diagnose it, and how to mitigate the issue. Then there is additional information about how to test metrics and how to test alerts. We plan to enrich this information and add information about dashboards, logging, events, and tracing in the future.
17:22
So Shirley gave a high-level overview of metrics and alerts. But how do we translate some of these best practices into code? One of the problems that we faced was that logic code and monitoring code were becoming very intertwined.
17:40
Code like this becomes harder to maintain. It becomes more difficult to understand what the code does and to modify it. This leads to longer development times and potential bugs, and it's also more challenging to onboard new team members or to contribute to one of these projects.
18:03
In this specific snippet, about 16.4% of the code was monitoring code intertwined with the migration logic code. So what we did was refactor this code to separate these concerns from one another.
18:22
In this specific case, we used a Prometheus collector that just iterates over the existing virtual machine migrations and then pushes the metrics according to the status of the migrations, whether they were successful or not,
18:41
or the counts of the pending, scheduling, and running migrations. Obviously, with this snippet it is much easier to understand how the monitoring is being done, and we take all of it out of the migration logic.
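A minimal sketch of what such a collector could look like, assuming a hypothetical lister that returns the current migrations and their phases (the type names, phases, and metric name are illustrative, not KubeVirt's actual ones):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Migration is a stand-in for the real migration object; only the phase matters here.
type Migration struct {
	Phase string // e.g. "Pending", "Scheduling", "Running", "Succeeded", "Failed"
}

// MigrationLister abstracts however the operator looks up current migrations
// (in a real operator this would be backed by an informer cache).
type MigrationLister interface {
	ListMigrations() []Migration
}

var migrationsByPhaseDesc = prometheus.NewDesc(
	"myoperator_migrations_by_phase", // hypothetical metric name
	"Number of virtual machine migrations, partitioned by phase.",
	[]string{"phase"}, nil,
)

type migrationCollector struct {
	lister MigrationLister
}

// Describe and Collect implement prometheus.Collector, so all monitoring logic
// lives here instead of inside the migration controller code.
func (c migrationCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- migrationsByPhaseDesc
}

func (c migrationCollector) Collect(ch chan<- prometheus.Metric) {
	counts := map[string]int{}
	for _, m := range c.lister.ListMigrations() {
		counts[m.Phase]++
	}
	for phase, n := range counts {
		ch <- prometheus.MustNewConstMetric(
			migrationsByPhaseDesc, prometheus.GaugeValue, float64(n), phase)
	}
}

// RegisterMigrationCollector wires the collector into the default registry.
func RegisterMigrationCollector(l MigrationLister) {
	prometheus.MustRegister(migrationCollector{lister: l})
}
```

Registering the collector once at startup means the migration controller itself never has to touch Prometheus.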
19:01
And to help other developers who are starting out avoid the same mistakes that we had to solve, we are creating a monitoring example in a memcached operator. We already have an initial example that takes all these concerns into account, with a separation between logic code and monitoring code.
19:24
Our idea with this example is to make it as clear as possible, which is especially important when we are working with large and complex code bases, and also to make it more modular. It's easier to understand both the logic code and the monitoring code
19:44
without affecting each other's functionality or the application in general. Also, to make it more reusable: the way we do monitoring in different operators will always be more or less the same, so if we find a common way to do this,
20:03
it will make it easier to use this code in other applications and projects, which will save them time and effort. And also, it will become more performant. If we mix all the monitoring concerns with the migration code,
20:24
it's clear that a migration will take longer, because we are calculating metric values and doing Prometheus operations while we are trying to compute the state of a migration. So having this separation also helps here.
20:46
Our idea for the structure of the code is to create a package. For example, here we can see a migrations example, with a central place where we will be registering all metrics,
21:03
and then we will have files that separate these metrics by their type. In this example, we can see one operator metrics file, which will have all the operator-related metrics,
21:23
as we talked about in the beginning, and then we could have one specific file only for the migration metrics, and register them all in one place.
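A sketch of that layout, collapsed here into a single snippet with comments marking where each intended file would start (the file names and metric names are illustrative):

```go
// pkg/monitoring/metrics/metrics.go - the single place where everything is registered.
package metrics

import "github.com/prometheus/client_golang/prometheus"

// SetupMetrics registers every metric group with the registry used by the operator.
func SetupMetrics() {
	prometheus.MustRegister(operatorMetrics...)
	prometheus.MustRegister(migrationMetrics...)
}

// operator_metrics.go - metrics about the operator itself.
var operatorMetrics = []prometheus.Collector{
	reconcileCount,
}

var reconcileCount = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "myoperator_reconcile_total", // hypothetical name
	Help: "Number of reconcile loops executed by the operator.",
})

// migration_metrics.go - metrics about the workloads the operator manages.
var migrationMetrics = []prometheus.Collector{
	migrationsInFlight,
}

var migrationsInFlight = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "myoperator_migrations_in_flight", // hypothetical name
	Help: "Number of virtual machine migrations currently in progress.",
})
```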
21:41
So why do we think about this structure, and what benefits could it bring us? The first one is to automate the metric and alert code generation. As we saw, much of the work that the developer needs to do amounts to creating a file with a specific name, then going to the metrics.go file and registering that file there.
22:02
So this is very structured and always the same; it will be easier to automate, and then developers can have a command-line tool to generate new metrics and new alerts more easily. We are also looking forward to creating a linter for metric names.
22:23
As Shirley said, one of the concerns that comes up when operators become more advanced is looking back at the metrics and seeing everything we did wrong with their naming, and even though it's a simple change, it can have a lot of impact,
22:41
so a linter that enforces all these conventions would also be important. Also, automating the metric documentation. We are already doing this, and one thing we faced was that a lot of metrics were scattered across the code, so it was not easy to automatically find all of them,
23:03
and with a structure like the previous one, it will be even easier to create a full list of metrics and their descriptions, which will help developers, newcomers, and users alike. And lastly, to have an easier structure for both unit and end-to-end testing,
23:24
because if we have this clear structure for where the metrics are, we can add tests there and test exactly those functions, without calling code that is intertwined with the logic code.
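For instance, with the metrics isolated in their own package, a unit test can exercise them directly with client_golang's testutil helpers, reusing the hypothetical reconcile counter from the sketch above:

```go
package metrics

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestReconcileCounter(t *testing.T) {
	// Increment the counter exactly as the controller would, then read it back
	// directly, without running any reconcile logic at all.
	before := testutil.ToFloat64(reconcileCount)
	reconcileCount.Inc()

	if got := testutil.ToFloat64(reconcileCount); got != before+1 {
		t.Errorf("expected counter to increase by 1, got %v then %v", before, got)
	}
}
```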
23:41
And thank you. And just to conclude, if you are starting to create an operator, or if you already have an operator, we invite you to go and look at the operator SDK and the best practices, to try to avoid the pitfalls that we had. I really hope it will help you, and you should really consider that when you're creating a new operator,
24:04
it starts small, but it can become really robust, and you cannot tell that in the beginning, so think ahead and try to build it correctly from the beginning. I hope it will be helpful for you, and thank you. Thank you.
24:33
Thank you for your talk. Do you have any recommendations on how you would log the decision points within your operator,
24:41
so if you wanted to retrospectively see why it has done certain things? I'm not sure I understand. Recommendations on? Like the decision points, how it's decided which Kubernetes API calls to make. If your operator did something crazy
25:02
and you wanted to look back and see why it did that, is there anything you would do in advance with the logging? I think the summary of what we've learned is in these documents, because, for example, as I said,
25:20
the developers that started this project didn't have anywhere to go for best practices on how to name a metric, so they just named it how they felt. They did follow the Prometheus recommendations, but having an operator prefix has a big impact for the users,
25:43
and not only the users. When we are trying to understand how to use internal metrics for our own purposes, we also struggle to understand where a metric came from and where the code for it is. So the summary of what we've learned is in those documents,
26:01
and we plan to enrich it even further. Yeah, thank you for your talk. It was very interesting. You mentioned code generation for the metrics package. My question is, do you plan on adding that to kubebuilder and the operator SDK?
26:25
Yeah, basically we are working on the operator SDK right now because we want to build all these tools, and we are thinking about them, but obviously this needs a lot of help from the community. And I am saying this because I'll add a personal note and an idea here,
26:46
because the way I see it is like on kubebuilder and on operator SDK, being able to just go there, and you say that you want to generate a project with monitoring, and it creates the monitoring package, or if the operator already exists,
27:02
you have a command to generate the monitoring package. And then, just as on kubebuilder you use a command to create an API or a controller, you'd have a similar command to create a new metric, where you pass the type of the metric and the help text, and the same for alerts. At least that's the way I see it, and for me it makes sense.
27:24
I agree. Thank you. Thank you for your talk. How many of the conventions that you talked about align with the OpenTelemetry semantic conventions?
27:43
How many are what? Most of them are aligned with OpenTelemetry, actually, but these are specific to operators. That's the idea. The idea is that you have a central place where you can get the information. And by the way, if someone is creating a new operator and has insights,
28:02
we encourage them to contribute to the documentation, to teach others and share the information. Basically, I think we align with the OpenTelemetry conventions, but we add more that is specific to operators.
28:29
I think that's it. Thank you. Thank you.