How We Gained Observability Into Our CI/CD Pipeline
Formal metadata
Title: How We Gained Observability Into Our CI/CD Pipeline
License: CC Attribution 2.0 Belgium: You may use, modify and reproduce the work in unmodified or modified form, distribute it and make it publicly available for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifier: 10.5446/61539 (DOI)
FOSDEM 2023, talk 217 of 542
Transcript: English (auto-generated)
00:08
So, I hope it will be fun enough for you to wake up at the end of the day. Very excited to be here at FOSDEM and specifically the CI/CD devroom. And today I'd like to share with you
00:21
how we gained observability into our CI/CD pipeline, and how you can too. So let's start with a day in the life of a DoD, a developer on duty, at least in my company. And it goes like that: the first thing the DoD does in the morning, at least it used to be before we did this exercise,
00:43
is going into Jenkins. We worked with Jenkins, but the takeaways, by the way, will be very applicable to any other system you work with, so nothing too specific here. Getting into Jenkins at the beginning of the morning, we're looking at
01:01
the status there, the pipelines for the last few hours over the night, and of course checking if anything is red, and most importantly if there's a red master, and whether you can finish your coffee or have to jump straight into the investigation. And to be honest, sometimes people actually forgot to go into Jenkins and check this, so that's another topic we'll maybe touch upon.
01:29
So you go in, and then, let's say you see a failure or see something red, you need to start going one by one over the different runs and start figuring out what failed, where it failed, why it failed and so on.
01:46
Importantly, you actually needed to go one by one over the different runs, and we have several runs: we have the backend, we have the app, we have smoke tests, several of these, and start getting the picture, getting the pattern across, and understanding,
02:01
across runs, across branches, what's going on. And on top of all of that, it was very difficult to compare with historical behavior, with past behavior, to understand what's an anomaly, what's the steady state for these days, and so on. Just to give you a few examples of questions that we found
02:23
difficult or time-consuming to answer, things such as: Did all runs fail on the same step? Did all runs fail for the same reason? Is it on a specific branch? Is it on a specific machine? If something's taking longer, is that normal? Is that anomalous? What's the benchmark?
02:46
These sorts of questions took us too long to answer, and we realized we needed to improve. A word about myself: my name is Dotan Horovits, I'm the principal developer advocate at a company called Logz.io.
03:04
Logz.io provides a cloud-native observability platform that's built on popular open source tools you probably know, such as Prometheus, OpenSearch, OpenTelemetry, Jaeger and others. I come from a background as a developer,
03:22
a solutions architect, even a product manager, and most importantly I'm an advocate of open source and communities. I run a podcast called OpenObservability Talks about open source, DevOps and observability, so if you're interested in these topics and you like podcasts, do check it out. I also run,
03:43
organize, co-organize several communities: the local chapter of the CNCF, the Cloud Native Computing Foundation, in Tel Aviv, Kubernetes Community Days, DevOpsDays, etc. And you can find me everywhere at @horovits, so if you tweet something interesting, feel free to tag me.
04:02
So before I get into how we improved our CI/CD pipeline, or capabilities, let's first understand what we want to improve on. Actually, I see very often that people jump into solving before really understanding the metric, the KPI, that they want to improve.
04:24
Very basically, there are four primary metrics for, let's say, DevOps performance, and you can see them there on the screen: there's deployment frequency, lead time for changes, change failure rate, and MTTR, mean time to recovery. I
04:44
don't have time to go over all of these, but they're very important. So if you're new to this and you want to read a bit more about it, I left a QR code and a short link for you at the bottom for a 101 on the DORA metrics. Do check it out, I think it's priceless.
05:02
In our case we needed to improve on the lead time for changes, sometimes called cycle time, which is the amount of time it takes a commit to get into production, which in our case was too long, too high, and was holding us back.
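As a quick aside on the metric itself: lead time for changes is simply the elapsed time from a commit to that commit running in production. A minimal sketch in Python, with made-up timestamps that are not from the talk:

```python
from datetime import datetime, timezone


def lead_time_for_changes(commit_time: datetime, deploy_time: datetime) -> float:
    """Hours from a commit landing to that commit running in production."""
    return (deploy_time - commit_time).total_seconds() / 3600.0


# Hypothetical example: a commit made Monday 09:00 UTC that reached production
# on Wednesday 15:00 UTC has a lead time of 54 hours.
print(lead_time_for_changes(
    datetime(2023, 1, 30, 9, 0, tzinfo=timezone.utc),
    datetime(2023, 2, 1, 15, 0, tzinfo=timezone.utc),
))
```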
05:21
We are experts at observability in my engineering team, you know, that's what we do for a living. So it was very clear to us that what we were missing in our case was observability into our CI/CD pipeline. To be fair with Jenkins, and there are lots of things to complain about in Jenkins,
05:42
there are some capabilities within Jenkins. You can go into a specific pipeline run, you can see the different steps, you can see how much time an individual step took; using some plugins you can also visualize the graph, and we even wired Jenkins to get alerts on Slack.
06:02
But that wasn't good enough for us, and the reason is that we wanted to find a way to monitor, aggregate and filter the information according to our own time scale, according to our own filters, obviously to see things across branches, across runs,
06:20
to compare with historical data, with our own filtering. So that's what we aimed at, and we launched this internal project with these requirements, four requirements. One, first and foremost: we need a dashboard, a dashboard with aggregated views, to be able to see the aggregated
06:40
data across pipelines, across runs, across branches, as we talked about. Secondly, we wanted to have access to historical data, to be able to compare, to understand trends, to identify patterns, anomalies and so on. Thirdly, we wanted reports and alerts, to be able to automate as much as possible. And
07:04
lastly, we wanted some ability to view flaky tests and test performance, and to be able to understand their impact on the pipeline. So those were the project requirements. And how we did that essentially takes four steps:
07:23
collect, store, visualize and report. And I'll show you exactly how it's done and what each step entails. In terms of the tech stack, we were very versed with the ELK Stack, Elasticsearch and Kibana; then we also switched over to OpenSearch and OpenSearch Dashboards after Elastic
07:44
relicensed and it was no longer open source. So that was our natural point to start our observability journey, and I'll show you how we did these four steps with this tech stack. So the first step is collect, and for that we instrumented the pipeline
08:02
to collect all the relevant information and put it in environment variables. Which information? You can see some examples here on the screen: the branch, the commit SHA,
08:21
failed step step duration build number anything essentially that you find useful for investigation later my recommendation collected and Persisted so that's the collect phase And after collect comes store and for that we created a new summary step at the end of the pipeline one
08:43
where we ran a command to collect all that information from the first step, created a JSON, and persisted it to Elasticsearch; as I mentioned, we then moved to OpenSearch. And it's important to say again, in fairness to Jenkins and for the Jenkins experts here: Jenkins does have some built-in
09:06
persistency capabilities, and we tried them out, but it wasn't good enough for us. The reason is that by default Jenkins essentially keeps all the builds and stores them on the Jenkins machine,
09:20
which burdens these machines, of course, and then you start needing to limit the number of builds and the duration, how many days, and so on and so forth. So that wasn't good enough for us. We needed more powerful access to historical data. We wanted to persist historical data under our own
09:40
control, the duration, the retention, and most importantly off of the Jenkins servers, so as not to risk overloading the critical path. So that's about store.
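As a rough sketch of what such an end-of-pipeline summary step could look like (not the team's actual script): GIT_BRANCH, GIT_COMMIT and BUILD_NUMBER are standard Jenkins environment variables, while the other variable names, the OpenSearch endpoint and the index name are placeholders.

```python
import json
import os
from datetime import datetime, timezone

import requests

OPENSEARCH_URL = "https://opensearch.internal:9200"  # placeholder endpoint
INDEX = "ci-pipelines"                               # placeholder index name

# Gather whatever the pipeline steps exported into environment variables.
summary = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "branch": os.getenv("GIT_BRANCH"),
    "commit_sha": os.getenv("GIT_COMMIT"),
    "build_number": os.getenv("BUILD_NUMBER"),
    "machine_ip": os.getenv("NODE_IP"),              # hypothetical variable
    "run_type": os.getenv("RUN_TYPE", "unknown"),    # hypothetical: scheduled / merge-to-master / ...
    "status": os.getenv("BUILD_STATUS", "UNKNOWN"),  # hypothetical variable
    "failed_step": os.getenv("FAILED_STEP"),         # hypothetical variable
    "step_durations": json.loads(os.getenv("STEP_DURATIONS_JSON", "{}")),  # hypothetical variable
}

# Persist the run summary as one document, off the Jenkins machines.
requests.post(f"{OPENSEARCH_URL}/{INDEX}/_doc", json=summary, timeout=10).raise_for_status()
```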
10:07
And after store, once we have all the data in Elasticsearch or OpenSearch, it's very easy to build Kibana dashboards or OpenSearch Dashboards and visualizations on top of that. Then comes the question: okay, so which visualizations should I build? For that, and that's a tip, take it with you: go back to the pains, go back to the questions that you found it
10:22
hard to answer, and this would be the starting point. So if you remember, before we mentioned things such as: did all runs fail on the same step? Did all runs fail for the same reason? How many failures on a specific branch, on a specific machine, and so on. These are the questions that will guide you to choose the right
10:42
visualizations for your dashboard, and I'll give you some examples here. Let's start with the top-line view: you want to understand how healthy, how stable your pipeline is, so visualize the success and failure rates. You can do that overall, in general, or in a specific time window,
11:02
on a graph, very easy to see at first glance what the health status of your pipeline is. You want to find problematic steps? Then visualize failures segmented by pipeline step; again, very easy to see the spiking step there.
11:22
You want to detect problematic build machines? Visualize failures segmented by machine. And that, by the way, saved us a lot of wasted time going and checking for bugs in the released code. When we saw such a thing, we'd just go and kill the machine, let the autoscaler spin up a new instance,
11:42
and you start clean, and in many cases that solves the problem. So, lots of time saved. In general, this aspect of code-based versus environment-based issues is definitely a challenge, I'm assuming not just for me, so I'll get back to that soon.
12:02
Another example: duration per step; again, very easy to see where the time is spent. So that's the visualize part, and after visualize comes the reporting and alerting phase. If you remember, before, the DoD, the developer on duty, needed to go manually and check
12:22
Jenkins and do the health check; now the DoD gets a start-of-day report directly in Slack. And actually, as you can see, the report already contains the link to the dashboard and even a snapshot of the dashboard embedded within the
12:41
Slack message, so that at first glance, even without going into the dashboard, you can see whether you can finish your coffee or whether there's something alarming that requires you to click that link and go start investigating. And of course, it doesn't have to be a scheduled report: you can also define triggered alerts on any of the fields, the data that we collected in the first phase, the collect phase.
13:05
And you can do any complex queries or conditions that you want; you want to do something like: if the sum of failures goes above X, or the average duration goes above Y, trigger an alert. So essentially, anything that you can formalize as a Lucene query you can automate as an alert.
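A minimal sketch of that kind of alert, reusing the hypothetical index and fields from the summary-step sketch above, with a placeholder Slack incoming-webhook URL; the team's actual alerting layer sits on top of Elasticsearch/OpenSearch, so this only illustrates the idea:

```python
import requests

OPENSEARCH_URL = "https://opensearch.internal:9200"               # placeholder
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXX"  # placeholder
FAILURE_THRESHOLD = 5                                             # the "X" from the talk

# Anything you can phrase as a Lucene query can drive an alert.
query = {
    "query": {
        "bool": {
            "must": [
                {"query_string": {"query": 'status:FAILURE AND branch:"origin/master"'}},
                {"range": {"@timestamp": {"gte": "now-24h"}}},
            ]
        }
    }
}

failures = requests.post(
    f"{OPENSEARCH_URL}/ci-pipelines/_count", json=query, timeout=10
).json()["count"]

if failures > FAILURE_THRESHOLD:
    requests.post(SLACK_WEBHOOK, timeout=10, json={
        "text": f"CI alert: {failures} failed pipeline runs on master in the last 24h"
    })
```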
13:22
That's an alerting layer that we built on top of Elasticsearch and OpenSearch for that. One last note: I'm giving the examples with Slack because that's what we use in our environment, but you're obviously not limited to Slack; you have
13:40
support for many notification endpoints, depending on your systems: PagerDuty, VictorOps, Opsgenie, MS Teams, whatever. We personally work with Slack, so the examples are with Slack. So that's how we built observability into the Jenkins pipelines. But as we all know, especially here in the CI/CD devroom,
14:03
CI/CD is much more than just Jenkins. So what else? We wanted to analyze, if you remember the original requirements, flaky tests and test performance, following the same process: collecting all the
14:22
relevant information from the test runs and storing it in Elasticsearch or OpenSearch, and then creating Kibana dashboards or OpenSearch Dashboards. And as you can see, all the relevant usual suspects that you'd expect: test duration, failed tests, flaky tests,
14:41
failure count and rate, moving averages, failed tests by branch over time, all the things that you would need in order to analyze and understand the impact of your tests and the flaky tests in your system.
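For example, the failure rate per test behind such a flaky-test view could be pulled with a simple aggregation like the one below; the index and field names are hypothetical, and the real dashboards were built in Kibana/OpenSearch Dashboards rather than in code.

```python
import requests

OPENSEARCH_URL = "https://opensearch.internal:9200"  # placeholder

# Failure rate per test over the last 7 days, assuming each test result is a
# document with a `test_name` keyword field and a numeric `failed` field (0 or 1).
agg_query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-7d"}}},
    "aggs": {
        "by_test": {
            "terms": {"field": "test_name.keyword", "size": 20},
            "aggs": {"failure_rate": {"avg": {"field": "failed"}}},
        }
    },
}

buckets = requests.post(
    f"{OPENSEARCH_URL}/ci-tests/_search", json=agg_query, timeout=10
).json()["aggregations"]["by_test"]["buckets"]

for bucket in buckets:
    # Tests that neither always pass nor always fail are the flaky-test suspects.
    print(bucket["key"], round(bucket["failure_rate"]["value"], 3))
```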
15:01
And similarly, after visualize you can also report: we created reports to Slack, following the same pattern, and we have a dedicated channel for that. One important point is about openness: once you have the data in OpenSearch or Elasticsearch, it's very easy for different teams to create different visualizations on top of that same data.
15:21
So I took another extreme: a different team that didn't like the graphs and preferred table views and counters to visualize, again very similarly, test stats and so on. And that's the beauty of it.
15:40
So just to summarize: we instrumented the Jenkins pipeline to collect relevant data and put it in environment variables; then, at the end of the pipeline, we created a JSON with all this data and persisted it to Elasticsearch/OpenSearch; then we created Kibana dashboards on top of that data; and lastly we created reports and alerts on that data. So, four steps: collect, store,
16:07
visualize and report. So that was our first step in the journey, but we didn't stop there. The next step: we asked ourselves, what can we do in order to investigate the performance of
16:23
specific pipeline runs? You have a run that takes a lot of time, you want to optimize, but where is the problem? That's actually what distributed tracing is ideal for. How many people know what distributed tracing is, with a show of hands?
16:40
Okay, I see most of us; there are a few that know, so maybe I'll say a word about that soon. Very importantly, know that Jenkins has the capability to emit trace data, spans, just like it does for logs. So it's already built in, so we decided to visualize jobs and pipeline executions as distributed traces. That was the next
17:04
step. For those who don't know: distributed tracing essentially helps pinpoint where issues occur and where latency is in production environments, in distributed systems. It's not specific to CI/CD.
17:21
If you think about a microservices architecture and a request coming in and flowing through a chain of interacting microservices, then when something goes wrong you get an error on that request, and you want to know where the error is within this chain; or if there's latency, you want to know where the latency is. That's distributed tracing in a nutshell. And the way it works
17:41
is that each step in this call chain, or, in our case, each step in the pipeline, creates and emits a span. You can think about a span as a structured log that also contains the trace ID, the start time, the duration and some other context. Then there is a backend that collects all these spans,
18:01
reconstructs the trace and then visualizes it, typically in the timeline view or Gantt chart that you can see on the right-hand side.
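Conceptually, a span is just such a structured record; the fields below are illustrative, loosely following common tracing conventions, not an exact wire format.

```python
# An illustrative span for one pipeline step; all values are made up.
example_span = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # shared by all spans of the same pipeline run
    "span_id": "00f067aa0ba902b7",
    "parent_span_id": "53995c3f42cd8ad8",            # ties the step to its parent stage
    "name": "backend-smoke-tests",
    "start_time": "2023-02-04T09:12:41Z",
    "duration_ms": 184000,
    "attributes": {"ci.branch": "master", "ci.build.number": 1234, "ci.machine.ip": "10.0.3.17"},
}
```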
18:21
So now that we understand distributed tracing, let's see how we add this type of pipeline-performance tracing into a CI/CD pipeline. Same process: the first step is collect, and for the collect step we decided to use the OpenTelemetry Collector. Who doesn't know about OpenTelemetry, who doesn't know the project, just so I know what background to give? Okay,
18:43
I have a few, so I'll say a word about that. Anyway, I added the link, you see a QR code and the link at the lower corner there, for a beginner's guide to OpenTelemetry that I wrote; I also gave a talk about OpenTelemetry at KubeCon Europe, so you'll find it useful. But very briefly:
19:04
it's an observability platform for collecting logs, metrics and traces, so it's not specific only to traces, in an open, unified, standard manner. It's an open source project under the CNCF, the Cloud Native Computing Foundation.
19:22
It's a fairly young project, but at the time the tracing piece of OpenTelemetry was already GA, generally available, so we decided to go with that. Today, by the way, metrics is also soon to be GA, it's already a release candidate, and logging is still not there.
19:41
So what do you need to do if you choose OpenTelemetry? You need to set up the OpenTelemetry Collector, which is sort of an agent to send the data to. You need to install the Jenkins OpenTelemetry plugin, very easy to do in the UI, and then you need to configure the Jenkins OpenTelemetry plugin to send to the OpenTelemetry Collector endpoint, over the OTLP over gRPC protocol. That's the collect phase.
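The Jenkins OpenTelemetry plugin emits these spans for you; purely to illustrate what that looks like, here is a hand-rolled sketch using the OpenTelemetry Python SDK, assuming a collector listening on localhost:4317.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to an OpenTelemetry Collector over OTLP/gRPC (default port 4317).
provider = TracerProvider(resource=Resource.create({"service.name": "jenkins-pipeline"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ci.pipeline")

# One span per pipeline run, with child spans per step (e.g. a Maven build step).
with tracer.start_as_current_span("pipeline-run", attributes={"ci.branch": "master"}):
    with tracer.start_as_current_span("build-backend"):
        pass  # ... run the build ...
    with tracer.start_as_current_span("smoke-tests"):
        pass  # ... run the tests ...

provider.shutdown()  # flush remaining spans before the process exits
```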
20:02
OTL P over gRPC protocol, that's the collect phase and after collect comes store For the back end we used Jaeger Jaeger is a also a very popular open source under the CNCF for specifically for distributed tracing
20:21
We use Jaeger to monitor our own production environment, so that was our natural choice for this as well. We also have a Jaeger-based service, so we just used that, but anything that I show here you can actually use with any Jaeger distro, whichever one you use, managed or self-served.
20:42
And if you do run your own, by the way, I added the link on how to deploy Jaeger on Kubernetes in production; you have a short link there that I added, a very useful guide. So what you need to do is configure the OpenTelemetry Collector to send, to export in OpenTelemetry Collector terms, to Jaeger in the right format,
21:05
all the aggregated information. And once you have that, then you can visualize. The visualize part is much easier in this case because you have the Jaeger UI with a predefined dashboard, so you don't need to start composing visuals. Essentially, what you can see here, on the
21:22
left-hand side, is this indented tree structure, and then on the right the Gantt chart; each line here is a span. It's very easy to see the pipeline sequence. The text is a bit small, but for each step of the pipeline you can see the duration, how much it took, and you can see which ones ran in parallel and which ones ran sequentially. If you have a very long latency
21:48
overall, you can see where most of the time is being spent, where the critical path is, where you'd best optimize, and so on. And by the way, Jaeger also offers other views, like
22:00
the recently added flame graph, and you have trace statistics and a graph view and so on, but this is what people are used to, so I'm showing the timeline view. So that's Jaeger. And of course, as we said before, CI/CD is more than just Jenkins, so what can we do beyond just Jenkins?
22:21
What you can do is actually instrument additional pieces, like Maven, Ansible and other elements, to get finer granularity into your traces and steps. So, for example, here the things that you see in yellow are Maven build steps. What before used to be one black-box span in the trace, suddenly
22:42
you can now click open and see the different build steps, each one with its own duration, each one with its own context, and so on. So that's, in a nutshell, how we added tracing to our CI/CD pipeline. The next step: as I mentioned before, many of the pipelines actually failed not because of the released code
23:03
but because of the CI/CD environment. So we decided to monitor metrics from the Jenkins servers and the environment; that goes for the system, the containers, the JVM, essentially anything that could break irrespective of the released code, following the same flow. So, the first step, collect: we
23:24
used Telegraf; we use that in production, so we used it here as well. That's an open source tool by InfluxData. And essentially you need two steps: you first need to enable, configure sorry, Jenkins to expose metrics in Prometheus format (we work a lot with Prometheus for metrics, so that was our natural choice),
23:48
and that's a simple configuration in the Jenkins web UI. And then you need to install Telegraf, if you don't already have it, and configure it to scrape the metrics off of the Jenkins server using the Prometheus
24:03
input plugin. Okay, so that's the first step. The second step is on the store side: as I mentioned, we use Prometheus for metrics, so we used that here as well (we even have our own managed Prometheus, so we used that), but anything that I show here is
24:20
identical whether you use Prometheus or any Prometheus-compatible backend. Essentially, you need to configure Telegraf to send the metrics to Prometheus, and you have two ways to do that: pull mode or push mode. Pull mode is the default for Prometheus: essentially, you configure Telegraf to expose a
24:42
/metrics endpoint, which can then be exposed for Prometheus to scrape from; if you want to do that, you use the Prometheus client output plugin. Or, if you want to do it in push mode, then you use the HTTP output plugin; just an important note there, make sure that you set the data format to Prometheus remote write.
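The Telegraf side is plain configuration (its Prometheus input and output plugins), so there is nothing to code there; purely to illustrate the pull model being described, here is a minimal Python process exposing Prometheus-format metrics on /metrics for a scraper such as Prometheus or Telegraf's prometheus input to pull. The metric names are made up.

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical CI metrics; Jenkins' own Prometheus endpoint exposes the real ones.
failed_builds = Counter("ci_failed_builds_total", "Failed builds", ["branch"])
last_build_seconds = Gauge("ci_last_build_duration_seconds", "Duration of the last build", ["branch"])

if __name__ == "__main__":
    start_http_server(9188)  # serves /metrics for the scraper to pull
    while True:
        # In reality these would be updated by the CI system, not random numbers.
        last_build_seconds.labels(branch="master").set(random.uniform(300, 900))
        if random.random() < 0.1:
            failed_builds.labels(branch="master").inc()
        time.sleep(15)
```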
25:00
So that's the store phase. And then, once you have all the data in Prometheus, it's very easy to create Grafana dashboards on top of that, and I gave some examples here: you can filter, of course, by build type, by branch, machine ID, build number and so on. And
25:20
you can monitor, in this example it's system monitoring, so CPU, memory, disk usage, load and so on. You can monitor the Docker containers: the CPU, inbound/outbound IO, disk usage, and obviously the running, stopped and paused containers by Jenkins machine, everything that you'd expect. And
25:44
JVM metrics, Jenkins being a Java implementation: thread count, heap memory, garbage collection duration, things like that. You can even, of course, monitor the Jenkins nodes, queues and executors themselves. So again, you have an example dashboard here: you can see the queue size, the breakdown of the Jenkins jobs, the count executed over time,
26:06
broken down by job status, and so on. So these are the types; there are obviously lots of other visualizations you can create, and you can also create alerts, but I won't show that for lack of time. So, just to summarize what we've seen:
26:23
treat your CI/CD the same as you treat your production. For your production you use whatever, Elasticsearch, OpenSearch, Grafana, to monitor, to create observability; do the same with your CI/CD pipeline, and preferably leverage the same stack, the same
26:41
toolchain for that, and don't reinvent the wheel. That was our journey. As I mentioned, we wanted dashboards and aggregated views to see across pipelines, across different runs and branches, over time and so on. We wanted historical data and controlled persistence, off of the Jenkins servers, to determine the duration, the retention of that data.
27:05
We wanted reports and alerts, to automate as much as possible. And lastly, we wanted test performance, flaky tests and so on; you saw how we achieved that. Four steps; if there's one thing to take out of this talk, take this one: collect, store,
27:21
visualize, and report and alert. And what we gained, just to summarize: a significant improvement in our lead time for changes, in our cycle time, if you remember the DORA metrics from the beginning. Along the way we also got an improved developer-on-duty experience,
27:42
much less suffering there. It's based on open source, very important, we're here at FOSDEM, so: based on OpenSearch, OpenTelemetry, Jaeger, Prometheus, Telegraf. You saw the stack. If you want more information, you have here a QR code for a guide to CI/CD observability that I wrote, so you're welcome to take the
28:03
Bitly short link and read more about this. But this was very much it, in a nutshell. Thank you very much for listening, I'm Dotan Horovits, and enjoy the rest of the conference. I don't know if we have time for questions.
28:20
No? So I'm here if you have questions, or if you want a sticker, and may the open source be with you. Thank you. We do have time for questions if there are any; we can take a few minutes. Is that a question?
28:46
Thanks. So, have you considered persistence, how long do you store your metrics and your traces? Have you wondered about that, like for how long at a time you store your metrics? So, that was part of the original challenge when we used the Jenkins persistence, because when you persist it on the nodes
29:04
themselves, you're obviously very limited; there's the plugin that you can configure per days or per number of builds and so on. When you do it off of the critical path, you have much more room to maneuver. And then it depends on the amount of data you collect: we started small, so we collected it for longer periods.
29:23
The more it went on, the more the appetite grew, and people wanted more and more types of metrics and time series data, so we needed to be a bit more conservative. But it's very much dependent on, you know, your practices in terms of the data. And the other question was more about the process: so it's iterative, as you explained it?
29:44
Iterative is the best, because it really depends: you need to learn the patterns of your data consumption, the telemetry, and then you can optimize the balance between having the observability and not overloading or overpaying in cost. Right, thank you, very interesting. Thank you. There was another question in the back. Yeah.
30:00
Thank you. So, what was the most surprising insight that you've learned, good or bad, and how did you react? I think I was personally most surprised by the amount of failures that occur because of the environment, and what kinds of things, and how simple it is to just kill the machine, kill the instance, let the autoscaler spin it back up, and you save yourself a lot of hassle and a lot of waking people up at night.
30:25
So that was astonishing, how many things are irrespective of the code and just environmental. And we took a lot of learnings out of there, to make the environment more robust, to get people to clean up after themselves, to automate the cleanups and things like that. That, for me, was insightful. Thank you.
30:41
Any other questions? Then I have one last one, sorry. My question is: who are usually the people looking at the dashboard? Because I maintained a lot of dashboards in the past, and sometimes I had the feeling that I was the only one looking at those at work. So I'm just wondering if you identified the type of people that really benefit from those dashboards.
31:00
So it's a very interesting question, because we also learned, and we changed the org structure several times, so it moved between dev and DevOps. We now have a release engineering team, so they are the main stakeholders looking at that. But this dashboard is the go-to, as I said, for the developer on duty, so everyone that is now on call needs to see that, that's for sure. And
31:25
there's the tier 2, tier 3, let's say the chain for that. It's also used at a high level by the team leads on the developer side of things. So these are the main stakeholders, and it depends on whether it's the critical path of the developer on duty and the tiers, or if it's
31:41
the overall thing, the health state in general, by the release engineers. Thank you very much, everyone.