
Our road to a k8s/GKE based Closed Build Environment


Formal Metadata

Title
Our road to a k8s/GKE based Closed Build Environment
Subtitle
A small journey to an autoscaling build env based on Jenkins.
Number of Parts
490
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
My team built a new Closed Build Environment for building release packages on Google Cloud Platform (GCP) with Google Kubernetes Engine (GKE). I'd like to take you on a small journey through a variety of topics we came across (open to change): how we bootstrap and how we use ArgoCD; autoscaling to 100 build nodes for building; why we are using the Prometheus Operator; SRE, or how we maintain our stack; the product aspect; base image building and scanning; network setup with a Shared VPC; Google Cloud Platform IAM permissions vs. RBAC; specific GKE features like Workload Identity; and others. A simple real-life example of how my team is doing it. Looking forward to inspiring and to getting feedback from others!
Transcript: English (auto-generated)
So, welcome to my talk. Can you hear me at the back? Yes, great. I'm talking today about our road to a Kubernetes-based closed build environment. My name is Sigi. I'm a cloud development architect at SAP Customer Experience, and I'm based in Munich.
Now, quickly, the agenda. I will talk about the technology itself, and I will also talk about soft stuff like the product aspect, the workflows we introduced in our team, why we built it this way, and the learnings at the end. Just to give you a little bit of context up front: we are a distributed team, Munich and Montreal. We are seven team members, two of them working students, and we have been building this system for roughly a year. There are a few details I will not go over because they are company-specific networking and security stuff, but in general it's very generic, and I think you could experience the same things if you started doing Kubernetes tomorrow. When I say closed build environment: we are building a software package which goes out to the customer, so it has to be tamper-proof; it's not allowed to be changed while we are building it. It also has to be auditable, so we have to know how we built the package which the customer is running on their infrastructure, on on-premise systems. There are certain details there, and you will see them popping up later on. We use managed Kubernetes. For us it's a small team; for some, seven people is probably a big team. I have never before been fortunate enough to have seven people to build a CI/CD system with; I was always the only guy doing that. But still, we would not be able to do this without a managed Kubernetes cluster; it would be too much work. Now, to give you an idea of what we do: Sandboard is an open-source tool which we wrote, and it basically orchestrates Jenkins environments. We build a Jenkins environment with a Chef cookbook, and there are seed jobs: you spin up a Jenkins, and the seed job creates all the build pipelines you want to have. It configures security, it configures plugins and all that stuff. And we provide this to our teams. There are roughly 20 to 30 teams, and all of those teams have product aspects, so they are part of the product we are building, and they all get Jenkins instances. Now, in summer 2018 there was a little problem based on some management decisions.
Our old data center was being terminated. That was basically not my problem, but now, suddenly, we had to move everything over. And we did not want to just move stuff over; we decided to think about what we actually want to do, which is a great situation to be in. It's basically greenfield. With the experience we had before, we knew one thing: if you have 20 or 30 teams, they do not care too much about a Jenkins. They do not care about the seed jobs or about updating the Jenkinses, and at the end of the day we have to do that. So we said: we still want to provide a Jenkins, but instead of just a Jenkins, we provide the pipeline as a service. My team is only here to provide the Jenkins instances, and a second team, which I will not talk about today, is actually building the pipeline as a service on our infrastructure. So when I say customer, I mean our customer is the other team which is building the pipelines, and they have the customers who are actually using the pipelines. Our requirements are really simple. We have to support up to 120 build nodes, which can peak at up to 480 virtual CPUs and 1.6 terabytes of RAM, and it will roughly double this year. It has to be cost-effective; big company or not, you don't have an unlimited budget, so auto-scaling was a must for us. We actually did not auto-scale before: if you own your own data center, there are 100 machines just running, and it was never a problem. It has to be secure and maintainable, because we have to do the work, right? It has to be reliable. And in the end, it has to have a customer-facing UI. The customer-facing UIs are a real problem: if you look at Tekton and other modern stuff, you have to build a lot around it so customers can actually use it. And that's the reason why we still use Jenkins. We have Jenkins and we still use it because we looked at a lot of the current options, and Jenkins, with the Job DSL and a lot of stuff around it like unit tests, graphs, and access management, provides so much that we cannot just migrate away from it.
It's planned to have a look in the future at what's going on and how the landscape is changing, but it's changing quite a lot. So, Kubernetes to the rescue. We already had experience with Mesos and with Kubernetes 1.6, which is not that old, but there are huge differences between 1.6 and 1.12, 1.13, or 1.15 or 1.17 right now. We have experience with Chef and Ansible. Kubernetes supports auto-scaling on the platform we chose, which is Google; you can probably get auto-scaling on all the other cloud providers as well. It handles certificates, which were, unfortunately, also a little problem for our team: sometimes they just expire. And the certificates are for our internal services; we cannot just use Let's Encrypt. For us, it's also more secure than VMs. You might know the odd snowflake VM sitting around somewhere that hasn't been touched for a year or two; we know that problem too. In Kubernetes you have a node, it's super lightweight, there's nothing on it, it auto-updates. You don't care about the underlying operating system; you just assume that it's there, and it is there. That's a personal opinion. The last point is that it's probably the future. It is also a change in mindset: we switch from imperative to declarative thinking. Instead of hacking away, creating your virtual machine, installing software and configuring it, you now have to describe it up front, and then someone else takes care of creating that infrastructure for you. This is huge, because in our environment you don't start by doing; you start by writing. So everything is basically infrastructure as code from the start. And when you ever have the problem that something is not running, and you just delete a whole namespace with your complete monitoring setup, press a button, and five minutes later the whole setup is back again, you will love it. It also includes all the dashboards and everything we have. It's all based on code, and you don't need to back up anything. The only things you actually back up are the artifacts; you put them on some cloud storage, and then magic happens and they take care of your backups.
A quick jump into Kubernetes itself. It's still quite young: in July 2015 we had version 1.1, and the first KubeCon was only in November 2015. Currently we are already at 1.17, and it has quarterly releases. What this means for you if you work with it daily, and this is something we have no trouble with but something we are aware of: this ecosystem is really young. Everything changes. You know your open bugs because you just run into them, you see them getting fixed, and two months later a new release is out, new features pop up, and the tools you're using suddenly have enterprise features because a lot of people started using them and requested those features. I was a Java developer for probably 10 years or something like that. Java is great and really stable, but I've never had an ecosystem change so quickly. You have to keep up; you have to put the effort in. Don't run an old Kubernetes cluster.
If you still do, and you don't have time for it, you're creating a zombie machine and it will be a horrible migration nightmare later on; I can guarantee you that already. Now, we started greenfield and we have Kubernetes. So instead of being just an infra team that throws something out, and customers come to us and say, hey, can you do this, can you do that, we said no. We see our product as a product. Product means it's customer-focused. We don't just dump something on our customers; we think about how they can log in, how they can use our tools, which UI they can use. Documentation is a must. It has to be clear and understandable; if it's not clear and understandable, someone creates a ticket, we have to do something, and I personally don't like that very much. So I'm trying to give everyone the chance to do things by themselves, but in a way that they like. It really takes effort to make people like your infrastructure. A friend of mine told me that a work colleague came to him and said, hey, I really love what you're doing, I want to have something very similar. We also define our boundaries: what the customer has to do and what our responsibility is. So if the customer comes to us and says, we need this feature, we will look at our roadmap and check whether it fits. Of course we are building it for our customers, but nonetheless we don't stop our own work just to provide something on demand. Maintainability, quality and security: we enforce security by default. We are responsible for security; it's not a discussion we have with each customer. I will come to this later on. We also have a maintenance window defined already, so our customers know that if they use the service, there will be a maintenance window. It's not that we always have to use it, but it's enforced; it's our responsibility. The actual entry point for our customers is a wiki page. Everything is defined up there: the weekly maintenance window, all the documentation and links. It's all internally public and everyone can browse it; it's very transparent. We give our customers this wiki page, and from there they can get to understand our product. We are still trying to optimize it; it's a constant work in progress. Now, what follows is from the very beginning, when we started developing. It's perhaps a little bit awkward,
but stay with me. One of the first things we built was a namespace inspector. This is a tool which runs in our Kubernetes cluster and deletes everything which is not whitelisted. So if you create a namespace called "I want to test something", one hour later it's gone. We do the same thing with the default namespace: if you try to do something in the default namespace, it's gone. If you want to test something, you can use dev or test. Dev is deleted after one week; test is deleted after two weeks. This creates the general idea that as a developer you don't start by playing around; you start by coding something, because it could happen that it's deleted within a week, and with a weekend in between, one week is not that long and it's gone. I think that saved us a lot of headaches about where stuff is coming from, because nobody can leave things lying around; we actually enforce it on ourselves.
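The talk doesn't show the inspector's implementation, but the idea can be sketched as a simple CronJob that compares existing namespaces against a whitelist and deletes the rest. The names, schedule and whitelist below are assumptions; the real tool additionally cleans up the default namespace and ages out dev and test.

    apiVersion: batch/v1beta1            # batch/v1 on Kubernetes >= 1.21
    kind: CronJob
    metadata:
      name: namespace-inspector          # hypothetical name
      namespace: ops
    spec:
      schedule: "0 * * * *"              # run hourly
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: namespace-inspector   # needs RBAC to list/delete namespaces
              restartPolicy: Never
              containers:
                - name: inspector
                  image: bitnami/kubectl:latest          # any image with kubectl works
                  command: ["/bin/sh", "-c"]
                  args:
                    - |
                      # namespaces that are never deleted (illustrative whitelist)
                      KEEP="kube-system kube-public default ops monitoring argocd dev test"
                      for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
                        echo "$KEEP" | grep -qw "$ns" && continue
                        echo "deleting non-whitelisted namespace: $ns"
                        kubectl delete ns "$ns" --wait=false
                      done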
The second thing which was really important for us was insight into our cluster; we started on that probably the second day after the namespace inspector. We decided on the Prometheus Operator. It was still a beta-stage project at the time, and it basically provides you with a whole setup of Prometheus, Alertmanager and Grafana. It already has dashboards in it, and it has predefined alerts, so the people behind it clearly know what they're doing; they operate a lot of Kubernetes clusters themselves. So if your node disk is getting full, the alert is already defined and pops up. You don't have to do anything; it just works.
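To give an idea of what such alerts look like, here is a hand-written rule in the Prometheus Operator's PrometheusRule format, roughly in the spirit of the bundled node-filesystem alert. The exact bundled rules differ, so treat this purely as an illustrative sketch; the names and labels are assumptions.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: node-disk-filling-up         # illustrative, not the bundled rule name
      namespace: monitoring
      labels:
        release: prometheus-operator     # assumed selector so the operator's Prometheus picks it up
    spec:
      groups:
        - name: node.rules
          rules:
            - alert: NodeFilesystemFillingUp
              # fires when a filesystem is predicted to run out of space within a day
              expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[6h], 24 * 3600) < 0
              for: 1h
              labels:
                severity: warning
              annotations:
                summary: "Filesystem on {{ $labels.instance }} is predicted to fill up within 24h"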
Last year at FOSDEM I became aware of Grafana Loki. We do not provide access to our cluster to our customers, but they want to see the log files, right? And Grafana Loki is basically that: a lightweight log aggregation tool. It's no longer in alpha; we started using it while it was still alpha, aware of the risk. But running an Elasticsearch cluster inside your Kubernetes cluster is huge, a lot of effort, and I wanted to avoid that. Loki basically does exactly what we want: it provides the Jenkins slave and master log files to our customers, and we have log file dashboards for that as well. Now, that's how a dashboard from the Prometheus Operator looks; in this case it's just a node, and it tells you IOPS, network I/O, memory and CPU usage. We also have our own dashboard, and this is probably the interesting thing here: you can already see the auto-scaling mechanism. We scale up to 100 nodes, and two hours later they're gone. If you look at the left side, you can see that we have also introduced a cost model system, so we actually know roughly how much money we burn. It is really important to keep an eye on it, because if we did not auto-scale, our bill would be roughly something like 30 to 50,000 euros a month, and right now it's probably more like 5 to 10,000 euros. There are a few discounts: if you're in a big company you can leverage a bit more, so it gets cheaper. But nonetheless, it's a big issue, right? Everyone has a cost problem. And you can, of course, not do this in your own data center: if you scale up and down something you already own, it does not make much sense. So now we need a tool for our customers to create the Jenkins. We are talking about the customers who are creating the pipelines, so they are very technical, but they only want to have a Jenkins, and it should work, and it should be safe, and that's it.
So we are using Kubeapps. It deploys Helm applications. It's a simple interface and it does exactly that: it gives you, our customer, a way of spinning up an application. And the values YAML file is basically the contract we have with our customers. We don't tell our customers that they have to know how Kubernetes works; but, unfortunately for them, they do have to know how YAML works. So, how does it look? It's a simple interface, a great project, very actively developed, but it's really young. For example, it had a problem with the namespace selector at the top: it showed all the namespaces, and sometimes it didn't show any namespaces at all. On the other hand, it constantly gets new features. So, this is the YAML file the customer actually edits. We are trying to keep all the important things customers really want to change at the top, and we try to document it, but nonetheless a lot of stuff is further down. Basically, what they define is a seed job: they spin up their own Jenkins and tell us roughly how much memory and CPU they want. Which, of course, means that if they create a big Jenkins master with 24 gig of RAM and lots of vCPUs, they will get it, but it will cost them a few hundred dollars per month. Whereas small Jenkinses for testing: this is really great in Kubernetes, it's really simple to just give a Jenkins two vCPUs and four gigabytes of RAM if you're testing something. Or you can give it a big one.
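The actual values file isn't reproduced in the talk; the sketch below only illustrates what such a customer-facing contract might look like, with the important knobs at the top. Every field name here is hypothetical.

    # hypothetical values.yaml, in the spirit of the contract described above
    jenkins:
      master:
        imageTag: latest                 # or a pinned date-time tag of the nightly image
        resources:
          requests: { cpu: "2", memory: 4Gi }
          limits:   { cpu: "4", memory: 8Gi }
      seedJob:
        repository: https://git.example.corp/team-x/shop-pipelines.git   # placeholder
        branch: master
      agents:
        maxConcurrent: 20                # upper bound for build slaves
    network:
      egressWhitelist: []                # extra targets have to be requested from us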
Now you have a lot of stuff configured in your Git repos, and you need to get Jenkins and all the other stuff, all the monitoring, all the Kubernetes applications, the Helm charts, into your Kubernetes cluster. How do you do that? I personally would highly recommend ArgoCD. It's a great open-source project. It does nothing other than making sure that whatever you have defined as an application gets installed from the Git repository into your Kubernetes cluster. So here you see an overview of all the applications we have. We also keep our ingress controller configuration in Git, and ArgoCD makes sure that the ingress controller, as configured in Git, is what is configured in Kubernetes. If you go into the details you can see the sync status; you see all the parts, the service accounts, pods, everything. And the greatest thing about ArgoCD, which I really love, is that ArgoCD provides its own custom resource definition, so you basically define an ArgoCD application through YAML. What you can do, and what you should do (they actually built it so you can do it like this), is configure ArgoCD through ArgoCD. You only have to put ArgoCD onto your Kubernetes cluster once; then ArgoCD sees, oh, there's an application called ArgoCD, I should maintain it, and checks it out. And suddenly ArgoCD manages itself, which is really great. The namespace inspector application looks very similar: not a lot to it, the target revision is HEAD, and it says where the values are coming from. We inject secrets from Vault.
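An ArgoCD Application manifest of the kind described here looks roughly like this; the repository URL, path and namespaces are placeholders, not the team's real values.

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: namespace-inspector
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://git.example.corp/infra/k8s-namespace-inspector.git   # placeholder
        targetRevision: HEAD            # track the latest commit, as mentioned in the talk
        path: chart
        helm:
          valueFiles:
            - values.yaml               # where the values come from
      destination:
        server: https://kubernetes.default.svc
        namespace: ops
      syncPolicy:
        automated:
          prune: true                   # delete resources removed from Git
          selfHeal: true                # revert manual drift in the cluster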
Now, how does the setup look? It's not very big and not very complicated, but already the base layer is 10 virtual machines, so something like 40 or 50 vCPUs is what we need just for that. We have Prometheus and Grafana for our customers and for our own setup. We provide Jenkinses, but we also need a Jenkins ourselves to build images and everything. And we run Dex, which is an OpenID Connect provider: if you have a UI like Prometheus, it's unprotected out of the box, and that's not something you can do in our company. So we put Dex in front, connect Dex to our LDAP server, and then you can use your company credentials to log in and get access to Prometheus or Alertmanager. What we also have is an Artifactory, for caching reasons. Now, in our organization, you saw the environment, right? It's not that much, but already we have over 40 Git repositories. It's huge. Our product is one Git organization, and we have prefixes for all the repositories: the Kubernetes prefix means it's an application running in Kubernetes, a Helm chart or a combination of Helm charts; the shop prefix is for the Jenkins seed jobs and the stuff we configure there. And Terraform: we use Terraform to create the Kubernetes cluster, and then we never talk about Terraform again. Every repository has to have a readme, and for the Helm charts we also have a how-to-maintain document; I will come to this in a second. And we have ops manuals. We have on call, which basically means that if a service is not running, you get a call and you have to fix something. The ops manual tells you what you can do. In the case of monitoring there are a few ideas in there, but it also says that you can simply delete the namespace and resync it with ArgoCD; not a problem. I think right now we could delete the whole cluster, click sync in ArgoCD, and it would just work; most of it should, at least. We have our documentation in Git. It's closer to the code, it's text-based formatting, which is much easier than moving things around in Word.
It works great, and you can use GitHub Pages if you want to, which is really lovely. Now, we have a lot of Helm charts. We have Helm charts for our Jenkins, and we use upstream Helm charts. What do we do? We literally have documentation on where these upstream Helm charts come from, and we have to maintain them: pull them down and change something. We are not finished with the migration, but we have moved to Kustomize, so Kustomize is now patching the Helm charts we take from upstream. And we need to do this for certain things. We have priority classes on every service we use, so we have to have a priority class definition in our Helm charts. If it's not available upstream, we can ask nicely for it to be added to the chart, but in the meantime we have to add it ourselves. And all the images which are referenced in the Helm charts, we are not pulling them from the internet. If you do auto-scaling you cannot do that, because a node doesn't exist for long and cannot keep a warm cache, so you have to cache the images yourself. On Google that's GCR: you basically want to have everything in your own GCR registry. Kustomize: it was unclear when we started what we wanted to use, because there are a lot of options out there and it still wasn't very clear; Helm 3 had only just been released. But Kustomize has been part of kubectl since 1.14, so this is a no-brainer by now, and ArgoCD supports it as well.
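As an illustration of that patching approach, a kustomization on top of a rendered upstream chart might look like the sketch below. The file names, image names, project and workload name are placeholders, not the team's actual setup.

    # hypothetical kustomization.yaml layered over an upstream chart
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - upstream/jenkins.yaml            # output of `helm template` for the upstream chart
    patchesStrategicMerge:
      - add-priority-class.yaml          # adds priorityClassName to the Jenkins master
    images:
      - name: jenkins/jenkins
        newName: gcr.io/our-project/jenkins   # mirrored into GCR instead of pulling from the internet
        newTag: lts

    # add-priority-class.yaml (the strategic-merge patch referenced above)
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: jenkins                      # must match the workload name in the upstream chart
    spec:
      template:
        spec:
          priorityClassName: ci-master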
Keeping it stable: we now have a lot of software running just to provide a few Jenkinses. What we introduced is SRE. We named it SRE; it's probably not quite the SRE you know from Google talks, and there are plenty of Google talks about this topic. It's a stable process. We have this system in our team, so if I go on holiday, I know that certain things keep happening. It's a weekly rotation, so everyone in our team has to do it, and everyone in our team gets the time to actually have a look at every part of our infrastructure. And because we have so many different components just to run the Jenkinses, you actually need this time. It also gives the rest of the team the focus to work on the stuff they're working on without getting interrupted all the time. How does it look? It's a wiki page with a timetable, and then there's a checklist. Nothing magic; works great, though. We are upgrading and maintaining a lot: you can assume one person per week spends something like 10 to 30 percent of their time on it. What we also built is a Grafana dashboard which shows us all the upstream chart versions next to what we are running, to see how we are progressing. And quite often there are new major versions, and we're upgrading again and changing again, which is actually really nice: it feels good to read a new release note and say, oh, this is a really cool feature, I was waiting for this. It's super helpful. But I would not build something and just keep it there; it will rot away. The operations manuals: we have the operations manuals on a service page. If you wake up at three in the morning, you go to this wiki page, click through your ops manuals and the Grafana dashboards, and that's it. Now we are getting really close to the learnings. We have a little problem, right? Our customers are running their Jenkinses, and they are not updating them. I'm not sure who knows that feeling
or not, but we have this feeling quite often. So this time we decided to do something else, and I think it's a good practice; we saw how a few other projects do it. We build a Jenkins image every night. There's a Jenkins master, and this image already contains all the plugins: we pull down the base image, build the Jenkins, run the Jenkins, the Jenkins installs the plugins, and the image is done. We tag it with latest and with the current date-time. Oh, and we delete all images after three months, so basically only images which are at most three months old are available. In theory we could also delete specific versions if a high-severity vulnerability popped up, but that's not implemented yet. So we tell our customers: you have two choices, you use latest or you use a specific date-time version, and the limitation is that your image is only available for three months; then customers have to figure out how they want to handle that. To make sure a Jenkins is not magically running for longer than three months, we simply kill it after three months. We also have network policies, pod security policies, everything in place. So if a customer just spins up a Jenkins and says nothing to us, it cannot connect to the internet, and a lot of stuff is simply missing. For us it's a whitelist approach: they have to come to us and say, we want to connect to this system. And we tell them no. Or, if they really have a good reason why they want to connect to that system, we whitelist it for them, and then it's our responsibility that this connection works. On a big corporate system we actually have to request firewall rules and the like, and we have to put those connections into our security concept.
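The concrete policies aren't shown in the talk, but the whitelist idea maps naturally onto Kubernetes NetworkPolicies: deny all egress by default in the customer's namespace, then allow individual targets. The namespace and CIDR below are made up.

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny-egress
      namespace: team-x-jenkins          # placeholder namespace
    spec:
      podSelector: {}                    # applies to every pod in the namespace
      policyTypes:
        - Egress
      egress: []                         # nothing allowed by default (in practice you would also allow DNS)
    ---
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-whitelisted-target
      namespace: team-x-jenkins
    spec:
      podSelector: {}
      policyTypes:
        - Egress
      egress:
        - to:
            - ipBlock:
                cidr: 10.20.30.0/24      # made-up address range of an approved internal system
          ports:
            - protocol: TCP
              port: 443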
Our learnings. Stateful services on Kubernetes are hard; Kubernetes was not built for stateful services. If someone tells you this works really well, I don't think they have the experience, or they are lying, I think. For Jenkins there is no high-availability setup available; I think there's some CloudBees magic plugin which may or may not work, and we just can't use it. For us it's a big problem because our build runs across something like 100 machines, so if one Jenkins breaks, or five Jenkinses break, it hurts; it has to be really reliable. One build costs something like 30 or 50 dollars, so you don't want to throw that away. If you have much smaller builds and you build things a little bit resiliently, it's not a problem at all. Make sure that you have pod priorities set up: a Jenkins master should not get thrown out because a Jenkins slave is suddenly considered more important than a Jenkins master; that doesn't work out. And system pods will, and should, always win, and they will do so. So if you think, oh, my Jenkins, everything is fine-tuned, it's now stable, then suddenly a system service comes along and it will kill your Jenkins master. I will get into detail about this in a second.
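A minimal illustration of that ordering with PriorityClasses: the names and values are arbitrary assumptions, what matters is that the master class outranks the agent class while both stay far below the built-in system-* classes.

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: ci-master                    # assigned to Jenkins masters via priorityClassName
    value: 100000
    globalDefault: false
    description: "Jenkins masters: must not be evicted in favour of build agents."
    ---
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: ci-agent                     # assigned to Jenkins build slaves
    value: 1000
    globalDefault: false
    description: "Jenkins build agents: preemptible before masters."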
Don't think of the cluster as set up once and for all; assume that you can get rid of it again. So what we do is move our Jenkins artifacts to blob storage, and blob storage is independent of our cluster. Everything which is persistent needs to be in Git or in some external storage; otherwise it's not persistent, and we don't assume that it exists. Take our monitoring, for example: Prometheus only keeps data for 24 hours, but our long-term storage, with Thanos at the backend, pushes it to GCS. There are persistent disks which we use, but we expect that they can be deleted, and they should be allowed to be deleted. For the blob storage we actually built a small artifact browser, secured via LDAP, so you as a customer use the Jenkins artifact browser instead of going to the bucket directly. And for anything long-term there is GCS; you can just use it, and it will always be there.
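For reference, the Thanos object-storage configuration that points the sidecar and store at a GCS bucket is only a few lines; the bucket name here is invented.

    # objstore.yml handed to Thanos (e.g. via --objstore.config-file)
    type: GCS
    config:
      bucket: our-prometheus-long-term   # invented bucket name
      # credentials come from the GCP service account attached to the pod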
Also a big issue: if you have long-running builds, you have to be ready for maintenance, because maintenance will come. In our case, Google forces us to do it. If there's a critical security issue, what do you do? You fix it, right? Sometimes. Most of the time you fix it; in this case Google just enforces it for us, which is nice, I suppose. They also do security by default: here we are the customers, and they tell us what to do. But they also enforce a maintenance window, which is great for me, because I can go to managers and say, hey, GKE is enforcing a maintenance window, so we want a maintenance window, and this is the maintenance window. No discussion needed. At the beginning we tried to make our services as resilient as possible and to tell Kubernetes, you're not allowed to kill this, and that doesn't work. In our specific case Google just deletes it after an hour: it looks at it for an hour, says yeah, yeah, okay, delete it, and it's gone. So instead of trying to make it as hard as possible for Kubernetes to manage itself, we go a different way: we make sure that if something gets killed, and it gets killed more often than you'd wish, it is quickly back. Now, auto-scaling is a big topic. If you do auto-scaling, you run with 10 or 15 nodes and you think everything is
fine; then you do a load test with 100 nodes, and suddenly your monitoring is not happy anymore, because that's 100 nodes more than there were five minutes ago. You have to do load tests, look at the results; it has to work. And there's another thing, not a big problem, but you have to be aware of it. Calico, for example, has a wonderful daemon which analyzes how many nodes are running, and roughly every 10 nodes it reschedules its pods and wants more resources. And if you remember what I said about system services: it is a system service, so it will kill your Jenkins, and instead of using 250 millicores it suddenly uses one CPU, and then suddenly there are three of them instead of two. If you try to keep your cluster highly utilized because you think cost saving, cost saving: yes, cost saving is important, but 100 bucks more per month just to be sure that system services are not killing your Jenkins master is fine. Auto-scaling is also slow. If you have Jenkins jobs which run for an hour, nobody cares about a few minutes of scale-up, but waiting a few minutes for something that should just start is bad. So what we do now: we use auto-scaling to save costs, but we overprovision. We have a pod defined which requests the same amount of resources as a Jenkins slave, with basically minus-one priority. So when someone says, I want a Jenkins slave, the scheduler sees, oh, there's a pod with no priority, I'll just throw it out, and you have a smaller latency again.
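A sketch of that overprovisioning trick: a negative-priority PriorityClass plus a placeholder deployment that reserves roughly one build slave's worth of resources and is evicted the moment a real slave needs the room. Names, sizes and the image are assumptions.

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: overprovisioning
    value: -1                            # lower than everything else, so it is preempted first
    globalDefault: false
    description: "Placeholder pods that only keep spare nodes warm."
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: jenkins-agent-placeholder    # hypothetical name
      namespace: ops
    spec:
      replicas: 2                        # how much headroom to keep ready
      selector:
        matchLabels: { app: jenkins-agent-placeholder }
      template:
        metadata:
          labels: { app: jenkins-agent-placeholder }
        spec:
          priorityClassName: overprovisioning
          containers:
            - name: pause
              image: k8s.gcr.io/pause:3.2   # does nothing, just holds the reservation
              resources:
                requests:
                  cpu: "4"               # roughly one Jenkins slave's worth (assumed size)
                  memory: 14Gi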
Images are not cached: your nodes are not running for that long. They might run for a few hours, but then they're gone. Also, you auto-scale to a maximum number that you define; we have to be sure we don't auto-scale to 1,000 nodes, because that costs a lot of money. So we set something like zero to 10 for our default node pool, which is where our Jenkinses are not running. And suddenly your auto-upgrading no longer works, because you already use all 10 nodes, and if an upgrade replaces one node with another, you need one additional node. So don't run your Kubernetes cluster completely full either; you always have to have a little bit of buffer, and pod preemption kicks in and you get weird behavior if you run at full capacity. So we no longer try to run it that way. If you do all of this, it's effort. You saw that we put everything on one cluster and one node pool, which is, happily enough, good enough for us. It's super easy to define multiple node pools: with node affinity you say, Jenkins masters, you get one node pool, and only the Jenkins slaves get another node pool. That keeps them apart, and the Jenkins slaves are more secure because they are isolated. But it breaks your thinking of: I have one big cluster, I control it with priority classes and I run it as full as possible to save costs. It doesn't work, at least it doesn't work for us. As for the project structure for a Kubernetes cluster:
you have one cluster and you have tons of service accounts, images, everything. Start annotating them up front, so you actually know why a given image is in a given folder and who actually uses it; it gets really messy otherwise. If you think you can run a small Jenkins slave and get away with one virtual CPU and two gigabytes of RAM, think again: Kubernetes itself uses and needs a certain amount of resources, and as the nodes get bigger, the setup gets more efficient. We are now running machines with between four and eight cores; we do not go under four cores, and we probably stay more at six or eight cores per node to get proper alignment. Which, of course, if your budget is a little bit smaller than ours, can mean the difference between 50 and 200 bucks per month, just because you cannot be that fine-grained. I expected to be able to be very fine-grained, but we got over that very quickly. Thank you very much. I have a little bit of time for questions; if you have feedback, send me an email, and if you want to talk to me about this infrastructure in more detail later on, I will wait outside. That's it. In a big company it's really hard to outsource this; it's easy to inner-source it, but hard to outsource.