
What If Component xxx Dies? Introducing Self-Healing Kubernetes


Formal Metadata

Title
What If Component xxx Dies? Introducing Self-Healing Kubernetes
Number of Parts
47
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Kubernetes promises to heal your applications in all kinds of failure scenarios, but why not self-heal Kubernetes itself?
Transcript: English (auto-generated)
Cool, okay, all right. I'm Max, a test engineer at CoreOS. What I pretty much do all day is write logic that spins up clusters, then sometimes shoots things down, and hopes that everything comes back up.
So that's my job. You can reach out to me over social media if you have any questions, or via email as well; that's fine. You can ask questions during the talk or afterwards, we can keep that pretty informal here. What does the company I work for do? I work for CoreOS. We secure, simplify, and automate container infrastructure.
That is quite a broad topic, so maybe some projects that we're involved in: definitely Kubernetes upstream and our own Kubernetes distribution. Then, for example, etcd, the database, and CoreOS Container Linux. These are things that we're well known for, and
those things are actually quite important for the self-healing Kubernetes and self-hosted Kubernetes that I'll jump into in a second. Okay, what is Kubernetes? Who here is familiar with Kubernetes? Okay, who's using Kubernetes? Okay, all right, cool.
Who here has ever heard of self-hosted Kubernetes? One, two, three, okay. Okay, all right, cool. So what Kubernetes is, in the end, is just a platform for running your applications. You can think of it as a platform as a service: you run your applications on top of it.
Now, Kubernetes offers you all kinds of nice features around your applications, like easy deployment and easy scaling; it takes care of a bunch of networking, and in general it just takes good care of your applications, so they stay alive and healthy. Now, all of this tooling is great, but not very useful if the underlying layer is in any way fragile, right?
So if Kubernetes dies, probably your application dies as well. So what we need to do is make that underlying layer very sturdy as well. And how can we do that? Well, we already have all this tooling to make our applications stay healthy. Why don't we use the same tooling to make sure our Kubernetes cluster stays healthy as well?
So what we'll do is run Kubernetes in Kubernetes, and that is not in a second cluster, but actually in itself. You might think that's a little bit crazy, but for now you just have to believe me that it is possible, and we'll go into how that actually works.
All right, so that's the idea of self-hosted Kubernetes. The term self-hosted maybe comes from compilers: writing your compiler in the same language that it actually compiles. And you can go to different levels of self-hosted Kubernetes; there are roughly five of them. Level four is with DNS:
none of the core components really rely on DNS, so you can pretty much run your entire cluster and then just run DNS on top, and you're self-hosted at level four. That's an easy one. Then level three is a little bit more difficult: when you want a scheduler to schedule your pods without a scheduler that schedules that scheduler,
that's kind of difficult to achieve, right? The same goes, for example, for the controller manager. And then you can go further up, like the API server, the core component that everything needs to communicate through. If you don't have that running, it's difficult to start a Kubernetes cluster on something that doesn't have an API server yet.
And then you can go more crazy: for example, self-hosted etcd, the brain of Kubernetes, the database there. That is right now, from our side, still behind an experimental flag, so we don't fully support it. And then you can go even more crazy and self-host your kubelet, which would mean
self-hosting the kubelet that actually talks to the container runtime, so the thing that actually starts containers. It's difficult to start containers without anything that can actually start containers. Okay, so I told you you'd need to believe me that self-hosting is actually possible; today we'll go up to level two.
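As a rough recap of those levels, in the numbering used in this talk (the exact numbering of the lowest levels is my reading of the talk; write-ups elsewhere may number them differently):

```
# Level 4: DNS                            - runs on top of the cluster; easy
# Level 3: scheduler + controller manager - must schedule the thing that schedules
# Level 2: API server                     - everything communicates through it
# Level 1: etcd                           - experimental at the time of this talk
# Level 0: kubelet                        - the thing that starts the containers
```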
We might point at some level one stuff, but that's pretty much it. And now I'm going to go into detail on how we actually self-host our Kubernetes cluster. And that is possible with a nice little tool in the Kubernetes Incubator, called bootkube.
By the way, all the stuff I'm talking about today is open source. You can check it all out after this talk, and you can spin up your own Kubernetes clusters the same way. I'm not going to touch any closed-source stuff, except that I'm running my cluster on AWS, and that is not open source. Okay, so bootkube. We want to start our Kubernetes cluster, and we want to start it in a self-hosted way.
And what we need is, first of all, a node, a machine. And on that machine, we start the kubelet, which talks to the Docker daemon in our example here, and we kick off bootkube. What bootkube does is, first of all, with the help of that kubelet, it starts a full-fledged cluster. So here you now have an etcd, an API server, a scheduler, everything in there.
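For reference, kicking this off looks roughly like the following; a minimal sketch, assuming a machine that already runs a kubelet, and using bootkube's render/start commands (exact flags vary between bootkube versions, and the API server endpoint is a hypothetical example):

```sh
# Render the cluster assets: manifests, TLS certificates, kubeconfig.
bootkube render --asset-dir=assets \
  --api-servers=https://my-master.example.com:443   # hypothetical endpoint

# Start the throwaway bootstrap control plane; via the local kubelet it
# runs etcd, an API server and a scheduler, then pivots to the
# self-hosted copies and exits.
bootkube start --asset-dir=assets
```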
But it's not a self-hosted cluster, it's just a normal cluster. The nice thing about this, though, is that as it is now a full cluster, you can start stuff on it. You could start your applications, but, well, you could also start a Kubernetes cluster on it. So what we do is,
on that bootstrapping cluster, we start our actual self-hosted components. And these are the components that are going to be long-lived, the ones we're going to keep alive for a long time. The other ones are just throwaway, just for the bootstrapping process. Now, the self-hosted components will just idle around. They will pretty much do nothing at this point in time,
just lying there. And they don't really know anything, as the etcd of this second cluster is empty; it doesn't know anything about the world. So what we do as the next step is transfer the bootstrapping knowledge into the self-hosted cluster.
And thereby the self-hosted cluster now knows about its environment, and it knows about itself, running in itself. Now we've got everything ready; we don't really need the bootstrapping components anymore, so what we do is delete them. At this point in time, the self-hosted components kick in.
They notice that they're needed, and they take over the work. And at this point, we have a Kubernetes cluster running in itself. So, why all this madness? Why go so crazy? Why not simply start a Kubernetes cluster and be done with it? Well, first of all, we get a very small dependency chain this way.
We reduce the total amount of tooling that we need, and we get deployment consistency: from now on, we deploy our Kubernetes cluster the same way we deploy applications. So, for example, if I want another scheduler, I just deploy it as a component in my Kubernetes cluster itself.
Then, in addition, Kubernetes offers a lot of nice tooling around introspection, around how to debug my applications. And now I can use that same tooling to introspect my cluster and debug the components that are in there. And then, in addition, maybe some of you have been through this: updating Kubernetes is actually difficult.
So, just as Kubernetes could nicely roll out new versions of your applications before, it can now roll out new versions of itself, simply by doing a rolling deployment of its own components. And then, of course, easy high-availability configurations.
We want to run this in production in the end. So, for example, we would like to have more schedulers, to be redundant. What we can do is, instead of kubectl scale on my application, we can now say kubectl scale on my scheduler. It's the same thing; Kubernetes does all of this for us.
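In other words, something like this; a sketch, assuming the scheduler was rendered as a Deployment named kube-scheduler in kube-system (the name depends on how your cluster was set up):

```sh
# Scaling an ordinary application...
kubectl scale deployment my-app --replicas=3

# ...and scaling the cluster's own scheduler works the exact same way.
kubectl -n kube-system scale deployment kube-scheduler --replicas=3
```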
So all of these benefits are valuable in themselves, but they also matter for our self-healing idea. And what I want to do now is touch on a couple of points where a Kubernetes cluster could possibly fail, and then see how self-hosted Kubernetes would react.
Okay, just so we're all on the same page, let's look at the architecture that these examples apply to. We're running multi-master. It's going to be a production cluster, so you don't want to run a single master, right? That would be a single point of failure, and that would not be a good idea. And on those masters, we first of all start the API server.
The API server is supposed to be a stateless application. It's not; it does a bunch of caching. But anyway, it runs on all of our master machines, and it doesn't really matter how many we have running at the same time, so we can run multiple at once. We don't really have any races here. In addition, we start the scheduler.
The scheduler is a little bit more of a problem. You don't want multiple schedulers to interact at the same time; you only want one scheduler to be active at any point in time. That is very important. Now, Kubernetes offers a very nice leader-election feature that you can build into your applications. And as the components of Kubernetes
are themselves running as applications on our Kubernetes cluster, we can just use that leader election for our scheduler as well. So we can do leader election around our scheduler, given to us by Kubernetes. That means we only have one scheduler running at a time and all of the others just idling, being followers, waiting for the leader to die or anything like that.
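Concretely, this is just the scheduler's own leader-election flag; a minimal sketch of the relevant piece of a self-hosted scheduler manifest (the image is illustrative, not necessarily the one from this demo):

```yaml
# Excerpt from a kube-scheduler pod spec. With --leader-elect=true the
# replicas coordinate through the API server, and only the elected
# leader actively schedules; the others wait as followers.
containers:
- name: kube-scheduler
  image: quay.io/coreos/hyperkube:v1.7.0_coreos.0   # illustrative image
  command:
  - /hyperkube
  - scheduler
  - --leader-elect=true
```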
The same goes for the controller manager: of course, only one is active, with the others as followers. And then there are all of the other components, and these now all manage our worker fleet on the right. Okay, let's go through some scenarios.
So what happens, for example, if one of the API servers dies? Well, as I just said, we're running redundantly, so we pretty much don't care at this point. You probably want logging in there, so if your API server dies every minute, you probably want to wake someone up, because something is wrong. But there's no real reason to worry when it happens once.
Kubernetes will notice that some DaemonSet is not successfully deployed on that node, and it will just start a new API server on it. In the meantime, all the load is distributed over the other API servers and everything is still up and running.
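That works because the self-hosted API server typically runs as a DaemonSet pinned to the master nodes, so Kubernetes simply restarts it there; a minimal sketch (label, image, and the omitted flags are illustrative):

```yaml
apiVersion: extensions/v1beta1        # DaemonSet API group of that era
kind: DaemonSet
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        k8s-app: kube-apiserver
    spec:
      hostNetwork: true               # serve on the node's own address
      nodeSelector:
        node-role.kubernetes.io/master: ""   # one instance per master
      containers:
      - name: kube-apiserver
        image: quay.io/coreos/hyperkube:v1.7.0_coreos.0   # illustrative
        command: ["/hyperkube", "apiserver"]   # real flags omitted here
```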
So, next scenario: what happens if a scheduler dies? Well, if it's a follower, again, you really don't care; the leader will just start up a new one. When it's the leader, the followers race to become leader; one of them will win, take over, and restart the failed scheduler, and you're up and running again. No reason here to wake up any engineers in the night, for example.
Now, this all sounds pretty much like the happy path; you don't really have to interact with anything. The second scenario is a little bit more difficult: what if all your schedulers die? What if all your controller managers die? Well, first of all, you're very much out of luck at that point.
That is very unfortunate: you had three machines, and all three of those machines died or something. That's pretty unlucky, but you just have to intervene a little bit manually. What you can use at this point is the bootkube recover tool. What it does is check etcd, your store,
take all the current state, bake that into manifests, and deploy those manifests as a new cluster, and your cluster is back up. So you don't really have to debug at this point in time.
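The recovery itself is roughly two commands; a sketch, assuming etcd is still reachable (flag names follow the bootkube documentation of the time and may differ between versions):

```sh
# Read the control plane's state out of etcd and render it back
# into plain manifests on disk.
bootkube recover --recovery-dir=recovered \
  --etcd-servers=http://127.0.0.1:2379 \
  --kubeconfig=/etc/kubernetes/kubeconfig

# Re-bootstrap the control plane from the recovered manifests.
bootkube start --asset-dir=recovered
```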
What if all masters die? Okay, same thing: you can use bootkube recover and you're good to go from there. Okay, can I delay the questions until afterwards? Because I only have five minutes left and I've got a full demo, I think. Oh, okay. All right, go ahead then.
Question from the audience: do the masters also run the etcd cluster, or is that separate? So, that really depends. If you're, for example, going self-hosted etcd, you would run it on the masters themselves; but this cluster, for example, does not host etcd on the masters, but in a separate cluster.
Okay, so running multi-master is the way you should do it in production, but that's maybe a little bit boring, so for this presentation let's scale it down a little and go single-master. That's, of course, not the way you're supposed to do it in production, but it could happen: for example, two of your masters die and suddenly you only have one left.
What happens now? What kind of failure scenarios can we mitigate from here on? What this means is that that single master is going to be your single point of failure. All your control-plane components are going to run on that single master. So what happens, for example, if the only API server dies?
Nothing can communicate from now on; the entire control plane is dead at this point. Don't get me wrong, your applications will keep running; they don't really need Kubernetes to run. But you do need it if, for example, you want to roll out a new version of your frontend; you could not do that at this point.
So what we have here is a little tool called the checkpointer, and that's the little bit of logic that I'm going to show you. What the checkpointer does: it's just a pod on each master node, and it checkpoints important manifests, for example the API server's, to disk every now and then. Now you've got those manifests on your disk,
and once, for example, your API server dies, the checkpointer notices and brings that API server back up. So at the top, we checkpoint the API server. In the happy case, the API server dies, and we can bring that API server back up from the node itself,
without needing a full Kubernetes cluster running at that point in time. Now that we have this API server back up, which is just a temporary one, everything can function again. We can then start the real API server on the left again, kill the temporary one, and we're back at a normal cluster.
And here at the very top, we're just checkpointing again, going through the same cycle as before. So an API server failure is really not a big problem here. Now, another problem might be: what if that single master dies? What if that single master, for example, reboots or something like that?
This can happen, for example, if your operating system needs a reboot for updates, but your cluster only has one master. So what's going to happen now? Well, the same logic applies again: we use the checkpointer, and this is pretty much the runbook from here.
So your master will come back up. Hopefully it still has its disk, and the disk is still working. Then systemd will start; systemd will start the kubelet, and the kubelet will start the checkpointer. The checkpointer will look at what it checkpointed to disk in the happy path before.
It will see: oh, there's no API server; I'd better start an API server. And with that, you've got a fully running API server. At that point, your kubelet can communicate again. It thinks it has a full cluster around it, and it will notice: oh, a scheduler and a controller manager and so on were all scheduled on me. It will then start all of that, and your cluster is back up and running.
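If you want to watch that runbook happen on the node itself, the chain is visible from the host; a sketch of what you might inspect (the checkpoint path is an assumption about the checkpointer's layout, not something shown in the talk):

```sh
# systemd brought the kubelet back up...
systemctl status kubelet

# ...the kubelet starts the checkpointer from the static pod manifests,
# and the checkpointer re-activates what it saved before the reboot.
ls /etc/kubernetes/manifests/            # static pod manifests
ls /etc/kubernetes/inactive-manifests/   # assumed checkpoint location
```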
So, I think I've got five minutes left, and I've got a demo. Talking about it is nice, but it's very nice to see it as well. I hope those five minutes are enough.
It takes a little bit of time afterwards; if it doesn't come up right away, we can show it back there later. Okay, what I've got here is a Kubernetes cluster. It's running on AWS right now.
Show labels: I've got these machines, these nodes here, and if I grep for master here, just to make sure and so you believe me: there's only one master here. So if I shoot this one master down, we've got a little bit of a problem. Then: get pods in kube-system.
These are the pods that are right now running on that machine, and these are the pods that are running on that master machine as well. And what you see here, for example: you've got the kube-apiserver running inside your Kubernetes cluster. So this is actually a self-hosted cluster.
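The inspection steps of the demo correspond roughly to these commands:

```sh
# List the nodes with their labels and confirm there is only one master.
kubectl get nodes --show-labels | grep master

# List the control-plane pods: the kube-apiserver shows up as an
# ordinary pod, i.e. the cluster hosts its own control plane.
kubectl get pods -n kube-system
```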
And what I'm going to do now, down here, is take the master machine and reboot it. Let's see if our Kubernetes cluster comes back up. So my SSH connection is lost; the machine is rebooting. You see at the top that the API server is still responding at this point,
so it's still in the process of rebooting; it still responds to my requests. It should soon disappear. There we go. Now the control plane is dead; nothing is responding anymore.
I can try to SSH back into it in a second, and then we can watch what comes back up first.
So what I'm going to do here is watch docker ps: I'm watching for Docker to come back up and show what kind of containers are running on the host. It still takes a little bit of time. There we go, Docker just came up. Now the next thing is that the checkpointer is going to be started. There we go, the checkpointer.
The checkpointer is going to start another checkpointer instance. Now the API server has come up; you see here, the kube-apiserver container just came up. Soon the requests up there should be answered again, because we've got a running API server. Now the kubelet can communicate with that API server and knows what it needs to run on itself,
and there you go: all the components are coming back up, and everything is up and running again. It still needs a second, but yeah, that's this failure scenario handled. I think I should finish now, just half a second more.
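For reference, the kill-and-watch part of the demo boils down to something like this (the host name is illustrative; core is the usual Container Linux user):

```sh
# Reboot the single master and watch the control plane go away...
ssh core@my-master.example.com 'sudo reboot'

# ...then, once SSH works again, watch the containers come back:
# Docker first, then the checkpointer, then the checkpointed
# kube-apiserver, then the rest of the control plane.
ssh core@my-master.example.com
watch docker ps
```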
All right, what are scenarios for future talks? I skipped a bunch of stuff today; self-hosted etcd is definitely a story that could fill an entire talk, and it opens up a whole new set of failure scenarios. Then, of course, AWS dies; that could happen again as well.
Then you could, for example, use Kubernetes Federation; that is also a topic for an entire talk of its own. And then, of course, the internet could die. That would have to be a very creative talk, and I'm not going to give it at any point in the future. All right, so as I said, this is all open source. If you want to try it out and check it out, please feel free.
We're very happy about feedback on any of these things, and as it's open source, it lives from contributions, so feel free to open pull requests. In addition, if you want to get paid for creating pull requests on these repos, we're also hiring, in San Francisco, New York, and Berlin. Feel free to reach out to me, or to Luca or Casey, for example, back there as well.
Yeah, that's it. Again, I'm Max. Feel free to ask questions now, after the talk, and so on. Thank you very much.