
Auto-healing cluster through negative testing


Formal Metadata

Title
Auto-healing cluster through negative testing
Title of Series
Number of Parts
490
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
OCS stands for OpenShift Container Storage. It provides container-based storage for OCP (OpenShift Container Platform) and scales easily across bare metal, VMs and cloud platforms. Auto-healing is a property of an OCS cluster: a cluster component is healed automatically when it runs into an unexpected condition. A component can be a node, a network interface, a service, etc. To make sure auto-healing works as intended, we introduced negative testing, a type of testing that checks how a system behaves under unexpected conditions. In this presentation, we talk about the role negative testing plays and how to negative test a component such as a node by shutting it down, deploying a heavy workload, etc. Similarly, for the network component, we look at what happens when the public network is disconnected, along with many more scenarios.
Transcript: English (auto-generated)
Ready? Yeah. All right. Hey, guys. I'm Rajat, I work at Red Hat, and we do lots of open source stuff. Before we begin, this is where you can find me: this is my GitHub and LinkedIn, so hit me up.
All right. So, today we are going to talk about auto-healing clusters and negative testing. Before we begin, a show of hands: has anyone here ever done any kind of negative testing? Oh, great. All right.
So, before we talk about negative testing, we'll talk about auto-healing clusters. What is an auto-healing cluster? Basically, it's a cluster that monitors itself, and whenever there is a degradation of the cluster, it starts a recovery process. So, it will always make sure that no degradation is left in the cluster.
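The monitor-and-recover idea described above can be sketched as a small loop. This is only an illustrative toy model: the `Cluster` class, the component names and `heal_once` are hypothetical and are not part of OCS, Rook or Ceph.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy cluster: maps a component name to True (healthy) or False (degraded)."""
    components: dict = field(default_factory=dict)

    def degraded(self):
        # Names of all components currently reported as unhealthy.
        return sorted(name for name, ok in self.components.items() if not ok)

    def recover(self, name):
        # Stand-in for a real recovery action (restart a daemon, reschedule a pod, ...).
        self.components[name] = True


def heal_once(cluster):
    """One pass of the self-monitoring loop; returns the components it recovered."""
    recovered = []
    for name in cluster.degraded():
        cluster.recover(name)
        recovered.append(name)
    return recovered


cluster = Cluster({"mon-a": True, "mon-b": False, "osd-0": True})
print(heal_once(cluster))   # components recovered in this pass
print(cluster.degraded())   # empty once nothing is degraded any more
```

A real cluster runs such a pass continuously, and a recovery action may itself fail, which is exactly what negative testing tries to provoke.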
All right. So, now let's jump into testing. I mean, this is what society thinks about people doing testing. In particular, we'll talk about negative testing and how it works. These are some very basic examples of negative testing.
Basically, negative testing means that your system, application or cluster should be ready to handle all unexpected situations gracefully. So, we'll jump back. As you can see here, these are unexpected situations; I mean, these are not the inputs that our application wants. So, if a user or somebody enters this kind of input, our system should be able to handle it gracefully. We'll look into more complex and more practical examples later in the slides.
All right. So, why do we need negative testing in OpenShift, or Kubernetes, as we say? Obviously, to detect unexpected conditions. And if we cover all the unexpected conditions, we also prevent the cluster from crashing. Okay. So, before we jump into practical scenarios, I just wanted you guys to see this.
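As a minimal sketch of the basic kind of negative test mentioned above, the following feeds undesired input to a small parser and checks that it is refused cleanly. The function `parse_replica_count` and the chosen inputs are assumptions made for illustration; the slide's actual examples are not in the transcript.

```python
def parse_replica_count(text):
    """Parse a positive replica count; reject anything else with ValueError."""
    try:
        value = int(text)
    except (TypeError, ValueError):
        raise ValueError(f"not an integer: {text!r}") from None
    if value < 1:
        raise ValueError(f"replica count must be at least 1, got {value}")
    return value


def rejects(bad_input):
    """Negative-test helper: True if the input is refused with ValueError."""
    try:
        parse_replica_count(bad_input)
    except ValueError:
        return True
    return False


# Positive case: the desired input is accepted.
assert parse_replica_count("3") == 3

# Negative cases: undesired input must be rejected, not crash or slip through.
for bad in ["", "three", "-1", "0", None, "2.5"]:
    assert rejects(bad), bad
```

The point of the negative cases is that the failure mode is a controlled error rather than an unhandled exception or a silently accepted value.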
So, this is the cluster. This is the master node, the Kubernetes master node, and these are the worker nodes. Basically, worker nodes are the ones where you deploy a workload. As you can see, these are OSDs. OSD stands for object storage device. In short, you can say that these are the devices or disks used to store data, so you can think of them as normal disks storing data.
Then you have the mon, or monitor, as we call it. This is used to watch over the OSDs. So, if there is any problem with the OSDs, say an OSD is not working fine or you are stuck somewhere while working with OSDs, you can always look into the mon, check the monitor logs, and you will have all your answers.
And these are the RGWs, all right? I don't know what this means because I don't work with RGWs, so I'm sorry. So, anyways.
And these are the rook-agent or the rook-discover pods. These are the operators. So, if there is any kind of problem in this entire worker node, or if you want an overview of what's actually happening, you can always look into these pods. You can check the logs for these pods and you will have your answers.
So, now let's jump back. These are the practical scenarios I was talking about, where you can perform your negative testing. First: what if the cluster gets disconnected from the network accidentally while I/O is happening? Okay, so input and output is happening. Suppose you are on this node, you have some I/O in progress, and let's just say you get disconnected from the public network, all right? What will happen? You can always test this scenario yourself, look at the output, and check whether you have already written any kind of solution for it.
What I mean to say is that, I mean, I'm sorry, I forgot. All right, the next scenario is: what happens to a cluster if a node shuts down? Say my entire node gets shut down. What will happen to the data? What will happen to the monitors? What will happen to anything else? These are the scenarios that you can perform and check. If you have the corner cases written for these kinds of situations, then it's good, but if not, you need to write them.
So, what are we testing today? First, we are going to disconnect the cluster from the public network, and second, we are going to detach the disk from a running mon. As I told you, these are the monitors, and there is a disk attached where the monitor runs. So, we are going to detach that and see the outcome.
So, I mean, I'm sorry. Yeah, can you see it? All right. So, here as it starts, this is the command to detach a disk from a monitor. This is the name of the VM, I mean the virtual machine or node where my monitor is running. This is the name of the image, and then I just disconnect it. All right.
So, now the disk is detached. Now, we check the output to see what is happening. So, this is, you can say, a kind of dashboard where you can see the entire health of your cluster.
Right now, I'm running a Ceph cluster, so I can check the entire cluster. Right now, there is no problem with my cluster at all, because health is OK. All my daemons are up. My OSDs, the storage devices I told you about, are all working just fine. So, all right.
So, as you can see, as soon as I detached the disk from my monitor, my monitor got into Error status. I was just highlighting it. And now, as you can see, the status of my Ceph cluster has changed. It says that one of three monitors is down. So, that means when I negative tested it, there was already a solution written for it. That means it has already been verified: you know that the cluster will work fine if it gets into a situation like this, so you need not worry about it now.
Just a second. And yeah, there it goes. I'm sorry. Yeah. So, as you can see, it changed its status from Error to CrashLoopBackOff.
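One reason a "one of three monitors down" warning is survivable is that Ceph monitors only need a strict majority to keep a quorum. The arithmetic can be sketched as follows; the helper names are hypothetical and this is not Ceph code.

```python
def has_quorum(total_monitors, monitors_up):
    """A strict majority of the monitors must be up to form a quorum."""
    return monitors_up > total_monitors // 2


def tolerable_failures(total_monitors):
    """Largest number of monitors that can be down while a majority remains."""
    return (total_monitors - 1) // 2


# The situation from the demo: three monitors, one of them down.
print(has_quorum(3, 2))        # True: two of three is still a majority
print(has_quorum(3, 1))        # False: losing a second monitor breaks quorum
print(tolerable_failures(3))   # 1
```

So the demo exercises the one failure a three-monitor cluster is expected to tolerate; a second monitor failure would be a different negative test with a different expected outcome.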
So, that means it's a kind of loop. What I mean to say is that it started an auto-healing process. It is constantly running in a loop, trying to find the disk and re-verifying whether it is there or not. But my monitor is not able to find the attached disk, and that is why it keeps running in a loop to find it again and again.
So, now what I'll do is attach the disk again. So, here is the command. All right.
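The retry loop behind CrashLoopBackOff is not a tight loop: the Kubernetes documentation describes restarts with an exponential back-off delay (10s, 20s, 40s, ...) capped at five minutes. A minimal sketch of that schedule, assuming those documented defaults (the function name is hypothetical, and actual values may differ between versions):

```python
def backoff_delays(restarts, base=10, cap=300):
    """Delay in seconds before each restart: doubles every time, up to a cap."""
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays


print(backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

This is why a pod in this state can appear idle for minutes between attempts, which matters when timing how long a negative test should wait for recovery.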
Yeah. So, I think the player is there, which is why you aren't able to see it. But yeah, if you could see the command right now, this is the command to attach the disk back to the monitor. Now, we'll check the status again. Right now the health is still in the warning phase.
Now, we got a new error, CreateContainerConfigError. So, even with the disk attached, we are still not able to get the monitor running again. So, this is the problem. Now you have an output, and you can report this back to the developers or write code for it yourself, so that whenever you attach the disk, the cluster is up and running again. But that doesn't work.
So, now I have to manually delete the pod. People who are familiar with Kubernetes must know the concept of a replication controller: whenever you delete a pod, the pod comes back up again. So, that is what is going to happen. I've deleted the pod and the pod will come back again.
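The "delete a pod and it comes back" behaviour can be modelled as a controller that keeps the number of pods equal to a desired replica count. The sketch below is a toy model only: `ToyController` and the pod names are hypothetical, and it does not call the Kubernetes API.

```python
import itertools


class ToyController:
    """Keeps len(pods) equal to the desired replica count."""

    def __init__(self, prefix, replicas):
        self.prefix = prefix
        self.replicas = replicas
        self._ids = itertools.count(1)  # source of fresh pod name suffixes
        self.pods = set()
        self.reconcile()

    def reconcile(self):
        # Create replacement pods until the actual state matches the desired state.
        while len(self.pods) < self.replicas:
            self.pods.add(f"{self.prefix}-{next(self._ids)}")

    def delete_pod(self, name):
        # Deleting a pod only changes the actual state; the controller reacts to it.
        self.pods.discard(name)
        self.reconcile()


controller = ToyController("mon", replicas=3)
controller.delete_pod("mon-2")
print(sorted(controller.pods))  # still three pods; the deleted one is replaced by a new one
```

Note that the replacement is a fresh pod rather than the old one restarted, which is why deleting a pod stuck in an error state can clear the problem.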
I'm checking the pod status again. So, yep. So, as you can see. All right.
Okay. Again, the player is there, so you cannot see it. But this is the mon which was in the pending or error state. Since the player is here, you are not able to see the, oh, okay, good. So, as you can see, it got into the Running phase again when we deleted the pod. So, this was one thing. The first test that I did was detaching the disk from the mon.
Now, I'm sorry. Yeah. So, now we'll look into disconnecting a cluster from the public network. What we're going to do is disconnect one entire node from the cluster by shutting down its public network interface. So, let's see what happens. All right. So, this is the node and this is the IP.
And now, we first SSH into the node, and then we get the public network interface for this node.
All right. So, the command that I'm typing is to shut down the public network interface. Once the public network interface is down, nobody should be able to access the node. This is what I was expecting before testing it. So, now, let's see the result. If it behaves the way it was coded, it's fine. But if it doesn't, then that's a problem.
Yeah. So, I shut it down. Now, let's see whether I'm able to access the node or not. Okay. Still, there is no response. That is a good thing: once you have shut down the interface on one node, it's not able to give back any kind of information. That's why this command is not completing.
So, now, here, I'm going to try to access the same node again with the same IP. And as you can see, I got a warning that it's possible that someone's doing something nasty, and I'm not able to access the node. So, that verifies that once my public network interface is down, nobody is able to access it.
Just to verify again, this is the pod that was running on the node that I just shut down. Let us try to access this pod and see whether it works or not. Yeah. And again, I was not able to access the pod, because the node on which this pod was running is already shut down. So, this is happening as expected. So, yeah.
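The manual "can I still reach the node?" check can be automated with a small TCP probe. This is a sketch under assumptions: the function name is hypothetical, and probing the SSH port is only one possible way to express "the node is unreachable".

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False
```

In a negative test for this scenario, the expected result after the interface is shut down is `False`, for example `is_reachable(node_ip, 22)` where `node_ip` is the address of the disconnected node.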
So, that is how I negative tested these two scenarios. And yeah, that concludes it. And if you have any questions. So, yeah.
So, firstly, you need to have some unexpected situations figured out already, ones which don't happen very often. And then you can automate things if you want, or you can negative test them. So, I'm sorry. Can you come again with the question? Yeah, sure.
Yes, explicitly. Yeah, yeah, yep, yep. So, yeah, node level. Okay, yes, yes, yes, yes, yeah. Sorry. All right. So, the thing that he was asking, if I understood correctly, is: if my mon were to shut down and some kind of permanent cluster damage happens, then what does one do, right? Yeah, so.
So, maybe, like, people are not familiar.
If you receive an email, where can I select begin and end of the talk,
and then it will be recorded and published automatically, we will receive another email. So, right now, there is no point of doing it nicely if somebody wants to check it out, right now? So, right now, exactly, right now, no. Again, it will be available in one to two hours. You will get an email, okay? So, we have some presents for speakers.
Oh, really? We have a beer for you from SUSE. We've got these socks for speakers. So, this is from us. Thank you. Thank you. Okay, do we need anything?
We need to put this microphone, yes? HDMI, yes. Just put it, like, here? Oh, yeah. Here is good, yes. It's perfect, somewhere, in your belt or your pocket. Actually, I'll bring over a chair for some of it.
Between the sticker lines, this one, this one for the camera to see you. Ah. Yes, and if you sit there, the camera will not see you. So, pretty much, it will not be able to see you. Okay, maybe I can I put my laptop over here somehow. No, not really, I guess. Really? No, you can put it here if this is comfortable for you.
Ten minutes, I guess. We'll start right on time, right? Yes, yes, just for time. So, I can just, like, get it here. Yes, you can try. It's just, like, I don't know, this space is for you, you know? But what will happen if I accidentally hit any of these buttons?
Nothing. These are lights. You cannot hit them so hard that they do something bad. Yes, I think maybe I'll just be standing.
Or, I mean, you think? I mean, you can sit, but you hide behind the computer. Yeah, I'll just try to do it standing. It's just going to be some typing.
I'm not showing you how much time we have left. Yeah. So, when you have a question or discussion, repeat or summarize for the people who are watching on the live stream. And for the recording, speak louder if you can. Yeah, I think I tend to speak pretty loud at these things,
so I'm just afraid that I'll kind of blast these guys too much. I will control this on the camera. Ah, okay. Great. It's the first time I've tried to do this, I mean, basically in a very long time, so it's like, I think the first time it's just like a really cool conference,
but it's also very casual, like I sign up for this thing, and I just walk in here on the day, and then I didn't know if there was somewhere I had to go and sign up or something like that. Didn't you receive like an email with instructions for the FAQ on the website? Yeah, I tried to find some information, but I don't know.
We should do a Git repo. Yeah, yeah. Speaker survival guide. Yes, yeah, because I looked in there, like on the first page for like, if there was some notes for speakers or something like that, but it's not like... No, not really, I just like it. Five sentences.
True, but... You can send it to being recorded, everything is created, comments, that's it. Yeah, yeah, yeah, that's fine.
Is it up? Yes, I think it's in the sixth file. In question, I think I'm going to start a Google. Thank you, God.
No, I probably started two times last year, you know. There's no matter, these could have been the heads, but... Yeah, so now it's up. But for us, it's like a micro, I would say for us, maybe there's not so much to tell the file.
30, I thought it was 25. Yeah, but I thought we started 14, 25. We just moved it up, or what? Sure, ah, I believe it's 14, 25, let me see. Oh, okay.