
oVirt Hosted Engine: The Egg That Hosts its Parent Chicken


Formal Metadata

Title
oVirt Hosted Engine: The Egg That Hosts its Parent Chicken
Title of Series
Number of Parts
199
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
For several years now, oVirt has managed Virtual Machines. Then came the question: can you run oVirt inside a VM, which in turn will be managed by the hosted oVirt? In this session we'll look at the intricacies of an egg hosting its parent chicken. We'll cover the various aspects starting with installation, going through standard operations, and ending with high availability for the hosted engine. Participants will be able to get insight into this unique setup, which will save them a physical server (or even two) while allowing standard flows to run the same way they did in the past years.
Transcript: English (auto-generated)
Hi guys, so I suggest that we get started. Hi everyone. My name is Doron. I'm from Red Hat. I've been leading the SLA and scheduling team for the past two years, and I've been at Red Hat for five years. And we're going to talk today about Hosted Engine,
or as we call it, the chicken and the egg, and we'll soon understand why. These are the things that we're going to cover. And if you have any questions, we'll take it at the end. So let's start. The main thing I would start with, with regards to Hosted Engine,
is the fundamental question. The fundamental question for us was, why did the chicken cross the road? This is something we were thinking about, or were trying to understand. Why do we need the chicken to cross the road? This is the real philosophical stuff, as you probably know. And we didn't, or we couldn't come up with the right answer in that context.
So, anyone has an idea why did the chicken cross the road? Anyone? Sorry? Nice, but no. Anyone else? Sorry? Okay.
So I hope that by the end of this session, we will all be able to answer this question. So, with regards to the Hosted Engine, let's start with understanding what is Hosted Engine all about.
So, we would like to use a standard oVirt installation, but we need that installation to happen inside a virtual machine. That virtual machine has to be highly available, so we won't lose it in case of a host crash. The problem is that that VM is managed by the application it is hosting.
And that's very challenging. So that's the chicken and the egg problem. And you probably know the drawing, and if you know Escher, there are many more. So that's Hosted Engine, but why do we really need it?
So, the answer is very simple. It's all about money. It saves money. And if you have failover equipment and special equipment, it will save you a lot of money. So that's the reasoning behind it. But as you know, nothing comes for free. While we were trying to implement that,
we were actually looking at some serious challenges. So to begin with, we had the whole chicken-and-egg stuff: how do we end up in a situation where the VM is running an application that is monitoring and controlling that same VM? That's quite a bit of a headache, which we had to deal with, and we will soon see how exactly we solved it. So that was the first part. But then, once setup is done, we need to make the VM highly available. We need to make sure that we handle the network in case of network connectivity loss,
handle all sorts of troubles we have in life, as you probably know. Life is very hard in the data center, so we need to be able to handle unexpected things as much as we can. So load balancing, maintenance, everything we need to do with the standard equipment, we will have to do here, but as you can see, it's a bit more complicated.
So when looking at these issues and understanding what's the right or the possible way to do it, we were trying to look for existing solutions. So let's take a look at what we know. So some of you are probably familiar
with VMware's clustering file system. There's only a small problem with that: it's proprietary, so we can't really use it. So that's one thing. The other option we have, at least Red Hat has, is RHCS, or Red Hat High Availability as it's called today, and there's another option of using Pacemaker. They have a standard file system. They use the Corosync protocol and library. It limits you to a specific number of nodes, you can't extend beyond that, and there's no real oVirt Node support, and most of our users actually use oVirt Nodes. So this is how we started. This is the market, or the available solutions as we saw it, and we said, okay, let's try to think a little bit more about that. Let's try to consider a standard file system, not a proprietary one, and go for sanlock leases. Sanlock (I don't have a special slide on it) is basically a locking mechanism. It was developed as a part of the oVirt community project, and it enables us to provide leases for disks or for VMs. So we were saying, let's make sure that sanlock provides us with the locking mechanism, and we will work with NFS or other standard file systems. That should bring us to something which is simple enough. It's open source, so that should be the first thing: open source, simple enough. It's focused on virtual machines, so there's not a lot of logic behind it like the other solutions have, because they would like to provide high availability for other resources and entities as well. So it's pretty much focused on exactly what we need. It's much simpler, and it should be easier to implement.
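To make the idea of a lease on shared storage a bit more tangible, here is a toy Python sketch. It uses a plain advisory file lock on a shared file as a stand-in, which is not how sanlock actually works (sanlock implements disk-paxos leases with I/O timeouts), and the path is made up:

    # Toy illustration only: a "lease" as an exclusive lock on a file that lives
    # on shared storage (for example an NFS export). NOT sanlock's API.
    import fcntl
    import os
    import time

    LEASE_PATH = "/shared/hosted-engine.lockspace"   # hypothetical path on shared storage

    def try_acquire_lease(path):
        """Return an open fd if we got the lock, or None if another host holds it."""
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)   # non-blocking exclusive lock
            return fd
        except OSError:
            os.close(fd)
            return None

    if __name__ == "__main__":
        fd = try_acquire_lease(LEASE_PATH)
        if fd is None:
            print("lease held elsewhere; not starting the engine VM here")
        else:
            print("lease acquired; this host may run the engine VM")
            time.sleep(5)    # pretend to run the VM for a while
            os.close(fd)     # closing the fd releases the lock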
So with this concept in mind, we decided to go and start looking at the architecture. We had several discussions. The architecture at one point was too complicated, so we decided to simplify things, go for a standard three-layered classic architecture.
In every one of those, you will have the UI, the business logic, and the data layer. So in our case, we have the CLI (that's Linux); we have the oVirt HA agent, and we'll soon see exactly what each of them is doing; and we have the oVirt HA broker. The broker basically connects us to the shared storage, and that's it, a very simple architecture. So let's start diving into it, starting with the CLI. I'm not going to read all of that; that's something you can actually do with a simple --help. But in general, the CLI is very rich. It gives you the whole VM lifecycle and even more: any storage-related functionality that we need, status reports, password changes, anything else. It's very basic, I know, but it's Linux, and it gives you all the functionality that you need in the beginning, including maintenance support, as we will probably cover later on. So nothing really, I mean, it's not rocket science, just a very basic CLI which gives you all the functionality that you need.
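A few typical invocations of that CLI, wrapped in Python only to keep all the examples in one language; the flags shown are the ones I would expect from the tool's own --help on a 3.x setup, so double-check them on your version:

    import subprocess

    def hosted_engine(*args):
        """Run the hosted-engine CLI and return its textual output."""
        return subprocess.check_output(("hosted-engine",) + args,
                                       universal_newlines=True)

    print(hosted_engine("--vm-status"))                    # per-host score, VM state, health
    # hosted_engine("--vm-start")                          # start the engine VM on this host
    # hosted_engine("--set-maintenance", "--mode=global")  # pause HA monitoring cluster-wide
    # hosted_engine("--set-maintenance", "--mode=none")    # back to normal HA operation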
Moving to the next level, we have the HA agent, or as we call it, the brain. Basically, that's the component that has the state machines, the logic, everything that is related to high availability. It's a standard system service. It can crash and burn; we have another one that will make sure that we handle this situation.
If something wrong happens to the VM, then that module will take action: either restart the VM, or migrate it, or do something else, based on the state machine that is relevant to that situation.
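As a rough illustration of the kind of decision such a state machine makes, here is a small sketch; the state names, scores, and actions are invented for the example, and the real agent has many more states (maintenance, unexpected shutdown, and so on):

    ENGINE_UP, ENGINE_DOWN, BAD_HEALTH = "UP", "DOWN", "BAD_HEALTH"

    def next_action(state, local_score, best_remote_score):
        """Pick an action from the engine VM state and the host scores."""
        if state == ENGINE_DOWN and local_score >= best_remote_score:
            return "start-vm"      # nobody is better placed than us, so start it here
        if state == ENGINE_UP and local_score < best_remote_score:
            return "migrate-vm"    # a healthier host exists, move the engine there
        if state == BAD_HEALTH:
            return "restart-vm"    # the engine is unresponsive, restart it
        return "monitor"           # otherwise keep monitoring

    print(next_action(ENGINE_UP, local_score=1600, best_remote_score=2400))   # migrate-vm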
In order to get the information and connect to the storage, we have another layer, which is the next one, the data layer, and that's the HA broker. The HA broker, or the middleman, is basically an intermediate layer between the logic and the actual data. It's a standalone service, so again, it can crash and burn.
We have another one as a backup. We use shared storage, so this is actually connecting to the storage. It's writing, and it's updating and reading from specific areas in the storage, and we will soon see exactly what we're doing in the storage.
So that's the storage part, and with regards to monitoring, it has a very nice, pluggable architecture, so you can actually create your own monitoring components. Basically, it's a small script, like a ping script, or something else, like SNMP, that should report back and give us some sort of indication.
So if we want to ping the gateway, for example, we have a very small script that does exactly that: it takes a gateway IP address as an argument, and it returns true or false, basically, depending on whether we have a ping to the gateway or not. The same goes for CPU load and the three other built-in monitors that we have. But going forward, as you can see, this is pluggable, so you can add your own, and one day we should be able to support other HA VM types, so you can monitor other characteristics or attributes of the VM, based on the VM content. So that's about monitoring.
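In the spirit of such a pluggable submonitor, here is a minimal Python sketch of a gateway-ping check; it is a stand-in rather than the actual ovirt-hosted-engine-ha submonitor, and the address below is only an example:

    import subprocess

    def ping_gateway(gateway_ip, count=1, timeout=2):
        """Return True if the gateway answers an ICMP echo request, False otherwise."""
        result = subprocess.call(
            ["ping", "-c", str(count), "-W", str(timeout), gateway_ip],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return result == 0

    print(ping_gateway("192.0.2.1"))   # documentation-range address; use your real gateway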
The purpose of monitoring is to provide us with the information, and then the logic part, which is the agent, should be able to calculate the situation and understand what should be done. So for that purpose, we made up some sort of a small system that scores a host based on the monitoring results that we got. It's a very simple mechanism which is bitwise: basically, every bit tells us whether it's a ping issue, or a gateway issue, or anything else. We end up with a score, and the score represents the host status. In this way, it's very easy to get a number, and that number represents the host's suitability to run a VM which is highly available, the hosted engine in this case. Then all we need to do is compare every two hosts, which is rather simple: the bigger one will be the better one. So that's scoring.
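To make the scoring idea concrete, here is a small illustrative sketch; the penalty names, bits, and weights are made up for the example, and only the maximum score of 2,400 comes from the talk:

    BASE_SCORE = 2400                                        # the "perfect" score
    PENALTY_BITS = {"gateway": 0b001,                        # gateway not pingable
                    "cpu":     0b010,                        # CPU overloaded
                    "memory":  0b100}                        # not enough free memory
    WEIGHTS = {"gateway": 800, "cpu": 400, "memory": 400}    # illustrative weights only

    def score(problem_bits):
        """Start from the full score and subtract a penalty per reported problem bit."""
        s = BASE_SCORE
        for name, bit in PENALTY_BITS.items():
            if problem_bits & bit:
                s -= WEIGHTS[name]
        return max(s, 0)

    host_a = score(0b000)   # healthy host: 2400
    host_b = score(0b001)   # gateway unreachable: 1600
    # The host with the bigger score is the better place to run the engine VM.
    print(max([("host_a", host_a), ("host_b", host_b)], key=lambda h: h[1]))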
Now, looking into the storage: the storage is a bit tricky because, as you understand, this is the part where we synchronize all the hosts. All the hosts should be able to read the same information and act upon it, and we should be able to understand what happens if a host disconnects from the storage, or crashes, or something else happens, let's say the services died, for example.
So what happens is that we create a special storage domain in the shared storage. We only create it once during the installation of the first node. That storage domain will end up holding the hosted engine disks
and two specific files. On NFS these are files; going forward, with other storage types, they will just be areas or blocks. So we have a sanlock lockspace file, and we have the agent metadata file. Currently we're supporting NFS; that's the first thing that we were able to achieve, but going forward, we will support other types.
So this is basically the path where we should find these files. We have the lockspace for sanlock, and we have the metadata for the agent. The sanlock part is pretty simple for the sanlock guys, but the metadata is actually something new, again, that we had to design. So let's take a look at the metadata. It's not very complicated. We're using four-kilobyte chunks, one per host; chunk zero is basically cluster-level stuff, so the first host will be using the first real chunk.
That's pretty much the general file structure. It's not very complicated, but the interesting stuff actually hides inside. So if we dive into one of the chunks, we will actually see that it's divided into two parts: the first part is 512 bytes, and the second part is everything else. The reason we had to split it is atomic writes: we need to make sure that the writes are atomic, and that can only be guaranteed up to this size. So the first part, the first 512 bytes, simply holds a list of bytes; this is what you will actually see inside the storage if you take a look.
The rest is something which is more human readable. It's for us. So we should be able to debug it, or if something happens, you can actually retrieve that data, take a look and understand what's going on. For example, we have a very nice timestamp, so you can actually see if the host disconnected or was not connected for a long time,
so that could actually be a reason not to use that host, or to migrate a VM away from that host. There are several other things here, such as the score, which is quite explicit, and everything else. So that's what we made sure the storage will include.
We have atomic writes, which means that every time the broker writes, everyone else will be able to read that information; the same information is being read by all the other hosts. So that, in general, is all that we needed.
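As a toy reader for that layout (a 4 KiB chunk per host, the first 512 bytes being the machine-readable part that is written atomically and the rest a human-readable mirror), here is a sketch; the path and the exact field layout are assumptions for illustration, the real format belongs to ovirt-hosted-engine-ha:

    CHUNK_SIZE = 4096     # one chunk per host; chunk 0 is cluster-level
    ATOMIC_SIZE = 512     # only this much is written atomically

    def read_host_chunk(path, host_id):
        """Return the machine-readable and human-readable parts of one host's chunk."""
        with open(path, "rb") as f:
            f.seek(host_id * CHUNK_SIZE)          # host 1 lands on the first real chunk
            chunk = f.read(CHUNK_SIZE)
        machine_part = chunk[:ATOMIC_SIZE]        # what the broker actually parses
        human_part = chunk[ATOMIC_SIZE:].rstrip(b"\0").decode("utf-8", "replace")
        return machine_part, human_part

    # Hypothetical usage (the path is made up):
    # raw, text = read_host_chunk("/path/to/ha_agent/hosted-engine.metadata", host_id=1)
    # print(text)   # timestamp, score, hostname and so on, as shown on the slide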
We have a locking system. We have a way to synchronize all the hosts together on shared file system NFS, and basically that's it. The only thing we're left with now is to solve the whole chicken and egg stuff, which means how do we install a VM with an operating system, with the application,
and then control that VM from the application it's hosting. So this is the basic flow. The first one is a bit long. The second one and later happens much faster. So we start with the basic setup.
Then we run the hosted engine HA; we have VDSM installed, and we create the first and only storage domain we have for high availability. Then we need to start the VM for the first time. It's an empty VM, and its disks are located inside the new storage domain we just created. We install the operating system and oVirt, and we need to reboot everything to make sure that it's persistent. After the reboot, we get a lease for that VM on that node, which leads us to a VM running the oVirt engine.
So that would be the flow for the first host. Anything else we see is much faster. Basically, we are starting the setup, running the hosted engine HA. VDSM is installed, and we are actually able to detect that this storage domain already exists, so we will ask the user: hey, is this HA, and are you working on the second node or above? You will say yes, we will simply copy all the settings from the previous node, and that's it. You're all done.
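In practice this boils down to running the same deployment command on every host, with only the first run being the long one. A sketch, again in Python just to stay in one language; the --config-append flag for pre-seeding an answer file is from memory and may differ on your version, so treat it as an assumption:

    import subprocess

    # First host: the long, interactive run that creates the HA storage domain and the VM.
    subprocess.check_call(["hosted-engine", "--deploy"])

    # Additional hosts: the setup detects the existing storage domain, pulls the settings
    # from the first host over SSH/SCP, and mostly just asks for the new host ID.
    # subprocess.check_call(["hosted-engine", "--deploy",
    #                        "--config-append=/root/answers.conf"])   # flag name: assumption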
So let's see how it looks in real life. I'm not doing any live demos here, because, as you all know, the network here is a best-effort one, so I'm not taking the risk. So it's not going to look very good; I'm using my amazing graphic support instead. It doesn't look very good, but in general, this is how we start: we run the setup.
It's telling us that it's about to start creating a VM and do everything needed; there's a lot of stuff running here, and then we're asked for the storage configuration. We should give the shared storage path. As you can see, we're planning for Gluster, and there's more coming; this is quite advanced stuff from my debugging system. Then we continue to the network configuration and the VM configuration. We can decide how much memory we want, what kind of CPU, how many CPUs, and we end with the hosted engine configuration itself. Basically, we are asked for the ID of the first host, which by default is one, and some other relevant information like the password, and so on and so forth.
Then it will keep running until it finishes everything. It will create the VM for the first time, and it will provide us a way to connect to that VM. As we remember from the flowchart, we now need to install the operating system. We install the operating system, we reboot, and the VM is automatically created again for us, now with the operating system. We are asked to install the oVirt Engine inside. This runs through, and that's basically the end: the HA services are enabled, everything is restarted to take the locks again, and officially we have Hosted Engine successfully set up on the first node. That's what you'll see when you browse into the administrator portal: if you go to the Hosts tab, you will indeed see the first host that we managed to install.
Okay, that's the first node. If we go to the Virtual Machines tab, you will be able to see all the details of the virtual machine; we have the hosted engine alive. The next task will be to come up with a second node or more. That, as I said, is much simpler, and without it, we don't really have high availability. So we simply run a deploy. We can provide an answer file, but if we don't provide one, then once we understand that there is already a storage domain for hosted engine, we will ask you to be able to SSH into that machine, and we will automatically use SCP to copy all the relevant settings for the second node or any later one. We will only ask you for an ID for the new node that you just created.
The default is two in this case. Basically, that's it. You can go into the administrator portal, we will tell you that you have another one set up, and this is what you'll see: we have two hosts running now. Okay, this is after some time, but this is the hosted engine itself.
Now we have two nodes covering for the hosted engine. So if one of them crashes, we're fine: the VM will be resumed or restarted somewhere else automatically. Even if the host doesn't crash and we only lose the network or storage or something like that, we will be able to migrate or resume it on another host.
So that's basically the whole setup part, which gives us this very nice picture. If we want to try some simulations, as I said, this is all something I already made for you.
So this is the report you get from the status query at the host level. The first host is called hosted engine two, and you can see that in this case we have the VM in an up state, we have a good health status with the maximum score, and the gateway is fine. On the second host, the VM is down, because it's running here; we also have the perfect score of 2,400, and the gateway is down. In order to test it, I blocked the gateway on hosted engine two,
which is where the hosted engine VM is running. So this is what we're going to see. So the VM was running here. That's the host. If you take a look at the virtual machines, you'll see that there's one VM migrating. It's actually migrating away from here and into here.
If you think about it, the VM is running the engine, and it's live migrating itself. So what you see here is the live migration happening, which is quite amazing. That's the VM itself; you can see the status is 'migrating from', and it's currently running on hosted engine two, but it will move to hosted engine three eventually. So that's how it looks, and that's the expected result. This is what you will see in the report: you can go in and see that the gateway is not pingable, basically, and that's why we reduced the score, and that resulted in the live migration of the hosted engine VM itself. Here, everything is fine and the VM is up. So basically, that's it.
We can return to the fundamental question of why did the chicken cross the road? Any answers? Could have been the gateway. Almost. From what we tested, it didn't cross the road, it was migrated.
That's it. Questions? Yeah. No, that's exactly what I showed: you just need to run the hosted engine setup, but it will automatically vacuum up all the settings from the first one. So it will only ask you what the ID is, basically, and that's it, and you're all set. So the long installation only happens the first time, because we have the whole chicken stuff,
but once you're through, it's very fast. Okay. The question was whether it's possible to install the operating system of the hosted engine via PXE boot,
and the answer is yeah, definitely yes. Yeah. So can you repeat the question? Yeah.
GlusterFS? Right, oh, Gluster, I heard Gluster. So let me just repeat the question. The question, as I now understand it, is how far away is the GlusterFS support? So in general, what we did here during the implementation is that we use standard VDSM libraries. We are working with VDSM to create and maintain the storage parts of hosted engine, and since VDSM already supports everything,
in general, it shouldn't be that difficult. We are now trying to, sorry. In general, we are now trying to make it work. So not a good idea.
You run through this. It's a nice recap. Yeah, another one. Almost done. So since we're using VDSM libraries,
we don't have a lot of work to do; the gap is not that big, but we have to test it. So actually, Gluster should come up in the near future with bare-metal Gluster support or something like that, so we're considering taking advantage of that. But in general, for SAN and block devices, we're testing it now, and I hope that we won't have a lot of work attached to it. So I would expect that to come in one of the next versions; I'm not sure, well, it's not 3.4, but probably 3.5. Okay, the next question is: is Ceph also on the list?
So previously, I was telling you that one of the solutions we saw was Pacemaker. So yeah, it's either Ceph or Pacemaker. We're still considering that: in the next major version, we may want to look again at Ceph or Pacemaker. So we're not ruling that out completely. It may still work for us, but we need some mileage working with Hosted Engine first; again, it's like a new technology for everyone here, so we need to make sure that it's completely reliable. And then, if we can improve it by integrating with something like Ceph or Pacemaker, then of course it would be very nice. More questions? Okay, thank you very much.