
Taking the red pill


Formal Metadata

Title
Taking the red pill
Subtitle
Charting the rabbit hole to improve FreeBSD performance on Xen
Series title
Number of parts
24
Author
License
CC Attribution 3.0 Unported:
You may use, adapt, copy, distribute, and transmit the work or content in unchanged or adapted form for any legal purpose, provided you attribute the author/rights holder in the manner specified by them.
Identifiers
Publisher
Year of publication
Language
Production year: 2014
Production location: Ottawa, Canada

Content Metadata

Subject area
Genre
Abstract
The Xen hypervisor is an Open Source Type 1 hypervisor; it is widely used in production environments such as Amazon EC2 and Rackspace. Since its inception, one of the focuses of Xen has been to be an OS-agnostic hypervisor, allowing any kind of OS (with proper Xen support) to act as DomU/Dom0. This talk will cover how the Xen community works, together with an explanation of the ongoing work in FreeBSD to improve Xen support. This talk will cover the following points: a basic Xen description and specific Xen concepts; how the Xen community works (compared to BSD communities); a look into new Xen features (PVH); work being done in FreeBSD to improve Xen support; and probably a small demo to highlight Xen features.
Transcript: English (automatically generated)
I will start with a brief description of Xen. I know that some of you are probably familiar with Xen, but I guess others are not that familiar. So I will start with a brief description, followed by some notes about how the Xen community works. Then we'll peek into new Xen features.
That's mainly the new ARM support and PVH guests. And then we'll also continue with the recent work that has been done in FreeBSD. That's mainly the work done on FreeBSD 10 and the work that's ongoing on FreeBSD head. Then I will do a brief description
of the Xen tool stack that's used to manage the hypervisor. And then I will do a demo of FreeBSD Dom0. So the Xen architecture is quite simple. It's a type 1 hypervisor. We have the hardware on the bottom that contains the devices. And then we have a thin layer that's the Xen hypervisor. It's more or less like a microkernel.
It doesn't contain drivers for hardware devices. It only manages the CPU, the memory management unit, and the IOMMU. Everything else is passed to the control domain, what we usually call Dom0, which contains all the hardware drivers. This way, we can use hardware drivers from other operating systems, basically,
and we don't have to put them inside of Xen. In this case, we can see that the control domain has the hardware drivers. And then I have two other domains next to it. One is a para-virtualized domain, which we usually call a PV domain, that contains a netfront and a blockfront
connected to a netback and a blockback in the control domain, or Dom0. This is done so that the guest can have access to IO devices. And then on the other hand, we have a fully virtualized domain, which we usually call an HVM domain, that in this case is not using the para-virtualized devices. It's connected to a device model, that's
QEMU, which takes care of emulating the devices for the guest, so that the guest feels like it's running inside of a normal computer, basically. If you have questions during the slides, please feel free to ask at any point.
So para-virtualization was developed in the late 90s. It was a project done by the XenoServer research project at Cambridge University, together with Intel and Microsoft. They realized that some x86 instructions behave differently in kernel and user space.
So basically, you only had two options for virtualization: you either had to use full software emulation or binary translation. This is a problem because it adds a lot of overhead to virtualization. So they decided to make the guest aware of virtualization, and they provided a new interface to the guest operating system that
had a much lower overhead. The result of this is what we know today as para-virtualization. I would like to note that this was done in the late 90s; if it was done today, it would probably be quite different. They realized that there were a bunch of interfaces that had to be emulated,
which added a lot of overhead. So instead of doing this, they decided to para-virtualize them. They para-virtualized the disk and network interfaces. They para-virtualized the interrupts and timers. They also had to para-virtualize the page tables and the memory management unit, because there was no hardware virtualization at that time. And they also had to use para-virtualization
for privileged instructions. On the other hand, full virtualization was introduced to Xen when the hardware virtualization extensions arrived in CPUs, and it basically allows you to run any kind of guest. In order to provide the devices for the guest,
the emulated devices, we use QEMU, which is an open source project maintained upstream. This also allows us to make use of the nice hardware virtualization extensions, like EPT and the AMD equivalent of EPT, basically. And we also allow these kinds of guests
to make use of PV interfaces if they want to, in order to obtain better performance. Here I have a slide about the different virtualization modes that Xen supports. On the bottom, we can see PV, where we are using the PV interfaces for everything. We are using PV for disk and network.
Interrupts and timers are also para-virtualized. The emulated motherboard is also para-virtualized. And the page tables and the privileged instructions are also handled using para-virtualization. On the top, we can see HVM, which is a fully virtualized guest that's using hardware virtualization for the page tables and the privileged instructions, and software virtualization
for everything else. This leads to quite poor performance when doing IO. So we can install PV drivers inside of HVM guests in order to make them perform better when doing IO operations. This is very similar to virtio, and it allows us to run these guests with much better IO performance.
Then we also have PVHVM, which is another evolution of HVM, which allows you to use PV drivers for interrupts and timers, and we can remove quite a lot of the overhead there. The only remaining point is that even if you are using PVHVM, you still need to boot using an emulated BIOS.
So it means it's much slower, and it means we still require a QEMU instance in Dom0, basically. The Xen community was created in 2003, when Xen was presented at the Symposium on Operating Systems Principles, and it was released under the GPLv2.
Then in 2013, it became a Linux Foundation collaborative project, after being worked on for 10 years. The governance of the Xen project is very similar to the Linux kernel. We have several roles, like committers and maintainers. And we also have several sub-projects
that are inside of the Xen project itself. We have the Xen x86 and ARM hypervisors, which are the main projects and share the same code repository. Then we have XAPI, which is a more advanced open source tool stack that is used on XenServer. We also have MirageOS, which is a unikernel that allows you to run applications inside of Xen
as virtual machines. And we also have PVOps, which is a project that started, I think, more or less four years ago, and was focused on bringing Xen support into upstream Linux kernels. As of Linux 3.0, all the Xen support for either DomU or Dom0
is inside of the Linux kernel, so this project is more or less not doing much anymore, basically. As I said, the roles in the Xen project are very similar to the Linux kernel. We have maintainers, who own one or more components inside of the Xen source tree.
And then we have committers, which are maintainers that are allowed to commit changes into the Xen code repository. Then we also have the sub-projects and teams, which are run by individuals or organizations and are either based on the Xen project or related to it. You can find more information about the governance of the Xen project on the web page at the end of the slide.
So the Xen hypervisor is the main project. As I said before, it contains support for running on either x86 or ARM, and it is led by five committers. Two are from Citrix, one is from SUSE, and the other two are independent committers.
During the 4.4 release cycle, which produced the last released version of Xen, we had commits from 81 individuals from 28 organizations. We have a lot of organizations that contribute to Xen,
but the main ones, to name a few, would be Citrix, SUSE, Oracle, Intel, or Amazon, which is using Xen in its cloud. You can also find much more information about all this in the wiki. In the Xen wiki, we have pages that list the different individuals and organizations that have contributed to Xen releases.
Any questions? So the recent Xen changes mainly include support for running on ARM boards
with virtualization extensions. We also added a new mode called PVH, which we'll see in more detail in the following slides. And as usual, we had a bunch of bug fixes and improvements across all components. Xen on ARM started in 2011, and it was focused on bringing Xen to ARM boards
with virtualization extensions. There was a project before that which focused on making Xen run on normal ARM boards without virtualization extensions, but it was never merged back into the Xen source tree, so it was abandoned at some point. If you want to run Xen on ARM, you need a board with virtualization extensions.
That should be either a Cortex-A7 or a Cortex-A15. And the recommended release is the latest release, which is Xen 4.4. We also have support for 64-bit ARM chips, but I think there are not many out there in the market, so it's actually quite hard to find one of them.
I don't work on the Xen on ARM project, so I don't know much about its internals, but you can find much more info on the web page at the bottom of the slide. And if you want, you can also ask questions on xen-devel or the other Xen mailing lists. Then in the 4.4 release, we added a new virtualization
mode that's called PVH. It's very similar to PVHVM; it's mainly a PV guest inside of an HVM container. We believe that this mode will bring the best performance, because it takes the best parts from both HVM and PV guests. Basically, you don't need any kind of emulated devices,
and the guest has a hardware-virtualized MMU, which means that you don't need to use the PV interface for managing page tables. Also, the guest has access to the same protection levels as you would find on bare metal. It was originally written by Mukesh Rathor at Oracle,
but significant revisions were done by George Dunlap in order to upstream it into the Xen source tree. So now I've extended the virtualization spectrum, and I've added PVH. As you can see, it's another evolution of PVHVM, basically. We are using the PV interfaces
for the emulated motherboard, which means that we are using PV when we boot. We don't need any kind of emulated BIOS or anything like that. So a PVH guest runs inside of an HVM container. It doesn't require the PV MMU code, which is very intrusive.
The Linux kernel maintainers always complain about the PV MMU stuff because it's very, very intrusive. And it also allows the kernel to run at normal privilege levels, because we are using the virtualization extensions of the CPU. So everything related to the CPU and the memory management unit is virtualized by hardware.
We also disable all the emulated devices: the ones that are emulated by QEMU, and the ones that are emulated inside of Xen, like the local APIC or the IO APIC, are disabled. So the guest doesn't have access to any of them and has to use the PV path. We also have to use the PV path for vCPU bringup. This means that when we want to bring up secondary processors,
we don't take the normal path, what you would do on bare metal, basically. And it's much simpler, actually, because the way secondary CPUs are brought up on Xen is much simpler.
You just have to use a hypercall. So it's actually quite a bit better than doing it the bare metal way, basically. We also have to use PV hypercalls for a bunch of stuff, like fetching the memory map from Xen: we have no emulated BIOS, so we have to fetch the memory map from Xen itself. And then it uses the PVHVM callback mechanism
in order to get interrupts from Xen. We'll see more about this mechanism in later slides. So the difference with PV is that on PVH, the page tables are in full control of the guest. You can do whatever you want with the page tables, and you don't have to use any kind of PV operations to manage them.
The interrupt descriptor table is also in full control of the guest, because it's virtualized by hardware. And you no longer have the difference between a PFN and an MFN. Previously, on PV guests, there was a difference between memory addresses, because you were actually seeing the real memory addresses behind them. So we had some kind of three-level translation
inside of the guest, which was quite intrusive. We are also able to use the native syscall and sysenter instructions. We have no Xen-specific callbacks, like the event or failsafe callbacks. And the guest is able to manipulate the IO privilege levels.
The difference with PVHVM is that, since we no longer have a BIOS, we need to add ELF notes to the FreeBSD kernel. This is done so that Xen knows where to load the kernel and how to jump into it. We also boot with paging enabled,
which means that it's much faster than booting through a BIOS. There are also slight differences in the way we set up the grant table and XenStore, but those are very minimal changes. And finally, we have no emulated devices, so we are not able to use the local APIC or any kind of HPET timer or anything like that.
Now I would like to speak a little bit about the Xen support in FreeBSD 9. On FreeBSD 9, you could run FreeBSD as a PV guest on 32-bit, but you were limited to only one vCPU;
we didn't have support for SMP. Or you could also run on Xen as an HVM guest with PV drivers, for both 32 and 64 bits. This means that we already had the XenStore and grant table implementations. We already had event channel support inside of FreeBSD. We had support for PV disk and PV network interfaces.
And we had also implemented the suspend and resume code inside of FreeBSD, in order to migrate guests between different hosts. Then in FreeBSD 10, we added PVHVM support, which means that we had to add support for the vector callback.
As I said, we'll see in the following slide exactly what the vector callback means. We unified the event channel code with the PV port, which means that we had less code duplication inside of FreeBSD. And we also implemented the PV timers and PV IPIs. This was done in order to reduce the overhead of emulation.
And finally, we had to implement the PV suspend and resume protocol, in order to re-enable the PV timers and the PV IPIs. Here I have a description, or what tries to be a description, of what the vector callback is. On previous FreeBSD versions,
we were using a PCI interrupt in order to get events from Xen. This means that the interrupt was global to the operating system. We could not inject interrupts into specific CPUs; we were just injecting a PCI interrupt that could be delivered to any CPU. In fact, Xen would inject an interrupt that was received by the Xen PCI driver.
Then this driver would deliver the interrupt to the event channel upcall, and finally this interrupt would be delivered to the PV disk or PV NIC interfaces. In FreeBSD 10, we added support for the vector callback, which means that we have an IDT vector that allows us to deliver interrupts
to different vCPUs. Then, when Xen injects an interrupt, this interrupt is delivered to the event channel upcall code directly, and we can route it to either PV disks or PV NICs, as on previous FreeBSD versions, but we can also use it to inject interrupts for PV timers and PV IPIs.
So the PV timer is implemented using PV hypercalls. We implement an event timer using the set singleshot timer hypercall, which allows us to get interrupts from Xen at specific intervals.
We also provide a time counter using the information that's shared by Xen in the vCPU time info structure. And we also provide a wall clock using the same information that's used by the time counter. This allows us to fulfill all FreeBSD timer needs using PV interfaces, so we don't need any emulated timer devices.
The PV IPIs are interesting because usually, on bare metal, we use the local APIC to inject and receive IPIs from other CPUs. Since we can now inject event channels into specific vCPUs, we no longer need the local APIC. We can use the event channels in order to inject IPIs into vCPUs,
and that removes the overhead of using the local APIC. This means that we take fewer VM exits, basically, because we don't have to read or write the local APIC. In order to allow PVHVM guests to do suspend and resume, so we can do migration,
we need to rebind all the IPI event channels. On resume, we also need to rebind all the VIRQ event channels used for the PV timers. And we finally need to re-initialize the timer on each vCPU. This is done automatically by the suspend and resume code. And finally, we need to reconnect the frontends, the PV disks and the PV NICs,
as we were already doing on FreeBSD 9; this has not changed. So we are also working on head right now in order to import the PVH DomU and Dom0 support. Yes, sorry, yeah.
If you use the local APIC for the timer, yes, then you have to take VM exits, because you are writing to and reading from the local APIC. If you are using HPET or something like that, then it's also emulated, so you need to take VM exits, basically. I mean, when we use hypercalls, we also take VM exits, because we have to call back into Xen,
but we only do one of them. If you read from and write to the local APIC, you actually do more VM exits, basically. So we are now working on getting PVH support into FreeBSD head,
both Dom0 and DomU. Since PVH is based on PVHVM, and it's quite similar, the work required in order to have PVH is not that much. What we needed to do for DomU was to add the PV entry point into the kernel. This is a different entry point than the native one,
and we have to wire this entry point into the FreeBSD boot sequence. This was actually quite easy: since we have a hardware-virtualized memory management unit, we don't need to do any kind of PV-specific page table setup or anything like that. We also have to fetch the memory map from Xen. As I said before, we have no emulated BIOS,
so the memory map is provided by Xen using a hypercall. We also have to use the PV console, because we have no emulated serial port. And we have to get rid of the usage of any emulated devices in FreeBSD early boot. For example, FreeBSD was using the LAPIC timer
very early on boot, before searching for other timers; it was used unconditionally. So we had to remove that and instead use the PV timer. The timer used was the i8254, I think,
which is always used on early boot. So we had to remove that and use the PV timer, basically. Then we also have to use the PV path in order to bring up secondary CPUs, which is quite easy;
it's much easier than doing it on bare metal, from my point of view. And finally, it's also important to note that all the hardware description comes from XenStore, because on PVH we don't pass any kind of ACPI tables into the guests. So all the hardware description comes from XenStore.
PVH Dom0 builds on top of PVH DomU, basically, and it's very similar to PVH DomU. The main difference is that on Dom0, we actually have access to physical hardware. So FreeBSD has to parse the ACPI tables
and notify Xen about the devices it finds. This is done because Xen has no ACPI parser, basically: it's a very big piece of code, it's very intrusive, and we don't want to put it in Xen. So we did this kind of hack. Well, it's been done since PV: the guest is in charge of parsing the ACPI tables and notifying Xen
about the devices it finds. And basically, we also had to add a couple of user space devices in order to interact with Xen. That's done so the tool stack can interact with Xen, and it can create guests and things like that. This is an architectural overview of the resulting work
that has been going on in head. Basically, we added a new Xen-specific nexus that depends on the event channel implementation, and then we put all the PV devices inside of what we call the Xen PV bus. This is done so we can have a more contained way of attaching the PV devices. They are no longer attached directly to the nexus;
they have their own bus. The most interesting devices here are probably XenStore, which takes care of getting the hardware description from Xen. We also have the grant table, which is used to share memory with Dom0 or any other domain in the host. We also have the PV timer and the PV console. And we can see that from XenStore,
we have several devices that hang off it. We have the PV disk, the PV NIC, and one we call the control interface. The control interface is used by Dom0 to send signals to the guest, for example, shutdown. You write something to XenStore, and then the guest, using the control interface, realizes
that Xen is actually asking the guest to shut down, so we shut down the guest, basically. And then we have the two user space devices used to manage Xen. One is privcmd, which is used by the tool stack to issue hypercalls into Xen and to map memory from foreign domains. And we also have the event channel device, which is used to register and receive interrupts
from Xen by user space applications. I think I've explained both devices, so I will skip this slide. Now I would like to explain a little bit more
about the tool stack. In Xen, we used to have two different tool stacks: one was called xend and the other one was called xl. xend has been deprecated for several releases and has finally been removed from what's now the current Xen repository, which basically means that the next Xen release
will ship without xend. So now xl is the only tool stack inside of Xen, and it's built on top of libxl, usually called libxenlight, which is a library to interact with the hypervisor. The main feature of libxl is that it provides a stable API,
so you can actually use it in your applications and you don't have to care what the underlying version of libxl is, because the API is stable. It's also coded in C, which is quite different from xend, which was coded in Python and added a lot of overhead to the tool stack.
Now it's much smaller and it consumes much less memory. I would also like to note that we have a libvirt driver built on top of libxl, so you can actually use all the nice features from libvirt, like virsh. We can also integrate with OpenStack or CloudStack
using this libvirt driver. xl is the default tool stack. It's a small command line utility; it's very simple, in fact. Configurations for VMs are stored as plain text files. We'll see an example of one configuration later.
xl provides a set of commands to manage the hypervisor, but it's not going to do anything fancy for you: it's not going to set up your network or your storage. It's very simple. So if you are looking for something more advanced that is able to take a snapshot or move your storage around or things like that, you will have to look at a higher level tool stack like libvirt, CloudStack, or OpenStack.
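To give a flavor of the command set, the basic lifecycle operations look like this (these are standard xl subcommands; the guest name and config file are illustrative):

```
xl create test.cfg      # start a guest from a plain text config file
xl list                 # list running domains
xl console test         # attach to the guest's console
xl shutdown test        # ask the guest to shut down via the control interface
xl destroy test         # hard power-off of the guest
xl top                  # live per-domain CPU and memory usage
```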
Here I have an example configuration file for xl. In this case, this is a Linux PV domain. We can see that we are giving Xen the kernel and the ramdisk that it has to use. We are also setting the command line arguments that are going to be appended to the kernel;
basically, we are telling Linux which is the root partition. We are also assigning four vCPUs to the domain and two gigabytes of memory. The name of the domain is going to be test. Then we are attaching one PV network interface that's going to be attached to bridge zero,
and we are also attaching one disk to the guest; in this case it's backed by a file, basically. I think I will go to the conclusions and do the demo after, so we can take questions while we are doing the demo.
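A guest configuration along the lines of the one just described might look roughly like this (the kernel paths, image path, and bridge name are illustrative, not the exact values from the slide):

```
# Linux PV guest: Xen boots this kernel/ramdisk directly
kernel  = "/boot/vmlinuz"
ramdisk = "/boot/initrd.img"
extra   = "root=/dev/xvda1"    # command line appended to the kernel
vcpus   = 4
memory  = 2048                 # in megabytes
name    = "test"
vif     = [ "bridge=bridge0" ]            # one PV NIC on bridge zero
disk    = [ "/images/test.img,raw,xvda,rw" ]  # one disk backed by a file
```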
There are still some items pending inside of FreeBSD in order to get good Xen support. Mainly, we are missing multiboot support in the FreeBSD loader. This is needed in order to boot Xen and a FreeBSD Dom0, so we'll have to look into adding multiboot support. We also have to improve the performance and compatibility
of netfront, which is the PV NIC frontend, and we also probably need to look into improving the performance of netback. And we are also missing some user space devices to interact with Xen. These are actually not that important and are only used by some functions; you can run a Dom0 without them.
We are missing what we call gntdev, which allows user space applications to map grant frames, and we are also missing gntalloc, which allows user space applications to share their memory with other domains. The patches have been published and are available on my git repository.
You can find them there. There are also some patches for the tools, the Xen tools; I think half of them have already been committed to Xen upstream, so there's not much work remaining there. I also have a couple of patches for QEMU and one patch for SeaBIOS that has already been committed.
And I think one of the QEMU patches has actually been committed to the repository as well. So we've seen how FreeBSD Xen support has evolved from HVM to PVHVM and finally to PVH. FreeBSD Dom0 support is quite close; we are getting there.
And this means that you can actually use FreeBSD as a full-featured virtualization solution using Xen. Now we will go to the demo; feel free to ask questions as we go. Here I have my console; it's actually very small. Right now we are sitting at the pxelinux bootloader.
I'm going to launch Xen and the FreeBSD Dom0. Now it's loading the FreeBSD kernel. This is Xen booting.
And now we jump into the FreeBSD kernel. Now the FreeBSD kernel is scrubbing memory. And now it's booting. Any questions? Yeah.
Multiboot is a specification for how you boot kernels, so we have to actually implement that in the FreeBSD bootloader. This is needed so that you can boot Xen and then pass Xen the FreeBSD kernel and the FreeBSD
metadata. We can do that with pxelinux, but it's kind of ugly; I mean, we need to add proper support to the FreeBSD bootloader. Oh, now it's booted. I'm going to connect;
you can see that it's a FreeBSD host. Now I'm using xl to list the domains. Sorry, I'm going to make this a little bit bigger, actually.
Yeah. So now we can see that Xen only has one domain, which is the Dom0, which has two gigabytes of RAM. This box actually has eight gigabytes of RAM, but I've only assigned two gigabytes to the Dom0. It also has only two vCPUs assigned, and I think this box has eight CPUs.
So now we'll launch a Debian PV DomU. This is what we call pygrub, which basically searches the guest image for kernels,
and now it's booting the... oh, sorry. I have to initialize the bridge.
And this is a Linux PV DomU booting. Now we have the domain available. I'm going to, for example, start a kernel build inside of this domain.
Now I'm going to go back to the Dom0; I'm basically detaching from the DomU console. Now I'm again in the Dom0. If we look at the list of domains, we can see that we have the Dom0 running and we also have the Debian domain running. If I use, for example, xl top,
we'll be able to see the CPU usage of the two different domains. We can see that the Debian domain, of course, is running a kernel build, so it's actually using much more CPU than the Dom0. Then I can also create a FreeBSD DomU.
In this case, this is a PVHVM DomU. I have configured the serial console, so we'll actually see the output of the FreeBSD boot loader here.
I don't know if you want to ask anything. I mean, it's not that interesting to see VMs booting.
I mean, it's not that new anymore. Maybe 10 years ago it was interesting, but right now, no. The problem with virtio is that it's designed
with the premise that the host is always able to access guest memory. That's not true on Xen, because on Xen we don't pass memory around; we pass what we call grant table entries, which means that it's actually a little bit hard
to get virtio working. We could get it working at some point, but we would have to change the virtio specification in order to allow using grant references instead of memory addresses directly. On Xen, the main thing is that the Dom0 is just another guest. That's different from KVM or bhyve,
where the host actually has access to all the memory.
The best mode that Xen can provide depends on the Xen version that you are running. If you are running a Xen prior to 4.0, you will get HVM with PV drivers, because that's the only thing available. If you are running anything newer than Xen 4.0, you will get PVHVM by default. It's all merged inside of the GENERIC kernel,
so you don't have to compile a specific kernel or anything like that. I'm going to set this VM to also do a kernel build. Now I'm going to detach from that VM.
And again, we can see that the two VMs are actually much busier than the Dom0. The Dom0 is mainly only doing I/O on behalf of the VMs. It's using blkback; well, in this case only blkback, because we are not using the network here.
And then, for example, I can shut down both guests. The Debian VM has already shut down. Now we are waiting for the FreeBSD one.
And now both VMs are gone; we can see on the list that there's no longer a VM. Finally, I will launch an OpenBSD VM. This one doesn't have PV drivers and I haven't configured the serial console, so we are going to attach to it using a VNC connection.
And we can see here OpenBSD booting inside of an HVM container, basically. This is using all the emulated devices from QEMU. Yeah? Can you explain what you mean by software virtualization versus hardware virtualization?
Hardware virtualization is provided by the CPU; it's using the Intel VMX extensions and things like that. Software virtualization is done by QEMU or by Xen, basically.
And now we have the OpenBSD VM booted. We can ping Google, for example. So, well, if there are no more questions, I'm going to destroy the VM.
So, yeah.
It depends on the memory you have. The only restriction on Xen is that you cannot page out memory inside of Xen. So if you have eight gigabytes of memory and want to create one-gigabyte guests, you will probably only be able to create six or seven of them, depending on the memory that you assign to the Dom0.
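As a rough sketch of that arithmetic (it ignores per-guest hypervisor overhead, so treat the result as an upper bound):

```python
# Back-of-the-envelope guest capacity: Xen cannot page out guest
# memory, so physical RAM is the hard limit. Per-guest hypervisor
# overhead is ignored here, so the real count may be slightly lower.
host_mem = 8 * 1024    # MiB of RAM in the box
dom0_mem = 2 * 1024    # MiB assigned to Dom0
guest_mem = 1024       # MiB per guest

usable = host_mem - dom0_mem
print(usable // guest_mem)  # 6
```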
Beyond memory, there's no hard limit or anything like that. So if you have very small VMs, you can run a lot of them; I think that recently they pushed the limit quite high, and someone was able to get 2,000 VMs running on a single host. This was a Linux host, of course, but it probably applies to FreeBSD also.
Well, they were very, very small VMs with, I think, 32 megabytes of RAM only. Thank you for your attention then.
Thanks.