
Save & Restore for bhyve


Formal Metadata

Title
Save & Restore for bhyve
Alternative Title
Save-Restore feature for bhyve x86_64: Current status of save-restore feature for bhyve x86_64
Series Title
Number of Parts
31
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt, and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
bhyve is the current FreeBSD hypervisor, but it lacks a save/restore feature for a running virtual machine. The aim of this project is to add a save/restore feature in order to increase bhyve's usability in production environments. Saving and restoring a running virtual machine on bhyve is a very complex operation, because you must ensure consistency by correlating the state of the guest with the internal data about the guest that is held in the hypervisor. In order to create a minimal feature to save the state of a virtual machine running on a ramdisk, you must do the following: save the virtual machine memory to disk (dump all of its memory to disk); save the virtual machine's internal state from the hypervisor (internal state from userspace and from kernelspace); save the state of the devices. In order to restore the state of a virtual machine: create an empty virtual machine; restore its memory; restore the internal structures from the saved files. Of the steps presented above, one of the most tedious operations was saving the internal state of the hypervisor to a file. There were many structures holding pointers to each other that needed to be reinitialized, and this induced a lot of bugs. Another issue with saving the state of the machine was completely freezing it, to be sure one gets a consistent state. To improve the user experience while saving the whole memory (imagine a VM with 32 GB of RAM), we used the copy-on-write mechanism: we freeze the VM, mark all the pages copy-on-write, and unfreeze it. If the VM then writes something, a copy of the page is created, leaving the initial state in place. This is currently a work-in-progress project. We are at the step of restoring the state of the virtual machine, having trouble with the internal hypervisor registers specific to a virtual machine. The project was also sponsored by Matthew Grooms (he paid two students to work with me on this project).
Transcript: English (automatically generated)
Hello everybody, my name is Mihai Carabas and today I will present to you the save and restore feature for bhyve: basically, saving a virtual machine's state to disk and then restoring it at a point in time. This work was done together with my master's students, Mihai Tinoši and Flavius Anton. First of all, something about us. We are all from the University Politehnica of Bucharest. I was a PhD student; two weeks ago my diploma arrived, so right now I have my PhD. Currently I'm a teaching assistant, and starting from October I'll be an associate professor in the operating systems field: system architecture and also system administration and networks. I've been implementing and coordinating FreeBSD projects for four years now: I started with DragonFly BSD four years ago, then moved to bhyve for FreeBSD, and since then I've been working on bhyve. Mihai Tinoši and Flavius Anton are two of my master's students; they are now finishing their master's, and they worked on bhyve as their diploma project, as their master's thesis project, sorry. They worked about eight months on this, starting from last summer, through the autumn and winter. They are also teaching assistants in operating systems; they helped me with the homework and teaching classes. In the University Politehnica of Bucharest we promote, I promote, bhyve projects to diploma and master's students.
I have a lot of students who want to do general programming, and this is very suitable for them. A lot of work has been done in the university until now: a lot of drivers, starting with the instruction caching feature from a Google Summer of Code project, and the emulation of different devices, the NE2000 network card and an ATA disk controller. These two aren't merged into current yet; Peter is working on this, he wants to create an abstraction for the devices and then integrate them. If you're asking why these two old drivers: it's because bhyve wants to support very old guests, like very old Windows versions or very old FreeBSD releases, which don't have support for new network cards or disk controllers. Another interesting project was the emulation of an HD Audio device, which is also in Peter's hands right now; it's functioning, but it has to be merged into master. And the last two projects I worked intensively on: one is porting bhyve to ARM, basically running FreeBSD guests on top of bhyve. I presented this project at
AsiaBSDCon; from then until now, we managed to fully run a guest and get its console on a QB board device. So we have a host, a FreeBSD host, and on top of it we have a FreeBSD guest that is able to run and get the console.
This is under review, to merge all the patches upstream. And the last project, the one I will talk about today, is the bhyve save/restore mechanism. Before starting, special thanks to Peter Grehan: he helped us with all the design decisions. We had a lot of blockers at the beginning, a lot of throwaway code, and he helped us integrate the save/restore feature better into bhyve and into FreeBSD. And also thanks to Matthew Grooms: through his sponsorship, the students enjoyed scholarships during their master's projects. Okay, let's start with a brief description of bhyve and FreeBSD, and then present the technical implementation of the save/restore feature. As you all know, bhyve is the
FreeBSD hypervisor. It depends on the hardware virtualization extensions, Intel VT-x and AMD-V. This is very important, because when doing the save/restore steps you have to save all the internal structures: so, save the Intel VMX internal fields and also the AMD-V ones. For now we only save the Intel VT-x state; we left the path open for AMD, but it's not implemented. bhyve also uses nested page tables, which is again very important when saving the guest memory. So these were two facts that were taken into consideration when implementing this.
And it can run various guests; until now we have only tested the save/restore feature I'm talking about with FreeBSD guests. Okay. Everyone is talking about bhyve live migration, but as a prerequisite we have to support checkpoints. What is live migration? Basically: checkpoint the VM on the source host, migrate the memory to the destination, and start the virtual machine on the destination host. So checkpointing is a prerequisite for live migration. For adding checkpoints to bhyve, we have to save the VM, the virtual machine state, to persistent storage: basically the memory and its internal state, the hypervisor state. And when restoring, we create a new virtual machine and initialize all the structures and the memory with the ones we saved on disk. These are the big steps. Let's see what the bhyve components are.
In the FreeBSD kernel we have the vmm.ko kernel module, which exposes an interface for each virtual machine: /dev/vmm/ plus the name of the virtual machine. Using this device, bhyveload basically loads the kernel image into memory, and with bhyvectl you can set or get various fields of the state of the virtual machine. Further, in order to run the virtual machine, you have the bhyve executable, which does ioctls to the /dev/vmm device, basically telling the FreeBSD kernel to run the virtual machine. Whenever there are exits that can't be handled by the FreeBSD kernel, they are sent up to bhyve to handle, like device emulation. Basically, the whole device emulation is done in bhyve userspace, apart from some small parts which are very critical and are done in the kernel.
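As an illustration of that interface: userspace reaches the per-VM device through FreeBSD's libvmmapi, which wraps the /dev/vmm ioctls. Below is a minimal sketch of reading one guest register; the VM name "testvm" is an assumption, and the exact libvmmapi calls vary between FreeBSD versions.

    #include <sys/types.h>
    #include <machine/vmm.h>
    #include <vmmapi.h>
    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* Attach to an already-created VM, i.e. /dev/vmm/testvm. */
        struct vmctx *ctx = vm_open("testvm");
        if (ctx == NULL) {
            perror("vm_open");
            return (1);
        }
        uint64_t rip;
        /* Under the hood this is an ioctl on the /dev/vmm node. */
        if (vm_get_register(ctx, 0 /* vcpu */, VM_REG_GUEST_RIP, &rip) == 0)
            printf("vcpu0 rip = %#lx\n", (unsigned long)rip);
        return (0);
    }

Built with cc -lvmmapi and run as root against an existing VM, this prints the instruction pointer of vCPU 0.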
Okay, let's see what steps we have taken to create a checkpoint. So we have the states we presented earlier. Further, we create a new device called /dev/vmm/<vmname>_memory, where <vmname> is the name of the virtual machine. Then we add a new flag to bhyvectl, named checkpoint, taking the name of the virtual machine. bhyvectl talks with the bhyve process via a UNIX socket.
Here we had a problem in the last month, when Peter integrated the Capsicum framework, which blocks the opening of new sockets and so on. So right now Capsicum is turned off in our branch, okay? Peter said that he would solve that when integrating this feature into master. Then we send an ioctl to freeze the whole virtual machine, basically freeze the vCPUs. Then we memory map this device.
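The bhyvectl-to-bhyve channel is a plain UNIX-domain socket. Conceptually the client side looks like the sketch below; the socket path and the message format are illustrative stand-ins, not the project's actual protocol.

    #include <sys/socket.h>
    #include <sys/un.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        int s = socket(AF_UNIX, SOCK_STREAM, 0);
        if (s == -1) {
            perror("socket");
            return (1);
        }
        struct sockaddr_un sun;
        memset(&sun, 0, sizeof(sun));
        sun.sun_family = AF_UNIX;
        /* Hypothetical per-VM control socket created by the bhyve process. */
        strlcpy(sun.sun_path, "/var/run/bhyve.testvm.sock", sizeof(sun.sun_path));
        if (connect(s, (struct sockaddr *)&sun, sizeof(sun)) == -1) {
            perror("connect");
            return (1);
        }
        /* Illustrative message: ask the VM process to checkpoint itself. */
        const char *msg = "checkpoint /vms/testvm.ckp";
        write(s, msg, strlen(msg));
        close(s);
        return (0);
    }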
And this is the key feature, the copy-on-write part. We created a new device in order to implement a new memory mapping function, one that maps the same physical memory as copy-on-write. So at this step we have two different mappings of the same physical memory, okay? And the second one is a copy-on-write mapping. What does this mean? It means that if from now on we start the guest and the guest writes something to memory, a new physical chunk is created and the write lands there, while this view of the memory remains constant. Basically, at this point we can let the guest run and save its memory, okay? Think about the checkpoint: if we have 60 gigabytes of RAM in a guest, do the math on how long it takes to save 60 gigabytes of RAM to disk. Probably a lot of seconds, 10, 20. We cannot stop a guest for such a long period of time, because it would lose all its network connections and so on. With this mechanism, we basically create a view of its memory at a point in time, let the guest run, and start saving the memory to disk, okay? So at this point we have access to the memory of the guest.
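A userspace analogue of that frozen view is a private file mapping, sketched below; /dev/vmm/testvm_memory stands in for the new device described above. POSIX actually leaves it unspecified whether later changes to the underlying object show through a MAP_PRIVATE mapping, which is one reason the real mechanism implements its own copy-on-write mmap handler inside the kernel.

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t len = 1UL << 30;     /* pretend the guest has 1 GB of RAM */
        /* Hypothetical device exposing the guest's physical memory. */
        int fd = open("/dev/vmm/testvm_memory", O_RDONLY);
        if (fd == -1) {
            perror("open");
            return (1);
        }
        /*
         * MAP_PRIVATE asks for copy-on-write semantics: this mapping is
         * the snapshot view, while the guest keeps running against the
         * original pages through its own (shared) mapping.
         */
        void *snap = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (snap == MAP_FAILED) {
            perror("mmap");
            return (1);
        }
        /* ...guest resumes here; stream `snap` to disk at leisure... */
        munmap(snap, len);
        close(fd);
        return (0);
    }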
Further, we need to save the internal state of the hypervisor, of bhyve. There are three principal structures in bhyve: struct vmx, which is Intel-specific, okay, it has a lot of Intel registers and capabilities; struct vm, which is bhyve-specific and has information about the virtual machine; and struct vlapic, which is the virtual local advanced programmable interrupt controller. We have to save different states from all of these, and they are basically obtained from the kernel using an ioctl call.
At this point we have, inside the bhyve process, a buffer structure which holds all the internal state of the hypervisor, the view of the guest memory, okay, and the virtual machine. These are still in bhyve's memory; they aren't saved to disk yet. Now we start saving them to disk. The VM memory is only a copy from memory to a file; it's a very simple operation. Again, it's the copy-on-write view of the guest taken at the point where we made the checkpoint that is copied, not the current one, which may differ. We also save the kernel binary file; we need this to restore the guest later, because bhyveload needs the kernel binary. And the hardest part was to save all the structures to disk in a portable format. We use the libxo library, and we save them in a JSON metadata file: basically, here we have all the structures serialized. And now all the state of the virtual machine is on disk, okay? Memory and internal structures; no devices at this point.
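libxo is FreeBSD's structured-output library, so the JSON emission is mostly declarative. A tiny sketch of serializing two register fields follows; the field names and values are made up for illustration, not the project's actual schema.

    #include <stdint.h>
    #include <libxo/xo.h>

    int
    main(void)
    {
        uint64_t rip = 0xffffffff80200000UL;    /* made-up register values */
        uint64_t cr3 = 0x0000000003a54000UL;
        xo_set_style(NULL, XO_STYLE_JSON);      /* force JSON output */
        xo_open_container("vcpu0");             /* one container per vCPU */
        xo_emit("{:rip/%#jx}\n", (uintmax_t)rip);
        xo_emit("{:cr3/%#jx}\n", (uintmax_t)cr3);
        xo_close_container("vcpu0");
        xo_finish();                            /* flush the closing braces */
        return (0);
    }

Built with cc -lxo, this emits a small JSON document with one container per vCPU.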
Okay, now let's see how we can restore a virtual machine, having this information in place. We boot a new, empty virtual machine with bhyveload. Then bhyve, the process that runs the virtual machine, reads all the metadata from disk; it only reads it into memory. Then we let bhyve do all the virtual machine initialization, okay? There are a lot of steps that need to be done, and we don't play with them; we just let it do the initialization. And after that, we just replace some of the structures with the information we saved, not all of them. Okay, let's see this in a graphical way. In user space, in bhyve, we have the vm restore function, which does an ioctl, an ioctl to the /dev/vmm device.
That calls the vmm restore function in the kernel, which I'll describe later. The vm restore then does the memory restore. You see that we do the memory restore from user space, because only user space has access to the files that are on disk; the kernel doesn't. Okay, then we call the PCI restore; the PCI restore covers the virtio devices, the APIC, and some other internal structures. And after all these functions have executed, we basically start the vCPU threads,
and the virtual machine starts running. What does the vmm restore do in the kernel? It calls the VMX restore, which restores the Intel-specific registers, okay? For AMD we would have an AMD-V restore; we still have to implement that. Then the vioapic restore, the vlapic, and the vhpet: all these are for the interrupt controllers, the basic I/O devices, and the timer. You can see that for each structure we have multiple calls; these calls are one per vCPU. All the restore routines need to be executed for each vCPU. For example, here we had a virtual machine with four vCPUs: we see the VMCS restore four times, and the LAPIC restore four times again.
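Schematically, the fan-out he describes is a per-vCPU loop; the function names below are stand-ins for the real routines in vmm.ko, with printfs in place of the actual work.

    #include <stdio.h>

    /* Stand-ins for the real per-vCPU restore routines in vmm.ko. */
    static void vmcs_restore(int vcpu)   { printf("VMCS restore, vcpu %d\n", vcpu); }
    static void vlapic_restore(int vcpu) { printf("LAPIC restore, vcpu %d\n", vcpu); }

    static void
    vmm_restore(int ncpus)
    {
        /* Per-vCPU state is restored once per virtual CPU... */
        for (int vcpu = 0; vcpu < ncpus; vcpu++) {
            vmcs_restore(vcpu);     /* Intel-specific guest registers */
            vlapic_restore(vcpu);   /* local APIC state */
        }
        /* ...while machine-wide devices are restored exactly once. */
        printf("vioapic + vhpet restore\n");
    }

    int
    main(void)
    {
        vmm_restore(4);     /* the four-vCPU example from the talk */
        return (0);
    }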
Okay, restoring devices. Until now I haven't talked about devices. We want to do mainly VirtIO, because VirtIO is the de facto standard of device virtualization; paravirtualization, actually. And only with VirtIO can you support virtual machine migration, okay? If you have an emulated device, even on KVM, you cannot migrate the virtual machine. Also on Hyper-V, if you have the emulated devices, not the synthetic ones, you can't live migrate. You have to have paravirtualized VirtIO devices in order to be able to migrate, and this is why we concentrate only on VirtIO, okay? We now have the VirtIO network interface working: we save its state and restore its state; I'll show you later. And we have work in progress for the VirtIO disk. Basically, we save its internal structures, but there's a problem in saving the actual data, because the disk is very large, 100 gigabytes, 200, a terabyte, okay? When doing a save, we save the state and then copy the file, duplicate the file, and it's very time consuming to copy such a big file. An alternative would be to use a ZFS backend, okay? Having ZFS under the virtual machine disk, you issue a ZFS checkpoint and have the checkpoint in place at runtime very fast.
So I won't describe saving and restoring the VirtIO devices; it's a subject for a future presentation, and I won't concentrate on it right now. What problems did we have? We had problems with restoring the VMCS. So, the VMCS is a structure containing host and guest state on Intel, okay? The problem is that the same structure has host and guest registers mixed together. At the beginning we were just dumping the old VMCS and then restoring it wholesale, and the host crashed. This is because we need to save and restore only the guest-specific registers, not the host ones. And we have to go to each field and read and write them individually, like this.
So, VMPTRLD: it's an instruction that sets the current VMCS. Then you can read the current VMCS with VMREAD. So basically, we set the old VMCS as current, read the instruction pointer, then set the new virtual machine's VMCS as current, and write the instruction pointer with VMWRITE. And this we have to do for all the registers of the virtual machine.
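Spelled out, the sequence is: VMPTRLD the old VMCS, VMREAD the guest field, VMPTRLD the new VMCS, VMWRITE the value back. The sketch below uses in-memory stand-ins for the three instructions, since the real ones only execute in VMX root mode; VMCS_GUEST_RIP is the architectural field encoding.

    #include <stdint.h>
    #include <stdio.h>

    #define VMCS_GUEST_RIP 0x681E   /* architectural VMCS field encoding */

    /* In-memory stand-ins for VMPTRLD/VMREAD/VMWRITE. */
    static uint64_t vmcs[2][0x10000];
    static int cur;
    static void vmptrld(int which)                  { cur = which; }
    static uint64_t vmread(uint32_t field)          { return (vmcs[cur][field]); }
    static void vmwrite(uint32_t field, uint64_t v) { vmcs[cur][field] = v; }

    int
    main(void)
    {
        vmcs[0][VMCS_GUEST_RIP] = 0xffffffff80200000UL; /* saved guest %rip */
        vmptrld(0);                             /* make the old VMCS current */
        uint64_t rip = vmread(VMCS_GUEST_RIP);  /* read the guest field */
        vmptrld(1);                             /* switch to the new VM's VMCS */
        vmwrite(VMCS_GUEST_RIP, rip);           /* write it back */
        /* ...repeated for every guest-state field, never the host ones. */
        printf("guest rip carried over: %#lx\n",
            (unsigned long)vmread(VMCS_GUEST_RIP));
        return (0);
    }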
Another problem was saving the structures, okay? We have two structures, struct vmx and struct vm, and both of them hold pointers to each other, okay? After restoring them, you would see that these pointers aren't valid anymore, and you cannot simply make them valid, because the structures are created again at other addresses. We had to manually walk all the pointers in the structures and correct them. This was another source of errors, and it took us a month to fix, because a lot of these would crash the host, and you don't have any means to debug it.
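The pointer problem is generic to serializing linked structures: the saved addresses belong to the old process, so every cross-pointer must be re-linked by hand after loading. A distilled sketch, loosely modeled on the struct vm / struct vmx pairing; the field names are illustrative.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct vmx;                                 /* Intel-specific state */
    struct vm  { struct vmx *cookie; };         /* points into the other one... */
    struct vmx { struct vm *vm; };              /* ...and back again */

    /*
     * After copying the raw bytes back from the checkpoint file, the
     * embedded pointers still hold the old process's addresses; patch them.
     */
    static void
    fixup(struct vm *vm, struct vmx *vmx)
    {
        vm->cookie = vmx;
        vmx->vm = vm;
    }

    int
    main(void)
    {
        struct vm *vm = malloc(sizeof(*vm));
        struct vmx *vmx = malloc(sizeof(*vmx));
        memset(vm, 0xaa, sizeof(*vm));          /* simulate stale on-disk pointers */
        memset(vmx, 0xaa, sizeof(*vmx));
        fixup(vm, vmx);
        printf("relinked: %d\n", vm->cookie == vmx && vmx->vm == vm);
        free(vm);
        free(vmx);
        return (0);
    }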
Okay, what's the current status? We basically managed to restore virtual machines with up to seven vCPUs; unfortunately, our testing host didn't have more than eight physical CPUs. And up to seven gigabytes of RAM; likewise, our host doesn't have more than eight gigabytes of RAM. We use a read-only VirtIO disk or a RAM disk, in order to be able to save and restore, because we don't have the save/restore feature for the disk yet; and only one VirtIO network device at this point. We used SSH, ping, and telnet to the virtual machine while saving the state and then restoring it,
and you will see later that the connection is preserved. If you want to test it, I recommend the Git repo on GitHub. Also, I talked with Peter, and probably in two weeks he will create an SVN project branch and import the GitHub repo, in order to have an official branch so that all of you are able to test it. Future work: as I told you earlier, save and restore for the VirtIO block device, eventually making use of ZFS. Also assess performance: how much time it takes to save a virtual machine and to restore it. And also test with other operating systems, after we finish the VirtIO block device. Also, there are a lot of minor issues to be solved: on the path of developing this feature we concentrated on developing new functionality, not on solving the issues. There are some corner cases left, which are documented, of course. Okay. So we basically have a bhyve save/restore mechanism that is working.
Unfortunately, we needed multiple iterations over the save and restore logic, as Peter warned us we would. Basically, in the first three months we threw away a lot of code: we wrote code, threw it away, and so on, until we managed to get a design that is very fast and is extendable for the live migration process. There was a lot of hard work on restoring, due to the many pointers that were crashing the host; and after crashing the host, we experienced a lot of file system inconsistencies and so on. So when crashing a host, we mainly needed to check the file system and so on. And the last point: a lot of time was lost on the virtual machine memory save and restore. The whole memory management in bhyve is quite complex, especially the copy-on-write stuff, and we couldn't get help even from the community. We had some insight, but we didn't have a person who knew very well how bhyve is using the virtual... sorry, the memory management of FreeBSD, and how the memory management actually works. We lost a month or two here, reading the code and trying to make a correct implementation. So, thank you very much for your attention. If you have any questions... I also have a video with the demo, so okay, let's take the questions and then the demo.
Well, for instance, you have the e1000 emulated, partially, and you can live migrate that. I mean, maybe... Sorry, yes, you can live migrate that. Yes, I meant, sorry, I meant PCI passthrough and so on. So you can live migrate emulated devices, but it's not in the scope of the project, because the emulated devices don't offer you good performance. So if you want to use a virtual machine in production, you would use paravirtualized drivers or passthrough devices. It's the passthrough devices I made the mistake about: if you have a device that is passed through, even one running with multiple virtual functions that are passed to a virtual machine, you cannot live migrate that. Okay, and the reason why we didn't test, and probably won't implement, the e1000 migration is that we would have to implement the save/restore logic specifically for the e1000; it doesn't have a common backend with anything else. For example, in VirtIO we have a common backend, and a lot of the work we did on the VirtIO network helped us in saving the VirtIO block device.
Sorry for the mistake. Yes, you're right. Other questions, please. So when you take the... is there like a second, incremental one later? No. At this point, when you say checkpoint, you save the entire memory, disk, and state at that point. We only do this. So, like in VMware, we save the virtual machine checkpoint so we can return to there. For live migration we'll have to do what you are saying: basically, create a checkpoint, start moving the memory to the other host, then create basically another checkpoint and keep moving them, until you have only a little memory left modified; then you freeze the guest and move the remaining part. This was talked through with Peter, and it is planned, but it is for live migration, not now. But this is, sorry... yes, of course. And this is why we implemented the COW: initially we didn't have any COW, we were just saving the memory, but that wasn't useful for an efficient live migration. Okay, thank you.
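For reference, the scheme sketched in this answer is the classic pre-copy algorithm. A rough sketch of the planned loop follows, with stub helpers standing in for code that does not exist in bhyve yet.

    #include <stdio.h>

    #define DIRTY_THRESHOLD 64  /* "few enough pages left" cutoff */

    /* Stubs standing in for the planned (not yet implemented) helpers. */
    static int npass;
    static int send_dirty_pages(void) {     /* returns pages dirtied meanwhile */
        static const int left[] = { 100000, 9000, 700, 30 };
        return left[npass < 3 ? npass++ : 3];
    }
    static void freeze_guest(void)          { printf("freeze guest\n"); }
    static void send_remaining_state(void)  { printf("send last pages + vCPU state\n"); }
    static void resume_on_destination(void) { printf("resume on destination\n"); }

    int
    main(void)
    {
        /* Pre-copy: ship memory in rounds while the guest keeps running. */
        int dirty = send_dirty_pages();         /* round 0: full COW snapshot */
        while (dirty > DIRTY_THRESHOLD)
            dirty = send_dirty_pages();         /* re-send re-dirtied pages */
        freeze_guest();                         /* brief stop-and-copy phase */
        send_remaining_state();
        resume_on_destination();
        return (0);
    }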
Other questions, please. Yes. So actually, from that, we said that we will look at the... okay, so you are raising the question that we take faults when checkpointing, since after the checkpoint we don't remove all the copy-on-write flags. So you are saying that during the checkpoint we will have a lot of faults from the guest? Yes. Okay, this is true, and this probably would impact the performance of a guest that is doing a lot of writes. But yes, you cannot avoid that.
Yes, but how do you detect the faults? So how do you get the CPU to tell you, to notify you, that that page was modified? So you said to... I don't get it. So you say to do a copy-on-write, an incremental copy-on-write: take the first gigabyte and then the next gigabyte and so on. Okay, but at a given moment in time you have a state that you have to save fully, from the top, okay. Okay, so you are saying to... I don't know. So the dirty bit you are talking about, it's for a written page, but I don't think that would be consistent. So you are talking about managing the dirty bits here and meddling with the memory management stuff of FreeBSD.
Those dirty bits are used by the memory management. Sorry, yes, it does EPT. Yes, actually Peter added a few bits of code in the memory management of FreeBSD for the EPT flags, so basically all the second-level page tables are built by the FreeBSD memory management system. Why? Because we want to be able to overprovision: at any point in time we can create more guests than... sorry, you can create guests that have more memory than your physical memory, and you're able to swap, because you have the memory management there and it's doing all the work for you. So right now the memory management of FreeBSD is aware of the EPT flags and so on. Okay, so in the early days all the EPT page tables were written by hand and all the guest memory was hardwired. But I guess three years ago, three and a half, they changed the implementation; this was the first project they did after launching bhyve. And this is why I don't know how to do this without screwing up something in there, because we struggled a lot with the copy-on-write, again, to make it right.
Thank you for your input. We'll look into this. Other questions, please? Yes?
Basically, we save the memory of the virtual machine, okay, and its state; this is all we see. From the host's perspective we only see one process, the bhyve process, which runs the virtual machine, okay, and we save its memory. The socket is in that memory. So when we restore the memory, we restore every part of it, including sockets, page tables, and so on. The problem would be with the other end of the socket, because if too much time passes before restoring, the connection will be closed, okay? We haven't studied thoroughly what happens with that socket a long time after restoring. But that socket should somehow get cleared: the guest would try to communicate on the socket and not manage to, and the socket should be torn down automatically. So we don't even get involved there.
Okay? Okay. As part of the previous question: this controller has state which changes while requests are in flight. Are you draining requests before saving, or are you saving requests in progress? I know about this issue; I don't know, Flavius, what he has done. So actually, Flavius has worked on this feature, and I've talked with him about what to do with the requests that are pending; actually, he freezes the vCPUs. So somehow it stops the drivers, but it waits for all the requests to finish, in order to have a complete state. But there is some chance that we are not doing this very well; it wasn't tested very well. This is why I didn't present the VirtIO part here, because Flavius managed to do this a week and a half or two weeks ago. So it's a new feature for us, let's say. And yes, we have that on the to-do list, to verify. So, does the controller receive some kind of event, 'okay, now you should synchronize state', or is it maybe even sent to the guest, to make the guest free some memory which is not useful, to reduce the dump size? Or does it happen completely unrelated to the virtual machine and its processes?
or it happens just completely unrelated to virtual machine and processes. So basically, this happens unrelated to virtual machine. We do not communicate to the virtual machine. It would be better if you have a daemon, like all the hypervisor have VMware, Hyper-V and so on, that runs inside the guest
and we could communicate with that daemon and say, okay, right now we will start stopping all devices and so on. But we aren't there yet. So we are doing all this work outside of the guest. Basically, this is why we are using the virtio
because on the virtio we have a lot of control being paravirtualized. We stop the guest but we let the controller to finish all the requests and then set the state. Okay, so sorry for you. This is another reason why we use the virtio.
We have a lot of control on it. On the E-1000, for example, we have control of the requests. So for example, we know when the request has completed and also basically it's a client server communication
that we implemented virtio in the guest and the host. Okay, and when freezing the guest, we wait for all the requests to, sorry, we drain all the requests and then save the state. In the, in an emulation part, I don't know if...
It's the same: VirtIO has a ring, and you have a thread that drains the queue. Yes. You can drain that, but the e1000 also has a ring, so you can have a thread which, when you stop the guest, drains all the pending transmission packets. I think it's the same; however, your point is clear. Okay, I know we had some issues, but okay. For sure, it being an emulated device is more complicated, right? No. So, more registers, more stuff, and of course in general more problems. And performance issues, yeah, okay. Okay, other questions, please? Yeah. So, in the state, in the guest state,
do you have all the configuration parameters for how it was started? Yes. Like the emulation flags of the bhyve binary, or is it just internal state? It's internal state, actually. So when you are passing the parameters on the command line, basically some internal structures are populated with the memory amount, the number of vCPUs, and so on. But we cannot reproduce all the flags from inside bhyve, okay? And for now, we are basically running the guest with... we are using the bhyveload command with the same parameters, okay? So saving and restoring these would be the job of some external script, okay? We aren't saving any of this on disk. We could do this, but again, this would be the job of an external script, not our work from inside bhyve. Can you restore on a different machine? I guess they tried. Yes, they tried, but they didn't try on different...
they were two identical hardware machines. So it should work, because we are... sorry? No, we are taking from the VMCS only the fields that we are interested in. So at first, yes, yep, that won't work, yes. No, that won't work, because that depends on the CPU. Yep, let me look at the code. I guess the new version of the code is doing this, because we had trouble with saving the page, and I... I guess that the code is old, honestly. Because we had trouble loading the VMCS on the same machine, on the same CPU: it had host registers that were stale from the state when we saved the virtual machine. So basically we are taking from the VMCS only the guest registers. Let me look at the code later and I will tell you. So in principle it should work; if not, we have to fix this.
Other questions? Okay, here's the video. I don't think... it's not... yes, I know, but it's not. Okay, right now we are loading a guest and running it; it's a virtual machine called this VM. Okay, right now we are getting
the virtual machine's IP address, going to another tab, and SSHing into that virtual machine. So at this point we have two consoles: the console of the virtual machine and the SSH session. There we type some commands, so it writes something on the console. That is all, okay. The root prompt there is the host. At this point, we will suspend the virtual machine.
As you can see, the virtual machine disappeared. Okay, so that was the console: when it's suspended, the virtual machine dies, and we also save all its state on the disk. At this point, we load the virtual machine again, an empty virtual machine. And when invoking bhyve, we have a flag, -r, to tell it which checkpoint to use when running the guest. You see there 'command not found', because it tried to execute. And the SSH session is also still up. Okay, so you have the SSH session working.
And another test, with another checkpoint. So basically it runs a while loop that displays the date every 0.2 seconds. You can see here we suspend again and create another checkpoint; so this is another checkpoint of the virtual machine. You can see there the virtual machine died, and you can see the SSH session doesn't show anything anymore. And now we restore the virtual machine. At this point, the SSH session is working again and displaying the while loop. So basically we restored the virtual machine to the state it was in. Okay, this was the demo. Again, you have the Git repo; it was rebased on master, I guess, one week ago, so we are in line with the latest modifications of FreeBSD, and you can test this feature. And you could see that all the parameters were in there; basically, we have to know the parameters. You have to create a wrapper on top of this, saving in another file the parameters with which you have run bhyve.
Okay, any questions? Thank you very much for your attention and for your questions. Okay, if you have any questions, you have my email and my students' emails, and you can send us a ping if you have any problems in testing it, and we'll help you bring it up. Thank you.