QCOW2 in VMD
Formal metadata
Number of parts: 34
License: CC Attribution 3.0 Unported: You may use, modify, and copy the work or its content in unchanged or changed form for any legal purpose, distribute it, and make it publicly accessible, provided you credit the author/rights holder in the manner specified by them.
Identifiers: 10.5446/45168 (DOI)
Transcript: English (automatically generated)
00:13
My name is Ori, and I'm here to talk about QCOW2 and VMD. So I'm going to start blathering about a few different sections, starting off with what
00:22
actually is QCOW2, how do I use it, and then the main part of it, which turns into a code review of the implementation, kind of. OK, so before I start off, are there any questions? As for how I prefer to go: feel free to stop me at any time.
00:41
I'd rather deal with confusion or people getting lost early rather than late. And if there's an interesting sidetrack, I don't mind going down it a little bit. Is that a challenge? No? So, why did I end up writing QCOW2?
01:01
Well, last year at BSDCan, Peter Hessler was talking about how he really wanted it, and he fooled me into looking at the spec. I looked at the spec, and I went, oh, that looks like a weekend project. A three-line diff? Yeah, a three-line diff. A few weeks later, I had a diff. It was 700 and some lines, so a little bit off, but not that bad.
01:26
So if you want to fool me into doing stuff, I think I've given away the secret. Anyways, so why do we care about QCOW2? Well, before I go to QCOW2, let me talk a little bit about what we had before.
01:40
Raw disks. Here is a diagram of the structure of a raw disk image. It's just a giant chunk. It's a one-to-one mapping: if you're writing to offset 1000 in the disk, you write to offset 1000 in the raw disk image. Every byte that you could potentially use is represented in the raw disk.
02:05
So this actually has a few upsides. It's pretty fast. It's pretty reliable. So you don't have to worry about any higher-level structure. You're not seeking around to figure out where you need to do the writes. And any failure mode is going to be your file system shitting itself.
02:24
So they're good that way, but they've also got other problems. They're wasteful, and they don't have snapshots. Everyone loves snapshots. So QCOW2 solves those two issues. QCOW2 is the native QEMU disk format, which means that if you are using QEMU,
02:40
you'll be able to just boot up your QEMU disk in VMD, assuming all the other stuff in the VM actually works. We'll be able to read the disk. I don't promise that it'll boot. It does copy on write, so it's a little bit of a lie there, but it mostly does copy on write.
03:00
So that means that if you have snapshots within the disk, the data will be copied before it's modified. It grows on demand, so you start off with a very small disk image, and as you start putting more stuff into it, it gets bigger to accommodate. And it does snapshots. It actually does two kinds of snapshots. All of this sounds a little bit like ZFS.
03:22
And if you have ZFS, why would you want this? Well, we're OpenBSD, we don't have ZFS, so we want this. Cool. So, as I said, there are two kinds of snapshots. There are internal snapshots, which are snapshots within the QCOW2 disk image,
03:41
and there are external snapshots. These are where you basically chain together disk images. You have your base image, and then you have a derived image that points back at it, and you can chain these as far as you want, so the derived image can point to a derived image that points to a base image, and so on and so on.
04:01
So, now that you've got a general idea of what QCOW2 is and what it does, I'm going to go into how you use it with VMD. There's actually not much different: whatever worked with a raw image will probably just work out of the box with a QCOW2 image.
04:22
So, the biggest difference is when you create the disk images. So, you can do a vmctl create qcow2:test.img -s 32. The qcow2: prefix says, make this into a QCOW2 image instead of a raw image.
04:40
Alternatively, you can just give it a .qcow2 suffix, and we'll decide this is a QCOW2 image and make it that way. Nothing surprising here. This is all, hopefully, pretty uninteresting. And as you can see, after we create the test image, you've got a 256 kilobyte disk image.
05:02
It's there. You can use it. You can do stuff with it. So, for example, you can start vmd and point it at the disk, and here I'm being way overspecific. You do stuff in the VM. In this case, I believe what I did was install OpenBSD on it,
05:23
and you can see it's grown to one and a half gigabytes of disk space. The grow-on-demand mechanism, as advertised. If you want to do a snapshot, you can do vmctl create qcow2:derived.img -b test.img; -b specifies your base image.
05:42
As you can see, the derived image points at the one and a half gigabyte base image, but it's only 256K. And as you write to it, it will grow. And it essentially just contains the diff off of the base. So, you can continue doing stuff with the VM.
06:02
And when you're done with it, you can create a new derived image. That's how you roll back a snapshot. OK, and the last one is in-disk snapshots. No one's pestered me enough to implement it in the vmctl command. So, use qemu-img.
06:20
We're compatible. That's the nice thing about using a well-defined format. So, here's where the code review starts. How does it work? Well, at a very, very high level, a QCOW2 disk image is essentially a page table. In the QCOW2 documentation,
06:44
they call them clusters. Basically, you walk a page table and grow the disk for fresh pages. So, here's the diagram. You've got the metadata chunk, which points to two different tables. You've got the L1 table here,
07:02
and the L1 refcount table here. So, when you look something up, you walk down the L1 table, find the L2 chunk that it refers to, and the level two for the refcount actually contains the count: a 16-bit number for each chunk of data.
07:22
And then you have the data pointing down, and the L2 chunks contain the pointers to the data blocks, which may or may not be in order on disk. With in-disk snapshots, you get a very similar structure,
07:43
but you have multiple kinds of data blocks that you walk down to the data. So, a snapshot would point to this set of L1 tables, which point to L2 tables. The L2 tables can be shared.
08:02
In this diagram, they're not, because the arrows got confusing. So, yeah, and then the data blocks are also mostly shared. This one is not, hence the different color. So, any offset can be kind of broken up into three different sub-chunks.
08:25
You have the cluster offset, which is how far into the data you're reading. So, for example, if you want to read only these bytes, then you have a cluster offset that points into the middle of the chunk. You have the L2 section,
08:43
which points to how far into the L2 block you go. So, this one would have an L2 offset of zero. And then there's the L1 offset, which points into the L1 table. I think I forgot to mention, the L1 table is always contiguous.
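The three-way split just described can be sketched in C. Everything here is illustrative: the constants assume 64 KiB clusters with 8-byte L2 entries (the common qcow2 defaults), and none of the names are taken from the actual vmd source.

```c
#include <stdint.h>

/*
 * Split a guest disk offset into the three pieces described above:
 * the index into the L1 table, the index into the L2 table, and the
 * byte offset within the data cluster.
 */
#define CLUSTER_BITS	16			/* 64 KiB clusters */
#define CLUSTER_SIZE	(1ULL << CLUSTER_BITS)
#define L2_BITS		(CLUSTER_BITS - 3)	/* 8192 8-byte entries per L2 cluster */
#define L2_ENTRIES	(1ULL << L2_BITS)

static void
split_offset(uint64_t off, uint64_t *l1_idx, uint64_t *l2_idx,
    uint64_t *cluster_off)
{
	*cluster_off = off & (CLUSTER_SIZE - 1);		/* byte within the cluster */
	*l2_idx = (off >> CLUSTER_BITS) & (L2_ENTRIES - 1);	/* entry within the L2 table */
	*l1_idx = off >> (CLUSTER_BITS + L2_BITS);		/* entry in the contiguous L1 */
}
```

With these sizes, the low 16 bits are the cluster offset, the next 13 bits pick the L2 entry, and everything above that indexes the L1 table.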
09:05
So, how do we plug this in? Well, when I started with QCOW2, I saw that we were just doing pread and pwrite directly into the raw disk, which makes perfect sense. It's a simple interface. It does exactly what you want, almost.
09:24
And it works. So, I basically took that and said, well, I just want to have everything call the same functions, but I want to be able to overload them. So, I created a virtio disk backing, which contains a pointer to some data.
09:42
In the raw disk format, it contains a pointer to the file descriptor. In QCOW2, it contains a pointer to the QCOW2 state struct. You have pread, pwrite, and close, which all map directly to the file system operations. pread reads a buffer of the given length from the offset off.
10:03
pwrite does the same thing, but instead of reading, it writes. And close cleans up the resources, flushes whatever is in memory to disk, and does any cleanup that needs to be done to make sure that the disk is consistent when you're finished. The QCOW2 disk image always starts off with a header,
10:22
and we read out all of the data that we care about from that. So, it starts off with a magic number that identifies it as a QCOW2 disk, version 2. It has a version number just because the magic number isn't quite enough, for some unknown reason. It's got the backing offset and backing size,
10:40
which is your back pointer to the chained disk. This is all, I believe, little-endian on disk. I could be wrong, it could be big-endian.
11:01
And then we do the swapping when we read it in. So, the cluster shift lets you define how big each cluster is. The disk size is exactly what you think: how many bytes the disk is. The crypt method is the encryption that we use. We don't actually support any right now, but patches welcome.
11:23
The L1 size and L1 offset say where to find the L1 table. The refcount offset and refcount size say where to find the refcount tables, and the snapshot count and snapshot size say where to find the snapshot table, which we don't really do much with at the moment. Then there are the incompatible features, which are features that, if they exist, mean we can't load the image.
11:43
Autoclear features are features that, if you understand them, you can use; if you don't, you need to turn them off, because you may corrupt some caches or something. And then there are the compatible features, which you can just ignore.
12:04
And then there's the refcount size, because someone may want more than 65,000 snapshots. The core of the QCOW2 disk image is the translate function, which is really the only actual code I'm going to go into.
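The header fields just walked through can be sketched as a C struct. The field names follow the qcow2 spec loosely and are assumptions for illustration, not the actual vmd struct; on disk the spec stores these fields big-endian, so each one is byte-swapped on load, as in the helper below.

```c
#include <stdint.h>

/* Sketch of the qcow2 header fields discussed above, in rough on-disk order. */
struct qcheader {
	uint32_t magic;			/* "QFI\xfb" */
	uint32_t version;		/* 2 or 3 */
	uint64_t backingoff;		/* back pointer to the chained base disk */
	uint32_t backingsz;
	uint32_t clustershift;		/* log2 of the cluster size */
	uint64_t disksz;		/* virtual disk size in bytes */
	uint32_t cryptmethod;		/* encryption; unsupported for now */
	uint32_t l1sz;			/* where to find the L1 table */
	uint64_t l1off;
	uint64_t refoff;		/* where to find the refcount tables */
	uint32_t refsz;
	uint32_t snapcount;		/* where to find the snapshot table */
	uint64_t snapsz;
	/* version 3 only: */
	uint64_t incompatfeatures;	/* unknown bit set: refuse to load */
	uint64_t autoclearfeatures;	/* unknown bit set: must be cleared */
	uint64_t compatfeatures;	/* safe to ignore */
	uint32_t refcountorder;		/* width of the refcount entries */
};

/* Decode one big-endian 32-bit field as read off disk. */
static uint32_t
be32dec(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

The magic bytes are the ASCII characters "QFI" followed by 0xfb, i.e. 0x514649fb once decoded.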
12:21
It takes the offset into the disk and figures out where to go. So it starts off by grabbing the offset, dividing by the cluster size and the L2 size, which, if you think about it, gives you the index into the L1 table. Indexing into the L1 table gives you the offset of the L2 table.
12:42
The offset of the L2 table has a little bit of extra stuff in it; specifically, there's one bit in there that says whether you need to copy the data or not. So this means that, well, if that bit's set,
13:02
you don't need to look at the reference count at all. If the value is zero, we don't have the data, and we just return zero. Otherwise, we get rid of that bit and read the offset off the disk from the L2 page.
13:21
You'll notice that the L1 table is kept in memory for performance; the L2 table, because we end up seeking around a lot, we read off disk. We're not reading the whole cluster, just the data that we care about. We're hoping that the buffer cache will actually keep the data in memory. Thanks, Bob, for fixing that.
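Put together, the walk just described looks roughly like this toy, fully in-memory version. A real implementation reads the L2 tables off disk and the L1 entries carry flag bits of their own; the names, constants, and in-memory tables here are all illustrative, not the vmd code.

```c
#include <stdint.h>

#define CLUSTER_BITS	16			/* 64 KiB clusters */
#define CLUSTER_SIZE	(1ULL << CLUSTER_BITS)
#define L2_BITS		(CLUSTER_BITS - 3)	/* 8192 8-byte entries per L2 table */
#define L2_ENTRIES	(1ULL << L2_BITS)
#define QCOW2_COPIED	(1ULL << 63)		/* "no copy needed" flag in an L2 entry */

/*
 * Translate a guest offset to a physical image offset: index the L1,
 * follow it to an L2 table, mask the flag bit out of the L2 entry,
 * and add the offset within the cluster.  Returns 0 for unallocated
 * clusters, which the caller backfills with zeros.
 */
static uint64_t
translate(const uint64_t *l1, const uint64_t **l2_tables, uint64_t off)
{
	uint64_t l1_idx = off >> (CLUSTER_BITS + L2_BITS);
	uint64_t l2_idx = (off >> CLUSTER_BITS) & (L2_ENTRIES - 1);
	uint64_t cluster_off = off & (CLUSTER_SIZE - 1);
	uint64_t l2_entry;

	if (l1[l1_idx] == 0)
		return 0;			/* no L2 table: unallocated */
	l2_entry = l2_tables[l1_idx][l2_idx] & ~QCOW2_COPIED;
	if (l2_entry == 0)
		return 0;			/* no data cluster: unallocated */
	return l2_entry + cluster_off;
}
```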
13:42
And then we have the cluster at the end. So how do we update the reference counts? You may notice that we actually didn't pay any attention to them. Well, we're lazy about it. All we care about is whether the copy-on-write bit is set or not.
14:02
If the copy-on-write bit is set, we copy the cluster and set the reference count to one. If it's not, we just write in place, because nothing else is sharing it. Nothing else cares if we're modifying it or not. Well, then why have a reference count? The reference count is updated with snapshots,
14:20
and it should be decremented when we copy something on write, which would mark the old cluster as unused. We don't do anything with that, which means that if you create an in-disk snapshot, modify the data, and delete it, we'll leak the clusters.
14:41
That should be fixed. But as far as I'm aware, no one's using in-disk snapshots, so it's not currently a big deal. So I'm just going to go through a few set operations to make sure that everyone's on the same page.
15:01
First off, case zero, reading the data. Let's just say that the data is already there, and we just need to know where it is. So we've got our simple disk over here. It's got one data cluster with an L1 table,
15:21
an L2 table pointing at that, and then a reference count. So all we do is walk down, find the L1 table, find the L2 table, and grab the data out of the block. Now what if the data is missing? Well, the disk is virtual,
15:41
which means that the data is always there; it's just a question of what it is if you haven't written to it. So we fill it in with zeros. You go through, you find the L1 table, you find the L2 table, and you go, oh, there's nothing in this cluster. Well, our read must return zeros.
16:07
So for writing, let's just say that the data is there, it's not shared, and all we need to do is update it. So we walk through the first cluster,
16:22
or rather, through the metadata: find the first cluster, find the L1 table, find the L2 table, and write to the data. The data is there; we just need to find the cluster and write to it. There's nothing too interesting about that.
16:45
Now, what about writing shared data? Let's just say that this is in a snapshot, and we're writing to a new snapshot. Well, the data is there, but someone else can see it, which means that we can't modify it in place.
17:01
So we copy the data on write. In this case, we go through, we find the data is there: we start from the metadata cluster, go to the L1 table,
17:20
go to the L2 table, and the L2 entry doesn't have the copy-on-write bit set. And in this case, the reference count is yellow, not green, indicating that someone else can access this data, so we need to make a copy. So we make a copy.
17:43
In this case, I'm also assuming that the L2 table would need to be copied. So we end up needing to walk L1, L2, copy, and then find the data, copy it, and update all of the blocks going up the tree
18:00
to the root to make sure that we have the new data visible. We also need to update the reference count of the new block. And in this case, we set it to 1 for the new copy.
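The copy-on-write path above can be sketched with a toy in-memory "disk". One deliberate difference from what the talk describes: this sketch also decrements the old cluster's refcount, the step noted as currently missing. Cluster size, struct layout, and names are all made up for illustration.

```c
#include <stdint.h>
#include <string.h>

#define CLUSTER		8			/* tiny cluster size for the sketch */
#define QCOW2_COPIED	(1ULL << 63)		/* "private, write in place" flag */

struct toydisk {
	unsigned char data[16 * CLUSTER];	/* cluster-granular stand-in for the file */
	uint64_t end;				/* current end of the "disk", in bytes */
	uint16_t refcount[16];			/* one count per cluster */
};

/*
 * Decide where a write to the cluster named by *l2_entry should go.
 * If the cluster is private, write in place.  If it is shared,
 * "ftruncate" the disk longer, copy the old cluster to the end, fix
 * both refcounts, and repoint the L2 entry at the fresh copy.
 */
static uint64_t
cow_cluster(struct toydisk *d, uint64_t *l2_entry)
{
	uint64_t off = *l2_entry & ~QCOW2_COPIED;
	uint64_t newoff;

	if (*l2_entry & QCOW2_COPIED)
		return off;			/* not shared: write in place */

	newoff = d->end;			/* grow the disk by one cluster */
	d->end += CLUSTER;
	memcpy(&d->data[newoff], &d->data[off], CLUSTER);
	d->refcount[off / CLUSTER]--;		/* old cluster loses one reference */
	d->refcount[newoff / CLUSTER] = 1;	/* new cluster starts private */
	*l2_entry = newoff | QCOW2_COPIED;	/* repoint and mark writable in place */
	return newoff;
}
```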
18:20
So if we need to write new data, we'll need to create a new block, and we may need to create a new L2 entry. So, in this case, we walk from the metadata to the L1 table. The L1 table should not
18:40
be blue here; essentially, we have no entry there. So, we need to go through and create the L2 entry, and then create the data entry. It looks a lot like copy on write, except instead of copying, you're just creating a new block: ftruncate() the disk,
19:01
toss the data on the end, and update everything going up to the root. Okay, so, what if we need to read from an internal snapshot? Well, for reading from an internal snapshot, everything I've just gone through still applies.
19:22
The only difference with internal snapshots is what you start from. So, where you have this metadata block, you may have another one over here that you're going to have a different L1 table from, and you just walk through.
19:43
For external snapshots, well, the data might not be in the first place we look, because we've got multiple disks. So, we need to follow the list of base images. For reading, this is pretty simple. You do the walk, you find the L1 table, and you go, oh, that data is not there. Let's look at the base image. The data is there. Well, then you return.
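The read-through-the-chain just described can be sketched like this. disk_lookup here is a toy stand-in for the full translate walk on a single image (one optional cluster per disk), not a real vmd function; the chain loop is the part that matters.

```c
#include <stddef.h>
#include <string.h>

struct qcdisk {
	struct qcdisk *base;		/* NULL at the end of the chain */
	const unsigned char *cluster;	/* toy: one optional data cluster */
};

/* Stand-in for the translate walk on one image: 1 and fill buf on a hit. */
static int
disk_lookup(const struct qcdisk *d, unsigned char *buf, size_t len)
{
	if (d->cluster == NULL)
		return 0;
	memcpy(buf, d->cluster, len);
	return 1;
}

/*
 * Try each disk from the top of the chain down; if none of them has
 * the cluster, backfill with zeros, since the disk is virtual.
 */
static void
chain_read(const struct qcdisk *d, unsigned char *buf, size_t len)
{
	for (; d != NULL; d = d->base)
		if (disk_lookup(d, buf, len))
			return;
	memset(buf, 0, len);
}
```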
20:02
Exactly the same as if you hadn't had the base image. If it's not there, you keep going, and eventually you hit the end of your chain of base images, and you backfill just as though the data wasn't there at all. Now, on the other hand, if you're writing, the data might still not be in the first place you look,
20:22
but we can't just copy-on-write in the place that we find it, because then you're modifying the wrong disk. This was a bug in the first versions. So, what we end up doing is, we walk down to the previous disk,
20:41
copy it over, and then update the tables, so the disk keeps on containing the diff of all the writes against the base. All of that seems simple enough. What about reliability? Well, raw disk images
21:01
are still your best bet there, because there's a whole bunch of big pieces of data to update. What happens if, for example, you write the cluster and then crash? Well, now you've leaked the cluster, and you haven't updated all the parents.
21:20
What if you crash while you're writing the parents? Now you've got an out-of-sync table. So there are a bunch of windows for corruption. I think right now several of them are big enough that you could drive a truck through. I'd like to get them a little bit smaller; maybe we can get them down to a point where all you can drive through is a minivan.
21:43
But, QCOW2 is not the most reliable disk image. It's not terrible, but I've definitely seen corruption when crashing VMD. Don't crash VMD.
22:02
So, as far as the next things I'd like to work on: there are known bugs, so the leakage on in-disk snapshots is something that I'd like to fix. I'd like to take a look at maybe doing some external journaling to fix the
22:21
reliability issues. So, do a write-ahead log in an external file, record what we're going to do to the disk, and then actually do it. That way, if we crash, we have something that we can replay. I don't know if that's a good idea or not. The other thing that we're doing right now that
22:41
really sucks for performance is that every time we access the disk, we're actually doing a system call to read and write. It might be worth keeping the data in memory instead of hitting the disk on every access. We can also do a neat
23:02
thing like this: right now, when we search for a cluster on every read, we'll chain through all of the disk images until we find something. We could keep a Bloom filter, for example, to say which disks contain a cluster.
23:21
And that would let us skip going through the start of the chain. There are a bunch of extensions that are probably useful in QCOW2. I haven't taken a look; it might be worth figuring out if we want to pick some of those up. And making internal snapshots an officially supported part of the QCOW2 toolchain would probably be worth doing.
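The Bloom-filter idea could look something like this minimal sketch. The filter size and the two multiplicative hashes are arbitrary choices for illustration; the useful property is that a zero answer means "definitely not on this disk", so the chain walk can skip that image entirely, while a nonzero answer only means "maybe".

```c
#include <stdint.h>

#define BLOOM_BITS 1024

struct bloom {
	unsigned char bits[BLOOM_BITS / 8];
};

/* Cheap multiplicative hash of a cluster index, salted by seed. */
static uint32_t
bloom_hash(uint64_t cluster, uint32_t seed)
{
	uint64_t h = (cluster + seed) * 0x9e3779b97f4a7c15ULL;

	return (uint32_t)(h >> 33) % BLOOM_BITS;
}

/* Record that this disk contains the given cluster. */
static void
bloom_add(struct bloom *b, uint64_t cluster)
{
	uint32_t h1 = bloom_hash(cluster, 1), h2 = bloom_hash(cluster, 2);

	b->bits[h1 / 8] |= 1 << (h1 % 8);
	b->bits[h2 / 8] |= 1 << (h2 % 8);
}

/* 0 means definitely absent; nonzero means possibly present. */
static int
bloom_maybe_has(const struct bloom *b, uint64_t cluster)
{
	uint32_t h1 = bloom_hash(cluster, 1), h2 = bloom_hash(cluster, 2);

	return (b->bits[h1 / 8] & (1 << (h1 % 8))) &&
	    (b->bits[h2 / 8] & (1 << (h2 % 8)));
}
```

Bloom filters never give false negatives, so a skipped disk really can't contain the cluster; the occasional false positive just costs one wasted lookup.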
23:44
Right now you can't do anything with internal snapshots using vmctl. You need the qemu-img tool to manipulate them. So, that's basically all I have.
24:01
I think we're kind of early. I'm just thinking about the reliability issue. You write the L1, you write the L2, then you write the data. What if you did it in reverse order? You figure out what you're going to do, then do it in reverse order. You may be losing it if it crashes,
24:22
but it's a crash. I mean, all kinds of things can happen. But when it comes back, you won't have a corrupt disk. Yeah, I think we can do a better job about that. I have to think if there's any other issue with it, but yeah. There are other pieces of data that we also need to update,
24:41
like the reference counts. I think we can update those first, though. So if you do the reference count first, the worst case is you leak a cluster. Then you do the write to the cluster. Maybe the copy.
25:00
Then you do the... actually, if you update the reference count first, you might grow the disk and get an out-of-bounds cluster. So we'd have to deal with that. But yeah, we can definitely do a much better job about consistency. I assume you have some sort of process that goes through the disk and finds and fixes corruption? Currently, no.
25:20
There is a qemu-img tool that will report corruption. I don't know if there's anything that will fix it. It wouldn't be too crazy to write. Yeah, the format's not too hard to manipulate, so we can probably do that. Any other questions?
25:43
There are some extensions that are possible; I would have to look at the spec, and I don't remember. There's definitely a bitmap for the copying, so you could, ideally, store it all in memory and not have to hit disk on every
26:02
reference count operation. Yeah, I'm trying to remember what the others were. Compression? Oh yeah, encryption is not an extension; that's in the base format. Compression, I think, might be an extension. I definitely
26:21
didn't see a field for it in the header. Yeah. Are there any extensions that anyone here is interested in? We can also do our own.
26:44
Specifically, QCOW: what are the storage formats in VMD? In VMD, we just have two: we've got QCOW2 and we've got raw disks. Is there any thought about having one specific to VMD?
27:01
I haven't thought about it. Has anyone else? But having a VMD-specific format? Not necessarily specific to VMD, or Azure. QCOW2 seems to be the Latin of VM storage formats,
27:20
the one that everything gets converted back into or, I think, translates into. Yep. Oh yeah, I forgot to mention: you can translate between raw and QCOW2 disk images. And, as a hack, that will also clean up the leaked clusters. So, convert from QCOW2, convert back, and you'll get rid of those. Yes. Any thoughts on raw VMDK?
27:42
Which is nothing more than a file that describes the raw image. With VMDK, you don't have to implement all of the ESX stuff; you can actually use a raw file with a little text file. And it's very easy to convert a normal VMDK into one of those
28:02
before exporting. This is for VM transportability; it's how I actually migrate VMs off of ESX onto other VM platforms. What does the text file contain? Key equals value. Very simple. It describes the image.
28:22
It tells how many blocks and sectors there are. I think it gives you the UUID and some sizes. Okay. I'm not sure we actually... You literally would be supporting it as a raw type, other than the fact that you read the
28:40
VMDK text file to get the parameters of the raw file. But that's not cute. It's just another extension to support another type in VMD. And it sounds like it wouldn't be hard. I think the hard part of that would be, I don't know if we actually support
29:01
setting the UUID and so on in VMD. Yeah. So we'd have to actually add support for that in VMD, in the emulated hardware.
29:23
Okay. So I can do a demo. If anyone's interested. Cool.
29:40
So I've got a bunch of VMs here. Some of them are QCOW2, some of them are not. Someone asked if you could make it bigger. Ah. Yes. Yes, I can. Some of us have old eyes. Let me try to remember how.
30:04
Nope. So this is st.
30:23
There's definitely a key binding for it. If you can't find it in 30 seconds,
30:43
the old guys just lose. Including me out here. Ah. Let me think. Nope. That's not it. I think I just closed the window. Everybody who likes hipster pastel colors
31:01
on pastel colors has never given a presentation with them. Don't use reverse video. Don't use pastel on pastel. Use dark on light. That's actually what I was thinking. You can use this. Fine. Better? Actually, yes.
31:23
That's just one size. Oh. Now we're editing that directory. Let's see how well this works. This is st; it is probably not very smart. Um. Okay. So I do have xterm.
31:40
Does anyone know the key combination for making it bigger? You can Ctrl-right-click. That's easy. xterm: Ctrl-right-click. Large or huge? Huge.
32:04
Oh. Oh yeah. Much better. Um. Actually, try large. There's no guarantee. No. Okay. Well. That's as good as it gets. The disk image,
32:40
which is in QCOW2 format. Currently it is 0.8 gigabytes in size.
33:03
1.7. I can boot it. It does not do them very well.
33:23
xterm does not deal with this. That's okay. You can still run it at the end of the talk. You have an xterm; your termcap drops. Yeah. stty sane. stty sane.
33:55
It will shut down now. And then I will stty sane.
34:05
If I could. The lesson you've learned is when the hecklers make you switch your tools there's a two word answer to the hecklers. F U?
34:20
That's two letters. It's also a two-word response: stty sane. tset. No, not tset. I can create
34:48
a disk image off of this.
35:04
Create the derived image: vmctl create derived.qcow2 -b openbsd.qcow2 -s... Actually, I don't need the -s, because it has a base image.
35:20
It knows about the base image's size. It actually needs to match. Otherwise things will kind of go sideways when you try to read outside the base image. So I've got the
35:42
derived image. Now I can boot the derived image. Great. It did something.
36:04
I'm going to switch to the other terminal, where it kind of works. I've seen it be cranky on xterms before. I just wrote it off.
36:23
I think it's the full screen. Yeah, it confuses it. Yeah, the full screen might do it. OK. Let's scribble on the disk. Let me show you the
36:41
Let me prove that the disk is small. Ctrl-right-click. Huge. Extremely not huge.
37:01
And you can see it's 124 megabytes. What has it written? Oh, it relinked the kernel. There we go. And shuffled all the shared libraries around. Yeah, that'll do stuff. OK, writing to the disk.
37:34
And it's growing. And it filled the disk. So, yeah.
37:43
As you can see, QCOW2 is working. I guess that's about it. And now I'm going to leave the other... Oh, yeah, sure.
38:06
It did not change. In fact, it will actually complain if it's not a read-only file, I believe. So, we don't even open it with write permissions.
38:24
The other interesting thing that I probably should have put in the slides is there's a whole dance around getting the file descriptors for the QCOW2 disk into VMD, because of the
38:40
privsep. VMD does not have permission to open anything. For a long time, if you used a derived image, things would go kaboom; well, things would go kaboom if you didn't comment out all of the pledge and chroot and so on.
39:01
So that corrupted a few disks. The reason for this: what happens is you've got a control process, which you signal from vmctl to tell it to open the disks, pass in an array of file descriptors for all of the derived images, and send them all to the VMD process.
39:23
And until you do that, you can't open the disks. So, the VM doesn't actually have the ability to open any files on your disk.
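The descriptor-passing dance described here can be sketched with SCM_RIGHTS over a Unix socket. This is a hedged Python stand-in for what vmd actually does in C; the socketpair plays both processes, and the temp file and message name are made up:

```python
# Python stand-in for the privsep fd-passing dance described above.
# vmd does this in C (sendmsg with SCM_RIGHTS); everything here is a toy:
# the socketpair plays both processes, the temp file is the "disk image".
import os
import socket
import tempfile

control, vm = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Control process": it alone may open files. It opens the disk image
# and ships only the descriptor, never the path.
fd, path = tempfile.mkstemp()
os.write(fd, b"qcow2")
socket.send_fds(control, [b"disk0"], [fd])   # SCM_RIGHTS under the hood
os.close(fd)

# "VM process": pledge would stop it from opening anything itself, but a
# received descriptor is fair game.
msg, fds, flags, addr = socket.recv_fds(vm, 1024, maxfds=1)
os.lseek(fds[0], 0, os.SEEK_SET)
data = os.read(fds[0], 5)
print(data)                                   # b'qcow2'
os.close(fds[0])
os.unlink(path)
```

The point of the pattern is that the unprivileged process never sees a path, only a descriptor, so the pledge and chroot restrictions can stay fully locked down.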
39:48
Encryption features: does qcow2 offer disk encryption? Is it redundant with what we have in software, or could you do something different, or do they complement each other?
40:02
I think it's... well, if you want to run an operating system that doesn't have full-disk encryption, then encrypting your qcow2 disk is useful. Otherwise you wouldn't use it. I'd probably use the guest's own encryption for a VM.
40:20
It's useful if you don't want to use that for whatever reason, but I don't think there's any particular advantage, other than the guarantee that regardless of what you run, it can be encrypted.
40:42
If I understood correctly, the base image must remain unchanged as long as you want to use the derived image. So as soon as you derive an image from the base image, things will get very, very unhappy if the base image changes. How do you keep track of that? Is it enforced?
41:01
No. It's kept track of through the sysadmin. It's up to you if you decide to break your derived images. We turn off the read permission
41:20
sorry, turn off write permission... well, turn off write permission on the base image, to prevent human error. But you said it will complain if it's not read-only? It should.
41:40
If it doesn't, I should fix that. So if you have a derived image, is there a way to mount it without booting the VM first? So that, for instance, you would provision some base template for a machine
42:00
and then you go and say, okay, I want it to have this hostname in it, and I mount it in the host, swap some files around, close it, and then boot the machine, and it comes up as an installed BSD box. Would that be possible somehow? It's possible, but it's evil.
42:20
We have tools at work that do that sort of thing with raw disks. So what you need to do is: you need to know what file system the VM itself wrote on top of that. Yeah. So we assume it's FFS, and then you'd have to have a way
42:41
to emulate that block device in userland, and then mount FFS over the top of the image. And then you'd have to actually know about the format. Yeah, you'd have to know about the format.
43:01
If you wanted to mount it in the host, that's what vnd is for. A vnd that speaks qcow2 would just be a patch to vnd.
43:22
And the code itself is actually fairly small. I checked before this talk. And you don't think it would be far off from that? No. You'd do it in the block layer. This is in the block layer.
43:41
You should use the kernel's FFS code to deal with the FFS on the disk. Yeah, it should be. The problem is you need to emulate SCSI. So you need to do SCSI.
44:02
You need to do the target side. Yeah, yeah, yeah. Sure, it's just SCSI target code. I've been involved in at least two projects that involved basic SCSI target code. I'm not doing it again.
44:21
A simple vnd. How many derived images can there be? There can be more than one derived image at the same time, right? So you can have 50 clients
44:40
running off of derived images, all sharing the same base. So this is actually why it's kind of interesting. Peter keeps telling me he's got some code he's working on for ephemeral snapshots, where you basically can create a disk image and say: I only want
45:00
to run with these disks for this long, or throw away the disks when you're done using the VM. So you can do that, and you can run as many VMs as you want off of that. And this will be great for, say, testing. If you're doing file system development, then you want to poke at it, and if you screw up the file system,
45:22
go back to something that's actually workable. This is kind of a useful feature. Or: oh look, I'm screwing with libc today. Yeah, there you go. Oh look, I just made my system unusable. I'm going to go back to the old system.
45:41
Or if you're doing anything else that will screw up your system and you're doing development on it. Very useful. That's what we do. I do all my installs on VMs with displayable file systems. QEMU has a Snapchat option which does exactly that and it curates the drive images, unlinks it
46:01
so it's hidden away from you. I would use this for simulating networks, if we're doing kernel testing or kernel development. When I'm done, delete it, I don't care. Yeah, regression tests would be a great place for this. As a side note,
46:20
initially I implemented the ephemeral snapshots where we kept all of the diffs in memory instead of syncing them to disk. This turned out to be a bad idea, because we tend to run out of memory fairly quickly. Well, then you can do this easily: if you want it disposable, when you start your derived image you simply have an option where
46:41
VMD unlinks the file. Because it's just going to open it and pass a descriptor; it unlinks the file, and the file is gone. As soon as VMD stops and the last opener closes it, that's gone from the file system. I have all that code in the wrong place. Yeah, you just need to put it in.
47:02
The only thing that's a little bit iffy about that is: where do you put the image? It doesn't matter. Well, it does if you put it in /tmp and your /tmp is kind of small. Yeah, that's true. Don't be stupid. My point is you can do it in your directory right away, whereas today... as soon as it starts, it unlinks it,
47:22
it will continue to use space in the file system, and as soon as your VMD exits it's gone. The problem with that is if you want to have users run an ephemeral VM off a base image owned by root, or something where you don't necessarily have write permission for the image
47:40
or the directory containing it. There's a whole bunch of touchy questions around it that are very bikeshed-worthy, which is why I didn't end up doing it yet. I guess I'll probably end up reusing your code for that. We have control over that. We can add more flags.
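The open-then-unlink trick being debated here can be shown in a few lines. A Python sketch (in the real thing it would be vmd opening the derived image and unlinking it; the temp file is a stand-in):

```python
# Python sketch of the disposable-disk idea: open the image, unlink it
# right away, keep using the descriptor. The name disappears immediately;
# the blocks are only freed when the last opener closes the fd.
import os
import tempfile

fd, path = tempfile.mkstemp()        # stand-in for the derived image
os.unlink(path)                      # directory entry gone at once
assert not os.path.exists(path)

os.write(fd, b"scratch data")        # the open descriptor still works
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 12)
print(data)                          # b'scratch data'
os.close(fd)                         # now the space is actually reclaimed
```

This is exactly the /tmp caveat raised in the discussion: the unlinked file is invisible but still occupies space on whatever filesystem it was created on, right up until the last descriptor is closed.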
48:04
Well, as we can see from Git's UI, it clearly works very well. Oh yeah, absolutely. There's nothing interesting going on there. Any other questions, comments, thoughts, complaints?