
composefs: An opportunistically sharing verified image filesystem


Formal Metadata

Title
composefs: An opportunistically sharing verified image filesystem
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Recently we posted patches on the LKML for a new filesystem called composefs. This is an image-based, read-only filesystem with opportunistic file sharing and fs-verity based verification. This presentation will give a short demonstration of how to use it and present the use cases we wish to solve. We hope to get feedback from interested users and Linux filesystem developers.
Transcript: English (auto-generated)
So, we're ready for our next talk. Alex is going to talk about a new file system that they're proposing, composefs, an opportunistically sharing verified image file system. Thank you.

All right, thank you. Can you hear me? All right. I'm Alex. You may also know me from hits such as Flatpak, Flathub, GNOME, GTK and all sorts of other stuff, but this is my first kernel file system, which I proposed on the list a couple of months ago. It's not really a general-purpose file system; it's targeted at read-only images, of which you would typically have many running on a system, maybe on a container host, or, in my case, my primary concern is the OSTree verified-boot use case. So, rather than talking about composefs first, I'm going to talk about OSTree, because it explains where this comes from.
So, in OSTree we have images. Normally the images are not this simple, but actually the full root file system that you want to boot. Still, they're just a bunch of files, with metadata, permissions, names and whatnot. So, they're basically images. And we have this repository format, which is the core format of OSTree. What we do is take all the regular files in the image and hash them, and we store them under their hash name in the repository. So if you look at any of those files, it's just the same file, named after its own content. Then we take every directory, such as the subdirectory here, and make a small descriptor of it: the names of the files in it, their permissions and whatnot, and a reference to each file's content by its checksum. We do the same for the root directory, and this time we refer to the subdirectory by the checksum of its descriptor. Finally we add a commit description, which has a pointer, meaning the checksum, to the root directory, a parent pointer (it has no parent here, because this is the first commit), and some metadata.
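To make the content-addressing concrete, here is a toy sketch of how such a repository could be populated. It is illustrative only: the layout and the JSON descriptor format here are invented for the example and are not OSTree's actual object format.

    import hashlib, json, os

    REPO = "repo/objects"

    def store_blob(data: bytes) -> str:
        # Content-addressed store: the object's name is the hash of its bytes.
        digest = hashlib.sha256(data).hexdigest()
        path = os.path.join(REPO, digest[:2], digest[2:])
        if not os.path.exists(path):  # identical content is only stored once
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)
        return digest

    def store_file(src: str) -> str:
        with open(src, "rb") as f:
            return store_blob(f.read())

    def store_tree(directory: str) -> str:
        # A directory descriptor: names, modes, and the checksums they refer to.
        entries = []
        for name in sorted(os.listdir(directory)):
            full = os.path.join(directory, name)
            mode = os.lstat(full).st_mode
            if os.path.isdir(full):
                entries.append({"name": name, "mode": mode, "tree": store_tree(full)})
            else:
                entries.append({"name": name, "mode": mode, "blob": store_file(full)})
        return store_blob(json.dumps(entries, sort_keys=True).encode())

A commit object is then just one more blob that records the root tree's checksum, the parent commit's checksum and some metadata.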
Then we add a refs file, which is basically just a file that says we have a branch called "image", and this is its current head. If anyone thinks this looks like the .git directory in one of your checkouts, that's true: it's basically Git for operating systems. There are some details in exactly how the object files are stored, but the overall structure is Git; it's essentially a copy of Git. You can see it even more clearly if you create a new commit: in the new version we added a README, so all we have to do is add that file, the new root directory, and a new commit that points to the previous one, and then we update the ref to the latest head. So basically we implement Git for large binary trees. But you can't use this directly; you can't boot from a repository like this.
So what you do is deploy, as we call it, when you want to run a new version of something. Typically you have a pre-existing version, so you download the new version of the thing you want to run, which is very simple because you can just iterate over this recursive description of the image, and whenever you hit a reference to an object you have already downloaded, you can stop, because recursively you know you already have everything below it. So it's very efficient to get the new version. Then we create a deploy directory, which is basically a hard-link farm that points back into the objects, the regular file objects. We create the directories with the right permissions and whatnot, and whenever there's a regular file we just point it at the same file, using a hard link to the one in the repository. Then we set some boot configuration that names this particular commit, and therefore this directory; somewhere in the initrd we find the directory, bind-mount it read-only on top of the root, and boot into that.
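A minimal sketch of that check-out step, reusing the toy repository (and the REPO constant) from the earlier snippet. Real OSTree handles symlinks, xattrs and much more, and stores file objects with their final permissions, so hard-linking them is enough:

    import json, os

    def checkout(tree_digest: str, target: str) -> None:
        # Recreate a directory tree as a hard-link farm into the object store.
        os.makedirs(target, exist_ok=True)
        with open(os.path.join(REPO, tree_digest[:2], tree_digest[2:])) as f:
            entries = json.load(f)
        for entry in entries:
            dest = os.path.join(target, entry["name"])
            if "tree" in entry:
                checkout(entry["tree"], dest)
                os.chmod(dest, entry["mode"] & 0o7777)
            else:
                # Hard link: the deployed file and the repository object share
                # storage (they must live on the same file system for this).
                blob = entry["blob"]
                os.link(os.path.join(REPO, blob[:2], blob[2:]), dest)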
There are some clear advantages to this over the alternative, which is the traditional A/B block device setup: you have two block devices, you flash the new image to B, and then you boot into B. First of all, it's very easy to do efficient downloads, and deltas are very small. You can easily store however many versions of things you want, whether they're related or not: multiple versions of the same branch, where you keep the last ten or whatever, plus completely unrelated things, so you can have both Fedora and RHEL or Debian or whatever, and you can switch between them atomically. All updates are atomic; we never modify an existing thing that you're running, we always create a new deploy directory and boot into that. And the format is very verifiable: it recursively describes itself, and all you need is the signature, and there's a GPG signature on the commit object. So if you trust the commit object, you trust the root hash, and you trust the hashes of the subdirectories and the files and whatnot. The problem that I want to address is that this doesn't do runtime verification.
We verify when we download things, and we can verify when we deploy, or rather, the fact that we're deploying causes us to verify things. But if at some point after that something modifies the files, say a random bit flip on the disk or a malicious, evil-maid-style attack, someone could easily remove or modify a file in the deploy directory. To protect against this, the kernel has two features: dm-verity and fs-verity. dm-verity is what you use in the typical A/B image system, because it's block-based, but it gives you a completely read-only block device; there's no way to do OSTree-like updates to that file system, you just cannot write to it. The other one is fs-verity, and fs-verity matches the OSTree repository format very well, because if you enable fs-verity on a particular file, it essentially makes it immutable, and immutable is exactly what these content-addressed files are, so that's good. The problem is that fs-verity doesn't go all the way; it only protects the content of a file. You can still make it setuid, or replace it with a different file that has a different fs-verity digest, or just add a new file, or whatever. It doesn't protect structure. That's why composefs was created: to add another layer that covers the structure.
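For reference, enabling and checking fs-verity on a file can be done with the fsverity tool from fsverity-utils; a small sketch (treat the exact invocation and output parsing as assumptions to check against your version of the tool):

    import subprocess

    def enable_verity(path: str) -> None:
        # Seals the file: its data becomes immutable and gets a measured digest.
        subprocess.run(["fsverity", "enable", path], check=True)

    def measure_verity(path: str) -> str:
        # Output looks like "sha256:<digest> <path>"; return the digest part.
        out = subprocess.run(["fsverity", "measure", path], check=True,
                             capture_output=True, text=True).stdout
        return out.split()[0]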
Now I'm going away from the OSTree use case for a moment; this is the native way to use composefs. You just have a directory with some data, the same kind of example I had for the repository format, and you run mkcomposefs on that directory. It creates an image file that contains all the metadata for the structure of the thing, plus an objects directory, which is just copies of the files, stored by their fs-verity digest. And these are obviously plain files: you can cat them, and they're just regular files with the same contents. They're really pure data files; they don't have things like the executable bit or complex metadata such as extended attributes, they're just regular files with content. Then you mount the thing using composefs, pointing it at the objects directory, and you get a reproduction of the original image that you can look at. Whenever you cat a file, it does overlayfs-style stacking and reads the backing file, so everything always comes from the page cache; and the actual mount is not a loopback mount either, we just do stacking-style direct access to the descriptor from the page cache. So that gives you the general ability to reproduce the image, but to get complete structural verification, you use fs-verity on the descriptor itself.
If you enable fs-verity on the descriptor, that makes it immutable, so the file can't change on the file system, at least the kernel API doesn't allow you to, and if it's somehow modified on disk anyway, that will be detected. And you can see I actually passed the expected digest at mount time: before it does any I/O, composefs starts by verifying that the descriptor actually has the expected fs-verity digest. If so, we can rely on the kernel to do basically all the verification for us, because the metadata records the expected fs-verity digest of every backing file. So if you replace something, or there's a random bit flip, it will be detected.
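Putting the native flow together, roughly. This is a sketch based on the description in the talk; the mkcomposefs flag and the composefs mount options used here are assumptions, so check them against the actual tooling and kernel patches:

    import subprocess

    def build_and_mount(srcdir, image, objects, mountpoint):
        # 1. Build the descriptor plus the content-addressed objects directory
        #    (--digest-store is an assumed name for the objects-directory flag).
        subprocess.run(["mkcomposefs", "--digest-store=" + objects, srcdir, image],
                       check=True)
        # 2. Seal the descriptor and record its fs-verity digest (the trusted root).
        subprocess.run(["fsverity", "enable", image], check=True)
        out = subprocess.run(["fsverity", "measure", image], check=True,
                             capture_output=True, text=True).stdout
        digest = out.split()[0].split(":")[-1]
        # 3. Mount, passing the objects directory and the expected digest so a
        #    tampered descriptor is refused (option names and format assumed).
        subprocess.run(["mount", "-t", "composefs", image, mountpoint,
                        "-o", f"basedir={objects},digest={digest}"], check=True)
        return digest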
And the descriptor itself is actually very simple. This is not a traditional file system where we have to update things at runtime; we can just compute a very simple descriptor. It's basically a fixed-size header followed by a table of fixed-size inode records (if the file system has n inodes, there are n copies of that structure), and some of them point into the variable-size data at the end, which we find via a vdata offset in the header. And that's basically all there is to it. Inode zero is the root inode; you look at that, and if it's of type directory, its variable data points to a table of dirents, a pre-sorted table of entries plus names that you can binary-search. That gives you a new inode number, you look at that offset, and all of this is done by mapping the page cache directly. So it's very simple in terms of structure.
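As an illustration of that kind of layout. The field names, sizes and ordering below are invented for the example and are not the actual composefs on-disk format:

    import struct

    # Hypothetical flat image: header, then N fixed-size inode records,
    # then a variable-size data section (names, dirent tables, xattrs, ...).
    HEADER = struct.Struct("<4sIIQ")   # magic, version, n_inodes, vdata_offset
    INODE  = struct.Struct("<IIQQQ")   # mode, nlink, size, vdata_off, vdata_len

    def read_inode(image: bytes, index: int):
        magic, version, n_inodes, vdata_offset = HEADER.unpack_from(image, 0)
        assert index < n_inodes
        mode, nlink, size, off, length = INODE.unpack_from(
            image, HEADER.size + index * INODE.size)
        # The variable-size payload (e.g. a pre-sorted dirent table for a
        # directory) can be binary-searched in place, straight from the mapping.
        payload = image[vdata_offset + off : vdata_offset + off + length]
        return mode, size, payload

Inode zero would be the root, and a directory lookup is just a binary search over its payload; nothing is mutated at runtime, so the whole image can be served read-only from the page cache.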
If you want to use this with OSTree, it's slightly different. We don't want to take the OSTree repository, create a checkout directory, and then run mkcomposefs on it. Instead we ship a library, libcomposefs: you link OSTree against it, and it can generate these images directly from the metadata that already exists in the repository. So we don't have to do any expensive I/O to create the images; it's just metadata, it's not very large, you can keep it in memory, generate the image, optimize it, and write out a single file. And it's flexible enough that we can keep using the existing repository as the store for the backing files.
It's also designed so that image generation is standardized: every time you create a new image based on the same OSTree commit, you get the exact same binary file, bit by bit. So what you do is, when you create a commit on the server, you generate one of these images, take its digest, and put the digest in the commit itself. There's no need to extend the OSTree format on the network or anything: when you deploy a commit, instead of making the hard-link farm, you recreate one of these images locally, and you use the digest supplied in the commit as the expected digest when you mount it. So if anything anywhere went wrong or was attacked or whatever, it will refuse to mount. Obviously you have to put that trusted digest somewhere in your secure boot stack or whatever; something has to be trusted, but that's outside the scope of composefs, and it's very similar to what you would do with dm-verity in a pure image-based system.
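Because the image is reproducible, the deploying side can also double-check the regenerated file before handing it to the kernel. A small sketch using the fsverity digest subcommand from fsverity-utils, where the expected value would come from the signed commit:

    import subprocess

    def verify_image_digest(image: str, expected: str) -> None:
        # Compute the fs-verity digest in userspace and compare it with the
        # trusted digest shipped inside the commit metadata.
        out = subprocess.run(["fsverity", "digest", image], check=True,
                             capture_output=True, text=True).stdout
        actual = out.split()[0].split(":")[-1]
        if actual != expected.split(":")[-1]:
            raise RuntimeError(f"composefs image digest mismatch: {actual}")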
Another interesting use case is containers. Giuseppe, who is not here today but is one of the other developers behind composefs and one of the Podman developers, wants to use this for containers, because containers are also image-based. It would be nice to get what I call opportunistic sharing of files: with layers you can sort of share things between different containers, but you have to manually make sure you do the right thing, whereas with this opportunistic style of sharing, whenever you happen to have an identical file, it automatically gets shared, both on disk and in the page cache, because of the way this works. We also don't want to change the container format. There was a talk yesterday about using dm-verity and SquashFS, not for sharing, but for a similar way of mounting an image; that forces all users to create a new form of container, whereas we want to allow this for all existing tarball-based, layered OCI images. An OCI image is basically a list of tarballs that you extract in order and then mount using overlayfs. There is an extension of this called eStargz, which is a somewhat weird hack where you put an index at the end of the gzip stream; you can then use partial HTTP downloads to fetch just the index, see which parts of the layer you already have, and issue HTTP range GETs to download only the parts you need. So if you happen to have one of those archives in your layers, we can, in combination with the locally stored content store, avoid downloading the parts we don't need. If you don't have them, we have to download everything, which is what we do now, but we can do better. Even then, you can still hash the files locally and get all the sharing.
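The partial-download trick is plain HTTP range requests. A minimal sketch that fetches the tail of a layer blob to look for such an index; the assumption that the index sits in the last N bytes is illustrative, since the real eStargz footer has its own format:

    import urllib.request

    def fetch_tail(url: str, n: int = 65536) -> bytes:
        # Ask the registry/CDN for only the last n bytes of the blob.
        req = urllib.request.Request(url, headers={"Range": f"bytes=-{n}"})
        with urllib.request.urlopen(req) as resp:
            # 206 Partial Content means the server honoured the range request.
            assert resp.status in (200, 206)
            return resp.read()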
Then you combine this with generating a composefs for the entire image. This is for the local storage of images: instead of storing the extracted layers, you store the content store repository plus these generated composefs images, and whenever you run a container, you just mount it and it goes. It's also nice that you can easily combine all the layers. If you have a ten-layer container and you want to resolve libc, which is in the base layer, you normally have to do a negative lookup in every layer before you reach the bottom-most one. But since the image is just metadata, it's very cheap to create a completely squashed composefs image, so every lookup only touches a single layer.
I don't know if anyone is following the list, but there are some discussions about this; I'm trying to get it merged upstream, and one alternative has appeared: there are ways you can use some overlayfs features to get similar behaviour. If you use the not-so-well-known or well-documented features called redirect and metacopy, you can create an overlayfs layer that essentially says "here is the metadata for this file, redirected to a different path", where that path would be the content-addressed name in the lower layer; and then you can use some kind of read-only file system for the upper layer, where you store all these files that just carry the extended attributes describing the structure. For that upper layer, right now EROFS is probably the best approach, so you can sort of build this out of overlayfs plus EROFS. Unfortunately, that doesn't do the verification: you can use fs-verity on the read-only file system itself, but you need some kind of extension to overlayfs to allow recording the expected fs-verity digest of each backing file. That does seem like a fairly trivial thing, though.
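Roughly what such an overlayfs-based setup looks like, as a sketch only: in the proposal being discussed the upper layer would actually be a read-only EROFS image containing these stub files, the exact set of trusted.overlay.* xattrs and mount options depends on the kernel version, and the missing fs-verity check is exactly the gap mentioned above.

    import os, subprocess

    def make_redirect_stub(upper: str, name: str, object_path: str, mode: int) -> None:
        # An empty file in the upper layer that carries only metadata; the xattrs
        # redirect reads to the content-addressed object in the lower layer.
        stub = os.path.join(upper, name)
        open(stub, "w").close()
        os.chmod(stub, mode)
        os.setxattr(stub, "trusted.overlay.metacopy", b"")
        os.setxattr(stub, "trusted.overlay.redirect", object_path.encode())

    def mount_overlay(lower: str, upper: str, work: str, mountpoint: str) -> None:
        # metacopy=on lets an upper entry supply metadata only, with file data
        # coming from the redirect target in the lower (content store) layer.
        opts = (f"lowerdir={lower},upperdir={upper},workdir={work},"
                "metacopy=on,redirect_dir=on")
        subprocess.run(["mount", "-t", "overlay", "overlay", "-o", opts,
                        mountpoint], check=True)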
The less trivial part, and this is where opinions on the list vary, is that I think this combination of things is way more complicated than the simple approach: composefs is, I think, about 1,800 lines of code, and it's very direct. It doesn't use loopback devices or device-mapper devices. When you do a lookup of a particular file in this combined overlayfs setup, you do a lookup in the overlay layer, in the read-only file system, and in all the backing layers. So there are something like four times more inodes around, four times more dcache lookups, and it just uses more memory and performs worse. So I ran some simple benchmarks, and some people complain about the measurements here: I'm just comparing a recursive find or an ls -lR, which basically measures the performance of lookups and readdir. But on the other hand, that's all composefs does; all the actual I/O performance is left to the backing file system.
Wherever you store your real files, that's where streaming performance and things like that come from. I'm personally working on the automotive use case right now, so we have very harsh requirements on cold boot performance, which means the cold-cache numbers are very important to me. This is recursively listing a large developer snapshot, a roughly three-gigabyte CentOS Stream 9 image; not necessarily an operation you would do in practice, but looking at the numbers, the recursive listing is more than three times slower in the cold-cache case, because it has to do multiple lookups. And even in the cached case, where most things should come from the dcache anyway (I think I've seen better numbers than this), there's at least a 10% difference in the warm-cache situation. I hope that was useful to someone.
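A sketch of that kind of cold/warm-cache lookup benchmark (dropping the caches needs root, and the path is a placeholder for the mounted image):

    import subprocess, time

    def recursive_listing(path: str) -> float:
        start = time.monotonic()
        subprocess.run(["find", path, "-ls"], check=True, stdout=subprocess.DEVNULL)
        return time.monotonic() - start

    def bench(path: str) -> None:
        # Cold cache: drop page cache, dentries and inodes first (needs root).
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")
        cold = recursive_listing(path)
        # Warm cache: run again, now served mostly from the dcache/page cache.
        warm = recursive_listing(path)
        print(f"cold: {cold:.2f}s  warm: {warm:.2f}s")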
Yeah, we have some time for questions.

You said about halfway through that one of the goals was to keep reading the OCI image format, but I think everybody pretty much agrees the OCI image format is crap for lazy pulling of container images, basically because it has an end-to-end hash: you can't verify the hash until you've pulled the whole image, which means signatures are completely rubbish for this anyway. To fix that we have to do a Merkle tree or something else, so the image format is going to have to change radically into something much more suitable for your use case. So I think keeping image compatibility, which is partly the argument for this versus the in-kernel alternatives, is not going to be a good argument, and I think you should consider that.

I agree and I don't agree. I'm not a fan of OCI; I've been part of the OCI specification team for a bit, and I used to be one of the Docker maintainers a long time ago. It is not nice, but it is what we have, and it's everywhere. It's easy, as a developer sitting around, to think "this is bullshit, we should just fix it", but there are trillions of dollars invested in existing containers, and replacing that is going to take a long time. And even when we replace it, this will still do the right thing. There are discussions of an OCI v2; I don't follow them, because the whole thing is bullshit. But even then, if we just had a better way to get partial updates for an image, you could still use this.
Before taking the next question, I'm obliged to point out from the chat that these performance numbers are from before optimising overlayfs and EROFS.

Yeah, there's been some work on that, and there are ideas to make the overlayfs stack work better.

Would another question be possible, maybe? I actually still had a question, here in the back.
Well, it's not really a question, more a remark. I think there's one missing slide in your deck, namely a use case you haven't considered at all but which is really worth calling out. Many remote build systems, such as Goma, Bazel, et cetera, are nowadays converging on a single remote execution protocol called REv2, and that one also uses a CAS as its data store, both for input files (compiler binaries, source files, header files) and for output files, the object files that get generated. I actually maintain one of the implementations, and one of the hard parts of implementing such a build cluster is instantiating the data stored in the CAS in the form of an input root on disk, where you can just run a compiler against certain source files. A tool like composefs would really help in such an implementation. That's just something I wanted to call out; you should really also market it towards those kinds of use cases.
That makes a lot of sense. Yeah, I'm sure images are used for all sorts of things, and there are many use cases beyond the ones I've mainly focused on.

Okay, since the Q&A ended a bit early and the next talk is going to be a recording, that gives us a bit more time to prepare. Thank you very much for the talk, and thank you for all the questions and for being here.