
Boot2container: An initramfs for reproducible infrastructures


Formal Metadata

Title
Boot2container: An initramfs for reproducible infrastructures
Subtitle
Who needs host OSes for containers anyway?
Series Title
Number of Parts
287
Author
Contributors
License
CC Attribution 2.0 Belgium:
You may use, modify, and reproduce, distribute, and make the work or its contents publicly available in unchanged or modified form for any legal purpose, provided that the author/rights holder is credited in the manner specified by them.
Identifiers
Publisher
Release Year
Language

Content Metadata

Subject Area
Genre
Abstract
Fed up with managing your host OS for your docker environment? Try booting your containers directly from a lightweight initramfs! Flash a USB pendrive with the kernel and initramfs, or netboot it locally or from the internet, and configure it from the kernel command line. Bonus: it also supports syncing volumes with S3-compatible cloud storage, making provisioning and back-ups a breeze!

Containers have been an effective way to share reproducible environments for services, CI pipelines, or even user applications. In the high-availability world, orchestration can then be used to run multiple instances of the same service. However, if your goal is to run these containers on your local machines, you would first need to provision them with an operating system capable of connecting to the internet and then downloading, extracting, and running the containers. This operating system would then need to be kept up to date across all your machines, which is error-prone and can lead to subtle differences in the run environment that may impact your services. In order to lower this maintenance cost and improve the reproducibility of the run environment, it would be best if we could drop this operating system and directly boot the containers you want to run. With newer versions of podman, it is even painless to run systemd as the entrypoint, so why not create an initramfs that would perform the simple duty of connecting to the internet and downloading a "root" container which can be shared between all the machines? If the size could be kept reasonable, both the kernel and initramfs could then be downloaded at boot time via iPXE, either locally via PXE or from the internet. It is with this line of reasoning that we started working on a new project called boot2container, which receives its configuration via the kernel command line and constructs a pipeline of containers. Additionally, we added support for volumes, optionally synced with any S3-compatible cloud storage. This project was then used in a bare-metal CI, both for the test machines and the gateways connecting them to the outside world. There, boot2container helps to provide the much-needed reproducibility of the test environment while also making it extremely easy to replicate this infrastructure in multiple locations to maximize availability.
Transcript: English (automatically generated)
Hi everyone, thanks for tuning in for my boot2container presentation, which is an initramfs that does exactly what it says on the tin. Since I'm not known in this community, let me introduce myself.
My name is Martin Roukala, I am mostly active in the graphics subsystem, where I am mostly known under the nickname Mupuf, or by my premarital name, Martin Peres. I am now a freelancer working at Mupuf TMI and a Valve contractor.
So my mission is to create a production-ready upstream Linux graphics driver. What does that mean? It's a lot to unpack, so let's focus on parts of it. First, graphics drivers. Well, the point is, of course, to have nice-looking games, high FPS and low latency.
And I don't know if you have looked at the size of the Linux GPU kernel drivers, but they are enormous, and their complexity is insane. Okay, so we've got this; then we've got production-ready, which is more about the user's point of view.
So from the user's point of view, the driver has to be usable, so basically fit the needs of the user; then it has to be reliable, so every time they try to use it, it works; and the same goes for available. Now, upstream Linux. Well, upstream is where development is happening.
So from this point of view, you're going to have the best compatibility for games and GPUs, and also the best performance. But you get the worst reliability, because of course some changes create regressions that have not been caught by users yet. So on the bleeding edge, you might be bleeding a bit.
So how do we actually make upstream work? Because there is a contradiction: upstream Linux is the least reliable, and yet we want something reliable. And GPUs are also complex beasts, so it's impossible to test everything.
Well, I don't think it has to be a contradiction, because we can use automated testing to help with this. People might be wondering why I'm talking about this when the topic is boot2container; it's coming, I'm just explaining why I need it so badly.
So automated testing for the graphics subsystem is very, very tricky, because every graphics component needs its own test environment. There's the kernel, there is the 3D driver, there is the display driver, there is the windowing system.
There are so many components, and all of them need to be tested, as does the translation layer between DirectX and Vulkan, something that is really important just for running games. On top of this, the test suites that we have are enormous: for instance, the one for Vulkan is getting closer to 1 million unit tests.
Then the games are even harder to test, because they are designed for users, not for automated testing. And test results need to be stable and reproducible by developers, whether there is a problem or not.
But I guess what matters is mostly when there is a problem so that they can just debug it. And developers also need feedback as soon as possible. So when they make a patch, they want to test it and they don't want to wait a month to get the results. So they need results in a matter of hours.
And the problem is that with all this test content, if we were to use only a single machine, we would get six hours of runtime. So we need tens of machines. And since they're going to be running unreliable kernels, and GPUs are notoriously happy to crash your system
if you look at them the wrong way, you get some very interesting problems for automated testing. So how do we make a CI system that is able to deal with all of this? Well, I mean, of course, there are a lot of issues.
But what really matters in the end is creating blocks that have a very, very good interface. The way I would say that a component is good is when you can take it out of the CI system and use it in many other places.
Basically, if I had another use case that would need something similar, I would want to use this component rather than having to reinvent my own or a second one. So the point really is that the interfaces need to be so versatile that they just solve the problem nicely.
So since it is a bit difficult to explain, let's take an example that is actually closer to the containers devroom. So, case study: creation and deployment of the test environment.
How do we generate the test environment? Well, there are two ways. There is the traditional way from the embedded world, which is generating a rootfs.
Or you've got the OCI containers way, which is mostly found in the web world, at least that's my understanding, and for this kind of testing it's very, very good. A rootfs can be created using Yocto, Buildroot, debos or any other system like this.
Whereas containers are usually created using Docker, Podman, Buildah or something else. A rootfs is a full disk image, so from this point of view it is self-contained. But that also means that if you want to update it,
it is much slower, because you need to send a full image, unless you're using casync, but let's not get there. And it is also not as portable: if you have a rootfs working for a particular machine, moving it to another one is not going to work nicely.
It's just like on your desktop machine: if you change your system completely and try to reuse the initramfs that was created for your old machine, it's likely not going to find your root partition. On the contrary, on the container side, the problem is that it requires platform setup.
That means that you cannot just take a machine, boot it, and boot directly into a container. You first need platform setup, like for instance the network or the disks.
But the benefit is that it is faster to deploy, because the base OS is already cached in the layers, so only the layers that changed need to be downloaded, and hopefully that's a small amount. And then you have high portability, because containers have been designed for this from the get-go.
It means the same container can be run everywhere, on the same architecture, of course. Now, if we go back to the concept of interfaces, the rootfs does two things: it is the platform setup and it is a shared test environment for all the test suites.
That means that if I had another project or component that I needed to test, I would probably just duplicate the code that was there for one component, copy-paste it for another one, and make the changes there.
It's not wonderful. Now, on the container side, containers provide an isolated test environment for every test suite. That means they are composable: we can run one and then the other, and it's just as if each of them ran for the first time after booting.
Of course, unless you crashed your kernel or your hardware, but that's a separate thing. Now, the question is, as I was saying before, how do we start a container then? Because the container requires platform initialization.
So, do we need to make a new rootfs for this? Well, as I alluded to at the beginning, no. I've been working on a project called boot2container, which is a small initramfs that you configure using the kernel command line.
And it has some nice features. First, it has some network services: it's going to get an IP from a DHCP server, so you get access to the internet, and it will also synchronize the time, so you're not out of sync with the rest of the world.
Then it also allows you to have a cache drive, so you don't have to re-download the same layers all the time. This cache drive can be auto-selected and auto-formatted, and you can have a swap file.
So, if you run out of RAM, you can say, well, now create a 4 GiB swap file, and it's going to use that. Very simple. And finally, there's support for volumes, which are just like Docker or Podman volumes.
They're used to share data between containers, but in our case we can also provision the volume on startup, or whenever you want. And it is provisioned using an S3-compatible storage, so cloud storage,
Backblaze, anything like this. So, this is pretty nice. Then we can have the volumes encrypted using fscrypt, which is nice if you have some jobs that need to run and store some very big files that have to stay private,
so that other jobs running on the same machine later on would not have access to them. But if you have the key to decrypt the folder, then you get access to the files without needing to redownload potentially a terabyte of data.
And finally, you can specify an expiration policy. So far, the only options we have are either keeping the volume after the machine stops, or destroying everything at the end. So, if it's a temporary job that specifies a volume,
then you can say: just delete it at the end. Okay, and finally, boot2container is ready for multiple architectures. It is based on u-root, which is written in Go, and on Podman, again written in Go. There are some C programs in there,
but they're very tiny and have no dependencies, and the ones that do have dependencies I take from Alpine, which has support for a lot of architectures. Most of the rest is written in shell script. So, how do we use boot2container?
Well, you can use it directly, or netbooted. If you want to use it directly, here is an example using QEMU: you just specify the kernel and the initramfs, so boot2container, then we set the kernel command line to console=ttyS0,
which means just draw on my console, and I pass -nographic so I don't have a separate window starting. Then I set b2c.container, which means start this container; it's going to be interactive, and it's going to be Alpine.
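For illustration, a direct invocation along these lines should be close to what is shown on the slide; this is only a minimal sketch, where the kernel and initramfs file names and the exact b2c.container syntax are assumptions, so check the boot2container documentation for the precise spelling:

    # Boot boot2container in QEMU and start an interactive Alpine container.
    # File names and the b2c.container value are illustrative placeholders.
    qemu-system-x86_64 \
        -nographic \
        -m 512M \
        -kernel ./linux-x86_64 \
        -initrd ./initramfs.linux_amd64.cpio.xz \
        -append 'console=ttyS0 b2c.container="-ti docker://alpine"'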
That's it. So, if you don't want to run it like this, on top of an existing host OS, but actually want to boot real bare metal, then you can also use your favorite bootloader, like GRUB, U-Boot or anything else. But you can also netboot using PXE and HTTP
for machines inside a trusted local network, because PXE is not exactly secure. If you want something that is secure against man-in-the-middle attacks, for instance if you want to boot over the internet, or get your configuration, initramfs and all this through HTTPS, then you can use iPXE. This is great for standalone machines on the other side of the planet.
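As a rough sketch of the netboot case, an iPXE script could look something like the following; the URLs and file names are placeholders, not the project's hosting locations:

    #!ipxe
    # Fetch the kernel and the boot2container initramfs over HTTPS, then boot.
    # URLs, file names and the b2c.container value are placeholders.
    dhcp
    kernel https://example.com/b2c/linux-x86_64 console=ttyS0 b2c.container="-ti docker://alpine"
    initrd https://example.com/b2c/initramfs.linux_amd64.cpio.xz
    boot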
Yeah, that's it. So, I'm going to make a quick demo to show how it works. The demo has been set up like this: first I downloaded boot2container, then I downloaded the kernel associated with this release.
That makes it easy to test, but you can provide your own kernel: the kernel options needed to make a kernel compatible with boot2container are all documented. The only quirk is that you cannot have modules; everything has to be built in.
But this is something that is going to be addressed in the future. Then I allocate the drive, which is a one-gigabyte drive, super simple, and I start QEMU. I say that I want to use the disk, I want to use this kernel,
I want to use boot2container, I want to have a cache device, so pick any drive that is there (well, there's only one, so it's an easy job for it), I want to get the time at boot, and then I want to start Alpine,
again in an interactive shell. So, here we go: the command line is here, let's start it, and let me scroll back up.
Here we have, very simply, a Linux console. If we scroll down until init has started, we see u-root written in big letters, so very good. Then we see some runtime information about the machine: it's a QEMU virtual CPU for x86-64,
we have 358 megs of RAM, and one gig of storage. Then, what we can see here is that it tried to find a cache partition on the machine, but since it didn't find one, it's going to create one on the disk,
which is /dev/vda. To create it, it just partitions the drive, creates an ext4 partition, and formats it. That's simple. Then it says: well, the cache partition on /dev/vda is mounted as the cache, very good.
Then it connects to the network: it finds one network interface, brings it up, and gets a lease. Then it synchronizes the clock using pool.ntp.org; it gets that in two seconds.
And finally it, well, starts the container: it has first been pulling the container here, that worked, and now it's ready. So, if we do an apk update, for instance,
it works. The first boot is a bit slow, as you can see: it took 20 seconds before running the actual container, but subsequent boots are going to be much faster. So, I'm just exiting. Okay, and then starting again,
and since it's not going to have to format anything this time, it's going to be much faster. See, it took only 8 seconds this time. And, again, everything is working. Okay, then that's the end of the demo.
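To recap the demo setup, an invocation along the following lines would reproduce it; the b2c.cache_device and b2c.ntp_peer option names are written from memory and the file names are placeholders, so treat this as a sketch rather than the exact commands used:

    # Create a 1 GB disk to serve as the boot2container cache device, then boot
    # with the cache and NTP sync enabled and run an interactive Alpine shell.
    # b2c.* option spellings and file names are illustrative, not authoritative.
    qemu-img create -f qcow2 disk.img 1G
    qemu-system-x86_64 \
        -nographic \
        -m 384M \
        -drive file=disk.img,format=qcow2,if=virtio \
        -kernel ./linux-x86_64 \
        -initrd ./initramfs.linux_amd64.cpio.xz \
        -append 'console=ttyS0 b2c.cache_device=auto b2c.ntp_peer=auto b2c.container="-ti docker://alpine"'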
So, what if we want a real-world idea of how it's going to look? Well, here is one. These commands you've already seen. Then this part is saying: hey, I would like to register a MinIO S3-compatible storage.
Here I just put these as variables, because, well, they shouldn't be hard-coded. Then I want to create a volume that is called job, and I want to mirror it from the MinIO instance job, which is maybe a bad name,
and then the name of the bucket there that is going to be set for the job. I want the container to pull what is in this bucket when the pipeline starts; pipeline, because you can run multiple containers one after the other, so here it's basically at boot.
We are asking boot2container to push to the bucket every change that is done locally, which means that if the machine dies, we'll still have the last updated state. And we say that at the end of the pipeline, we want the volume to be deleted. Then we have two b2c.container calls.
The first one verifies that the machine has not changed since the last time we booted it: we have a database, and we can verify that no hardware has changed. The second one calls IGT, which is a kernel test suite for the graphics subsystem.
And we just say: I want to mount the job volume at /results, then we call the IGT runner and tell it to output the results to /results.
That means that as we run, the results are streamed to the bucket. And finally, we just use a serial console so we can see what is happening in real time. And that's it, nothing too interesting there.
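To give a flavour of what such a kernel command line can look like, here is a sketch; the b2c.minio, b2c.volume and b2c.container option names and their argument syntax are reconstructed from memory, and the URL, credentials, bucket and image names are placeholders, so refer to the boot2container documentation for the real syntax:

    # Sketch of a CI-style kernel command line (wrapped here for readability;
    # in reality it is a single line). All values are placeholders.
    console=ttyS0 b2c.cache_device=auto b2c.ntp_peer=auto
    # Register an S3/MinIO endpoint under the name "job"; the credentials come
    # from variables substituted by the orchestrator.
    b2c.minio="job,https://s3.example.com,$JOB_ACCESS_KEY,$JOB_SECRET_KEY"
    # Volume mirrored to the job bucket: pulled at pipeline start, pushed on
    # every local change, deleted when the pipeline ends.
    b2c.volume="job-vol,mirror=job/$JOB_BUCKET,pull_on=pipeline_start,push_on=changes,expiration=pipeline_end"
    # First container: check that the machine's hardware has not changed.
    b2c.container="docker://registry.example.com/machine-check:latest"
    # Second container: mount the volume at /results and run the IGT test suite,
    # streaming its results to the bucket as they are produced.
    b2c.container="-v job-vol:/results docker://registry.example.com/igt-runner:latest"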
So, what are the other use cases there could be for boot2container? One that I can see is having a fleet of automated systems that are either local to wherever you are, or deployed in remote places. Netbooting is feasible with boot2container,
because you only need to download about 50 megs for both the kernel and boot2container. Then there's only the initial download of the container layers, and every time we reboot after that, the layers are already there, so we don't redownload everything. Then every boot
behaves the same as if it were the first boot, which is great for testability and QA. That also means that we don't really need local IT, except if some hardware is misbehaving, in which case we can just replace it.
So, it's really plug and play. Examples of these deployments could be public transport screens, either in buses or at bus stops, or a chain of shops: it would mean that they don't need to maintain the machines, they just plug them into the internet, and everything gets downloaded.
Another use case is server provisioning in the cloud, but I'm sure they have their own systems, and I'm sure you also know more about this than me. So basically, if you have ideas about where it could be used, or if you have plans to use it, then please let me know.
Okay, so as a conclusion: our graphics CI needs were reproducibility of the results, of the test environment, and of the CI infrastructure. We needed reliability and simplicity, and having our own rootfs was not going in this direction,
because we would have needed too many. boot2container has delivered on these requirements and brought a bit more: it's super easy to deploy anywhere, either locally or remotely, as I was saying, and it has a low maintenance cost, because if you need to upgrade it, the only thing you need to do is bump the version.
It's that simple. As for future work, we're going to add support for the most common architectures, like ARM64, and anything else that is supported by Go and Alpine. Then we would like to replace the shell scripts
with code written in Go, and we would like to reduce the size of the initramfs by merging the different Go binaries, especially mcli and Podman, so they would not have duplicated code in memory. And, yeah, there are a couple more things,
but that's roughly it. Here are some links, if you're interested, and thanks for listening to me. I'm now going to be available for questions.
Well, I guess we won't know for another 10 seconds. I think we should just go for it. Wait one second.
Okay, I assume now everyone can see us. There is a big delay, so this needs a bit of getting used to. Okay, we are in the Q&A session. Thank you so much for your talk. That was really interesting,
and we have a few questions coming in. So the first one is by Daniel, and he wants to know: what do you use for your S3 access? I've been using mcli, the MinIO client, and that's it. I was wondering if I should make my own
or something like this to make it smaller, but mcli has been quite compressible, and my hope is that when I merge mcli and Podman into u-root, all the dependencies are going to be deduplicated, so it's not going to cost me anything. So that is my hope.
We have another question, and that's: what do you think of Rust? The network stack of Podman 4 has been rewritten in it, as opposed to C or, for example, Go. Well, I would have preferred if it had remained in Go, because, again, of the deduplication, but in boot2container right now
I only expose the host network, because we only run one container at a time, so it makes it a little useless to have more than one network. And if you want to make something clunkier, you can have a container starting multiple containers, which is actually what we do in our CI,
because we also use boot2container for our CI gateways. Cool. So after this, you can have the full Podman, and at this point I don't give a shit about the usage, because it's not re-downloaded every time. But otherwise, I mean, whatever floats upstream,
as long as it doesn't make everything too big. The next question is when is the data that is generated in the Docker volumes transferred to the S3 bucket, real-time or on shutdown of the container? So you have a lot of conditions. There is one that is called pipeline start.
So, okay, no, two things. You specify when you want to pull and when you want to push. And for both pull and push, you have these conditions. Pipeline starts, so that's at the beginning when booting, then container start because you can run multiple containers, one after the other, then container end, well, when it's done, so between stages of the pipeline,
or pipeline end. And then you have changes that's gonna be like mcli dash dash watch. So whenever there's a change that is local or remote, then it's gonna sync. Yeah. So you choose. There's no, I didn't want to hard code it for my use case.
Instead, I just specified it in the command line. That's the theme. I think we have two more questions that I currently see on the screen. Hopefully we can get through them. And afterwards, if you have more questions, you should join the private chat room. I think that opens up to the public and continue the discussion over there.
So, the next question is: do you do much QEMU testing with boot2container, PCI passthrough and so on? I did not; our objective has always been real machines. So we are using real x86 machines, and soon we'll add support for,
I mean, ARM and other things. So yeah, that's what it is. Okay. And how do you manage the kernel versions to test? So basically, this is something more related to orchestration,
what needs to be booted or not. And I can link you to the so-called valve-infra, so the CI, the Valve graphics CI infra. Basically, what you see with boot2container on the kernel command line is something expanded a bit more in a YAML file that shows
exactly how to deploy things. And I can show you after. Okay. I think we're going to be cut off in about four seconds. So thank you very much for the talk. Continue the Q&A session in the private chat room. Goodbye.