Containers without a Container Manager, with systemd
Formal Metadata
Title: Containers without a Container Manager, with systemd
Number of parts: 47
License: CC Attribution 3.0 Unported: You may use, adapt, copy, distribute and make the work or its content publicly accessible for any legal purpose, in unchanged or changed form, provided the author/rights holder is credited in the manner they specify.
Identifiers: 10.5446/37924 (DOI)
Transcript: English (automatically generated)
00:06
OK, then let's begin. I only got 30 minutes, so I'll try to be quick. I'm going to talk about one specific facet of systemd today, which is basically containers
00:20
without a container manager. I mean, there have been quite a few container talks at this conference already, so I'm basically trying to take the stuff that containers do and adapt it to normal system services directly. So it's kind of doing containers without actually
00:41
being a container manager. What are containers again? Lots of people have lots of different definitions. Like, I think the three most relevant parts of what a container is are resource bundling, right? You have this one tar ball or this SquashFS image or whatever you have, and that contains all your dependencies.
01:01
So you get rid of the dependencies by simply bundling everything together. There's always sandboxing involved, like namespacing and security, like seccomp and these kind of things. And there's an important component, which is delivery, where you can actually distribute it on your cluster. For this talk, I'm just going to focus on the first two,
01:21
resource bundling and sandboxing. And I'm going to talk a little bit about how you can do these two things without actually involving a container manager at all, but just by using systemd's own service management functionality. So let's jump right in. The first thing: resource bundling. In systemd, pretty much since its inception,
01:42
we have had this RootDirectory= setting. It's a one-to-one wrapper around chroot, right? chroot is the prototypical pseudo-containerization feature that Unix always had.
02:02
And yeah, it used to be semi-useful, and nowadays it's actually pretty useful; we'll come to that in more detail. What it ultimately does is invoke something with a chroot environment set up, so that basically everything that shows up as / there is not what the host sees.
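As a sketch of how this looks in a unit file — the service name and paths here are invented for illustration, not taken from the talk — RootDirectory= is used like this:

```ini
# /etc/systemd/system/myservice.service (hypothetical example)
[Unit]
Description=Example service chroot-ed into a bundled directory tree

[Service]
# Everything the service sees under / comes from this host directory
RootDirectory=/srv/bundles/myservice
# The binary path is resolved inside the new root, not on the host
ExecStart=/usr/bin/myservice-daemon

[Install]
WantedBy=multi-user.target
```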
02:22
Something that is much younger in systemd, and pretty closely related to this, is RootImage=. Where with RootDirectory= you specify a directory that shall be the root for that one specific service, with RootImage= you specify a disk image — a binary blob that contains a file system of some form. RootImage= is really useful.
02:41
So the images that you can specify there can be completely regular disk images that you could also pass to QEMU or something like that. They either need to be discoverable GPT — GPT being the partition table format — where by discoverable I mean that the partition types are properly
03:03
tagged as what they're actually used for, so that you can recognize, simply by looking at the partition table, which one is the root partition and which one is your home partition. It also supports unambiguous GPT or MBR, which is not discoverable. By unambiguous, I just mean that if you
03:20
have a partition table that contains one partition only, then it's pretty obvious that that's probably the root partition, right? So yeah, you can throw lots of different things at it. Either you avoid all the ambiguities, or you make it discoverable so that it's clear what it is. Or you can just point it at a raw file system, right?
03:40
Like, no partition table at all. You just generate something with mksquashfs — you just create a loop device and put a file system on it. That's fine, too. One tool to create these images is, of course, mkosi, but actually you can use whatever you like. You can just do debootstrap or yum, whatever you do. mkosi is a tool that I have been
04:00
working on in the past months. It's supposed to be a wrapper, ultimately, around debootstrap and dnf, but it has a couple of bells and whistles that make it a little bit nicer to use. For example, it can do cryptography for you, which is actually pretty interesting. This RootImage= setting in systemd unit files
04:20
can actually do cryptography for you as well, right? You can encrypt the images that you want to run there, and systemd will handle that properly. I think encryption is not that interesting for service management, but something closely related to it is, I think: dm-verity. For those who don't know, dm-verity is a system that protects file systems from modification.
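A minimal sketch of the RootImage= setting just described — the image path and service name are made up for illustration:

```ini
[Service]
# A regular GPT disk image with discoverable partition types,
# or a raw file system image (e.g. squashfs); systemd mounts it
# and uses it as the root file system for this service
RootImage=/srv/images/myservice.raw
ExecStart=/usr/bin/myservice-daemon
```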
04:41
It was created originally for the Chrome OS project because they wanted to make sure that offline modification of the Chromebooks is not possible, meaning that you can leave your laptop in some unsupervised area, and people cannot just take out the hard disk, modify it, put it back in, and you would not notice. But instead, every single read access is cryptographically verified so that you detect changes.
05:03
How does that all apply to service management? Basically, if you use dm-verity-protected disk images, you can deploy your services on your systems and can be sure that when they run, they run in the exact version that you prepared and that nobody has interfered offline with them —
05:22
like, for example, during the download or while the system was already running. This is not useful for everybody, but it's certainly useful for a lot of people. Yeah, as I already mentioned, RootImage= and RootDirectory= are just a fancy chroot. In fact, RootDirectory= is ultimately implemented with a chroot system call,
05:42
at least under normal conditions — not always, but usually. Chroots are highly problematic in many ways. I mean, you can make them work if you know what you're doing, but they come with lots of problems. One of them is, of course, that you first have to mount the API file systems into them,
06:02
like /proc and /sys. Otherwise, the program will not actually run in this environment, because they're not there. That's something we handle in systemd with MountAPIVFS=. It's a Boolean; if you set it, then it makes sure that after chrooting into this thing, you also get the /proc, /sys and /dev file systems there,
06:22
so that everything just works. So that's one thing. The other thing is how to share data. On Unix, there are bind mounts for that. Bind mounts are excellent. Traditionally, with a normal chroot, people would establish these bind mounts on the host, so they would always show up in the mount table of the host.
06:43
In systemd, with unit files, you can use BindPaths= and BindReadOnlyPaths=, which basically allow you to map anything from the host to anywhere inside the chroot environment. And it will only show up in the mount table of the service itself, so it will not pollute your host. It's actually that easy to use, so I'm not going to go into detail with what you do there.
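The MountAPIVFS= and bind mount settings described above might be combined like this (paths invented for illustration):

```ini
[Service]
RootDirectory=/srv/bundles/myservice
# Mount /proc, /sys and /dev inside the chroot automatically
MountAPIVFS=yes
# A single path: the host directory appears at the same path inside
BindPaths=/var/lib/shared-data
# A "source:destination" pair, mounted read-only inside the root
BindReadOnlyPaths=/etc/ssl/certs:/etc/ssl/certs
```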
07:02
You just specify either one path, or a pair of paths that specifies a mapping: what from the host should show up where inside of the chroot environment? Pretty closely related to this is a relatively new set of features of systemd: the runtime directory, state
07:20
directory, cache directory, logs directory, configuration directory. Because usually, if you want to ship your service as a bundle — ideally a bundle that only contains the actual operating system executables — it's still interesting to have the changeable data reside on the host system.
07:41
Specifically, you want something like runtime data, which is like a Unix socket or something like that. You want a state directory where your service can put stuff and it stays around. You want a cache directory where your service can put non-essential data, so that if it's flushed out, it's not bad — if it's there, it optimizes things.
08:02
You might want a logs directory, which is where your service can put logs, and a configuration directory where it can put configuration. If you use these settings in unit files together with RootDirectory= or RootImage=, then this will work a little bit like BindPaths= works: it's going to be mounted from the host into the root
08:24
environment. However, it comes with a couple of bells and whistles. The source directories are created automatically, and because systemd knows about them, it can manage their lifecycle together with the service itself. For example, with the runtime directory, if you use that,
08:41
it will automatically create a directory for you in /run, which is where all the runtime stuff belongs, and that directory is automatically lifecycled together with the service itself. So let's say you're nginx. You are packaged as one of these bundles. And then you want to have your /run/nginx directory.
09:02
And you can just specify it with RuntimeDirectory=. That basically means that the /run/nginx directory is created the instant the service is started, and goes away automatically when the service is shut down. The other ones are similar to this, actually. But it's basically a way you can bundle everything in a resource-bundling
09:24
way, in a very nice way. But you can still share specific things and have them reside on the host, which is nice for updates, for example. Because if it's on the host, it will be unaffected by updates, or not as directly affected by updates. And you can update the bundles independently of that.
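The five directory settings can be sketched like this — "myservice" is a placeholder name, and each directory is created under the fixed prefix noted in the comments:

```ini
[Service]
# Created as /run/myservice, removed again when the service stops
RuntimeDirectory=myservice
# Created as /var/lib/myservice, persists across restarts
StateDirectory=myservice
# Created as /var/cache/myservice, for non-essential data
CacheDirectory=myservice
# Created as /var/log/myservice
LogsDirectory=myservice
# Created as /etc/myservice
ConfigurationDirectory=myservice
```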
09:41
Yeah, these things are also pretty nice because they keep bundles self-contained. Traditionally, if you install a Unix service on some system, it will drop things like tmpfiles.d snippets or something like that, which create additional directories in the file system, in any kind of place.
10:00
But if we are actually interested in the bundling concept, then it's kind of nice that we don't need to do that, at least if all we want is a runtime directory, state directory, cache directory, logs directory, or configuration directory. By the way, the runtime directory is a subdirectory of /run that you configure that way. The state directory is a subdirectory of /var/lib
10:20
that you configure this way. The cache directory is a directory in /var/cache that you configure this way. The logs directory, you guessed it, is in /var/log. And the configuration directory is a subdirectory of /etc that you configure this way. So yeah, that's that. So this is how you can share data between a bundled service
10:41
like this and the host, or other stuff. But the bigger problem with chroot, classically, is how to share the user table, right? Because on Unix, of course, the user table is usually maintained in /etc/passwd. If you have a chroot environment, and the /etc/passwd of the chroot environment
11:01
is different from the one on the host, that can become a bit of a problem, because you still live in the same world: the idea of which user a given UID refers to on the host might be quite different from the idea that the chroot sees. This is only a problem if you use chroot without user namespaces and PID namespaces. I mean, it's actually a problem that things like Docker
11:22
have as well, except that the Docker people usually don't tell you about this problem. And it's not as visible, because if you disconnect the PID namespaces from each other — if you can't see the processes of the other users — it's not as visible that they still run as the same users.
11:41
But yeah, I mean, the general solution is to actually use user namespaces, which aren't yet that widely adopted on Linux — I figure because they're hard to use, and if you ask me, they're kind of incomplete. But yeah, so the question is, again: what do you do if you have your bundled service and you want to use RootImage= in a systemd
12:02
service file — what do you do about the user database? My suggestion is to not share it at all. Instead, there's this Boolean option called PrivateUsers= for services. If you turn it on, this basically disconnects the user table
12:20
that the service sees from the one on the host. This is ultimately implemented with user namespaces. But instead of pretending that userns is a solution for everything and exposing the full functionality, it exposes it in one very, very specific way. What it does is basically install a mapping so that the root user of the host
12:42
shows up as the root user that the service sees. The nobody user of the host will show up as the nobody user that the service sees. The user of the service itself will also be mapped like this. And everything else is mapped to the nobody user. This basically means it doesn't really matter what the bundled thing actually has in /etc/passwd,
13:04
because we don't really care. We only care about the root user, the nobody user, and the service user itself. And root and nobody are actually the only ones where all the distributions tend to agree which user ID they actually have: root always has UID 0, and nobody
13:26
has UID 65534. You get the concept. So with PrivateUsers=, you can disconnect that. All the other users — the regular users that you might see in ps or something — don't actually matter anymore.
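A sketch of PrivateUsers= combined with a bundled image, as described above (image path and user name invented):

```ini
[Service]
RootImage=/srv/images/myservice.raw
User=myservice
# Detach the user table: only root, nobody and the service's own
# user are mapped into the service; all other host users show up
# as nobody
PrivateUsers=yes
```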
13:41
And then there's another module called nss-systemd, which synthesizes user entries for root and nobody. Which basically means that if you have nss-systemd enabled — which the distributions increasingly have — then you don't actually need /etc/passwd at all, because these users, which everybody agrees on, will exist anyway, regardless of whether /etc/passwd exists or not.
14:04
Because nss-systemd is a module that is loaded into the user management of Linux and will make sure that they always show up. There's one piece missing in this: if you have a bundled service — a service that uses RootDirectory= or RootImage= —
14:21
how do you make sure that, from inside of this environment, you actually see that the user ID you're running as has a specific name? I have some ideas about this. It's going to be very technical, so I'm going to skip over this bit. So much for the bundling. The essence of everything I told you really
14:42
is: use RootDirectory= and RootImage= if you want bundling with normal services. It should just work, and you can use standard images. And with the PrivateUsers= thing, you can deal with the user database mismatch. But yeah, the other part of containers, besides the bundling, is, of course, sandboxing.
15:01
And sandboxing is something we have recently added a lot of features for in systemd. Basically, all my remaining slides are just about specific sandboxing features. We'll go quickly through them. The first one — I blogged about this one — is actually one of the more interesting ones. You know how, on classic Unix,
15:20
services used to be sandboxed, right? It's all about user IDs. It's how we have been doing it since the '90s or even before. There's the Apache user, called httpd or something; Apache runs as that, and because it's not running as root, it cannot access whatever else is happening on the system.
15:40
And traditionally, this is how we put together our Unix systems, right? Every system service we had had its own user ID it was running as, and was thus isolated, in some way, from everything else. It is, if you will, the quintessential sandboxing technology that Unix always had. I mean, it's widely adopted, but it's also kind of frozen in time, right?
16:03
It has this problem that it's very expensive to actually allocate a user, because most distributions define that you can have at most 1,000 system users. So if you install 1,000 services or so, you have a problem.
16:21
This basically means that you cannot just allocate users on the fly, use them for something and then release them, because there are simply too few of them to do this. And even if you did, there's a general problem on Unix: there is no scheme to actually release the ownership of a system user ID again,
16:41
because user ID ownership — the ownership of a file, directory, IPC object, or whatever else Linux maintains — is bound to a numeric user ID. At the time you create that object, the object becomes owned by that user ID.
17:01
Now, if you want to reuse a user ID for different purposes, because you only have 1,000 of them, you would first have to make sure that you release the original resources — files, directories, IPC objects, and so on. But that's incredibly hard, because you would have to scan the entire file system for this. And what do you do if a user owned a file
17:21
on some file system that is currently not mounted? You cannot really properly solve that. Hence, most distributions just declare, for safety reasons, that they will never actually release user IDs again. So if you install a package and then remove it, most of the files are removed, but the system users that were allocated are not.
17:42
So you leave major artifacts in the system, and given that there are only 1,000 of them, that's pretty nasty. In systemd 235, the most recently released version, we have the dynamic user concept, which basically uses a couple of tricks to make this all more bearable, right?
18:02
So DynamicUser= is a Boolean option. If you turn it on for a service, it basically means that the instant the service starts, a new system user is allocated, and the instant the service shuts down, it's released again. How do we deal with the problems I mentioned — the fact that user ID ownership is sticky on Unix?
18:23
There are two strategies for that. One of them is that we forbid creating objects in most ways. This basically means, for example, that we use a couple of other sandboxing options, which I'll talk about later, that ensure that the service has very few directories it can actually write to.
18:41
And if it can't write to anything, it of course cannot leave objects around owned by this user. The other strategy is to define some specific areas where the service can write to after all, but then destroy these areas the instant the service goes down. Specifically, that's PrivateTmp=, for example. It's a simple Boolean that is set for the service,
19:00
and it's actually implied if you set DynamicUser= for the service. It basically means that the service, as long as it runs, has a private directory in /tmp that appears as its own /tmp, and that goes away automatically when the service goes down. So, two strategies: forbid writing, and where we do write, make sure that it is removed again afterwards.
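The DynamicUser= behavior described above, as a minimal sketch (service name invented):

```ini
[Service]
# A system user is allocated when the service starts and released
# when it stops. This also implies PrivateTmp= (a private /tmp,
# destroyed on stop) and RemoveIPC= (IPC objects cleaned up on stop).
DynamicUser=yes
ExecStart=/usr/bin/myservice-daemon
```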
19:22
So that's our strategy there. Yeah, DynamicUser= is also pretty nice because it keeps bundles self-contained, right? Traditionally, if you install a system user, you drop in a sysusers.d snippet, or you invoke adduser in your RPM or something like that. But that basically means you need to distribute stuff
19:43
across the whole file system. This way, you don't have to do that, because the service file contains all the information about the user that needs to be allocated. It's nicely self-contained, and it leaves no artifacts in the system.
20:00
So yeah, the focus is really on leaving no artifacts. One other sandboxing concept, pretty closely related to this, is RemoveIPC=. It basically just says that System V and POSIX IPC objects that are created by the service get automatically removed when the service goes down.
20:21
You know, the System V and POSIX IPC systems are usually not that visible to administrators, but they are how processes communicate on Linux. It's a Boolean; it's also implied by DynamicUser=, but you can use it for everything else as well. If you set it, then when the service goes down, we iterate through the list
20:40
of currently allocated IPC objects and remove every single one that matches the user ID that your service ran as. Whereas PrivateTmp=, as I already mentioned, gives you this private /tmp that is lifecycle-bound to your service itself. The result of this is, again, no artifacts left, right? You start the service, you shut it down, and all your temporary files
21:00
and all your IPC objects go away with the service. There's another option, which is PrivateDevices=. I mean, all these options are much more fine-grained than what you traditionally can do with containers, right? Containers are, by default, locked down very much and very much disconnected from the host. You don't see the process table, you don't see the user table — or at least you think you don't see the user table,
21:22
but actually you do. You don't get access to devices and things like that. In systemd, we come from the other direction: traditionally, services run with most privileges, because that's how System V init works. So we go the other way around and lock things down bit by bit. I wish it weren't so, of course, because security is always better
21:40
if you come from the locked-down version and open things up bit by bit. But we can't, due to the System V heritage. But still, these individual bits, if you use them one by one, let you build a very nice sandbox — but you of course have to turn them all on individually.
22:03
Yeah, PrivateDevices= basically gives you a private instance of /dev that doesn't contain any real devices, right? What it does provide you in /dev, though, are the pseudo-devices like /dev/null, /dev/zero, /dev/random, /dev/urandom, which aren't real devices.
22:20
I mean, there's not a physical PCI card or something behind them. It's just a way Linux likes to expose its APIs. They're in contrast to, let's say, /dev/sda, which is actually a physical device — it's your hard disk — or to /dev/snd/whatever,
22:41
which is actually your sound card. So with PrivateDevices=, you basically get disconnected from that. You still get the API character devices, but you do not get anything else. Unless your service needs actual physical hardware access — and almost no service does — it's a great Boolean to set. Then there's PrivateNetwork=, which uses network namespacing
23:01
to disconnect you from the host network. For every service that doesn't need networking, it's a great thing to do. Very recently, we added something more fine-grained, which is a little bit like a firewall, where you can basically configure, for your service, which IP addresses it shall be able to access. You specify that simply by IP address
23:21
and netmask, and it just works. There are a couple more things like that. For example, ProtectKernelTunables= takes away the service's access to /proc/sys. ProtectKernelModules= takes away the ability to load kernel modules. These are all Booleans, by the way. ProtectControlGroups= takes away the right to make changes to the control group file system.
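Pulled together, the device, network and kernel protections just listed might look like this in a unit file (the particular combination is illustrative, not from the talk):

```ini
[Service]
# Private /dev containing only pseudo-devices like /dev/null
PrivateDevices=yes
# /proc/sys and friends become read-only for the service
ProtectKernelTunables=yes
# Loading kernel modules is denied
ProtectKernelModules=yes
# The control group file system becomes read-only
ProtectControlGroups=yes
# The per-service IP "firewall": deny all traffic except localhost
IPAddressDeny=any
IPAddressAllow=localhost
```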
23:42
Yeah, then there's SystemCallFilter=, which allows you to apply a specific system call filter to a service so you can lock it down, so that dangerous system calls — like, for example, setting the system clock or rebooting the system — are not available. Traditionally, it's pretty hard to use, because who actually knows all the system calls
24:02
that you want to list there? It's a lot simpler now, because we have system call groups, which are basically named groups of system calls that make it easier to enable and disable specific facets. I've got about five minutes left now. There are quite a few more,
24:20
and I think it's not bad at all that we can't talk about all of them. Just quickly: with one of them you can restrict address families, like socket address families; with another you can restrict the system call architectures; and so on. The message you should get from all of these is that we have all these sandboxing options these days,
24:41
and you can use them. I mean, much of this — not all of it, but much of it — is applied by a container manager as well to the containers it's running. But the message you really should take home is that if that's what you're in for, if that's what you're looking for, then you can just do that for normal services as well.
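The system call filtering and related restrictions might be sketched like this; the named system call groups (prefixed with @) are sets systemd ships, though the exact groups available depend on your systemd version:

```ini
[Service]
# Blocklist the named groups for clock changes and reboot;
# "~" inverts the list, so everything else stays allowed
SystemCallFilter=~@clock @reboot
# Only allow the native system call ABI (e.g. no 32-bit calls)
SystemCallArchitectures=native
# Restrict which socket address families may be used
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
```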
25:01
Just turn these Booleans on bit by bit, and you can run your stuff in a very locked-down fashion. So, yeah. This is not supposed to be a replacement for a container manager, not at all, right? But the reason why I'm doing this talk is mostly because I work for Red Hat, right? And I come in contact
25:21
with lots of people who use containers for various different things, right? Because containers are the big word, everybody tries to fit their specific problem into the container world. For example, I met with storage people who want to ship their storage management stuff as a container. And that's certainly a great thing to do, until you notice that,
25:41
well, if you want to manage storage, you actually need hardware access, like block device access. And as soon as you do block device access, it becomes really, really hard with Docker, because it's not designed that way — it's actually supposed to take those rights away from you. And there are lots of stories like that, where people see containers as the solution for everything
26:01
and then try to fit their problems into them. Most of the time, I just say: OK, you're actually interested in the sandboxing; or, you're interested in the bundling, but you're not actually so much interested in the rest of it. The message I really want to get across is that it's a fluid thing, right?
26:22
Like, maybe containers are actually not the solution for you, maybe you can use just plain service management and turn on the sandboxing and there you go. Or maybe you can use plain service management and turn on the resource bundling and there you go, and it solves your problems as well. Now, I think I got like four minutes left, so maybe we should do questions if anybody has a question.
26:41
There's a question. Based on the current state of systemd, how far along do you think you are on the path to portable systems? Portable system services, you mean? Well, my last slide here was about the outlook for that. So, portable services are something I've been working towards
27:02
in the longer run, which is supposed to be something where you really can just drop in a service bundle, like an image file, and then systemd will deal with the rest of it. Basically, it's a way that everything I presented on my slides is pulled together in one tool and made nice.
27:21
At this point, we're basically there, right? Like, all the individual building blocks that I want for the portable services are there. It's just a matter of writing this generator that looks at the image files you drop in, pulls out the relevant service files, makes them available on the host system, and points them back, so that RootDirectory= or RootImage= refers to the original image,
27:42
so that they appear as a native service. So, it's mostly there. It's just about writing this generator to fit it all together. It's one of the things I have on my to-do list next, basically. It was a long way to get there because, I mean, adding all the sandboxing features,
28:00
adding all the image handling fixes, figuring out what we actually want to do with the user database in chroot environments and things like that, that was a lot of work. But nowadays, it's really pretty much just actually doing the generator. Oh, by the way, something that's also really important to mention is that, because these bits are so fine-grained and you can pick exactly what you want,
28:22
what you can also do is, like, use this to hook random other stuff up to systemd and make it run as a native systemd unit. Like, for example, you could probably just write a generator that takes an OCI image or something like that and dynamically converts it into a unit file
28:42
with the generator. I mean, for those who don't know what generators are: generators are a systemd concept for dynamically converting foreign stuff that wants to run as a service into systemd unit files. We originally created that to convert SysV units, like, SysV init scripts, dynamically into systemd units, but it's actually
29:02
way more powerful than that. You can use it for all kinds of other stuff. So, what I basically wanted to say here is that while I think the portable services are a great way forward, none of this technology is specific to that, right? Like, you can stick it together in completely different ways, write a generator from OCI to this, and it will work too.
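A minimal sketch of what such a generator could look like. The image path, unit name and binary are made up; the three-output-directory calling convention is how systemd actually invokes generators. For a standalone demo, the script falls back to a temporary directory when run without arguments.

```shell
#!/bin/sh
# Sketch of a hypothetical generator. systemd runs generators with three
# arguments: the normal-, early- and late-priority output directories
# for generated units. Fall back to a demo directory when run by hand.
set -eu
normal_dir="${1:-/tmp/generator-demo}"
mkdir -p "$normal_dir"

# Hypothetical image; a real generator would discover images by
# scanning a drop-in directory.
image=/var/lib/machines/myapp.raw

# Emit a unit that runs the bundled payload via RootImage=.
cat > "$normal_dir/myapp.service" <<EOF
[Unit]
Description=Generated service for $image

[Service]
RootImage=$image
ExecStart=/usr/bin/myapp
EOF
```

A real generator would live in /usr/lib/systemd/system-generators/ and has to be fast, since systemd runs it at boot and on every daemon reload.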
29:23
Any other questions? No, any other questions? There's a question. You said that user namespaces were kind of incomplete. Could you elaborate a bit on why you think that is? Why it is or how it is? How it is.
29:41
I don't know, user namespaces have been around for a while, but, I mean, you can make use of them for specific use cases. For example, Flatpak is probably one of the more sensible places where they're being used. But in general, you know, we are lacking a shift file system, like shiftfs, and it basically means that whenever you actually want to use
30:00
user namespaces the way they were originally intended to be used, you have to shift around all the UIDs in your image, because otherwise everything will be owned by user nobody, and that's usually not how systems work, right? And the fact that you have to shift around, right, that you have to do a recursive chown, is just awful. That's not solved to the end.
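As a sketch of the per-service use mentioned later in this answer: PrivateUsers= is a real systemd directive that gives a service its own user namespace, where only root and the service's own user are mapped in and everything else shows up as nobody, which is exactly the mapping behavior described above. The unit and binary names are made up.

```ini
# Hypothetical unit illustrating PrivateUsers= (a real systemd directive).
[Service]
ExecStart=/usr/bin/mydaemon
# Run in a private user namespace: only root and the service's own
# user are mapped; all other host users appear as "nobody".
PrivateUsers=yes
```

This works without any UID shifting because the service never needs the rest of the host's user table.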
30:20
Other than that, it's probably been the major source of security vulnerabilities in the kernel in the last months or years, right? I don't know. I mean, it's super complex. I don't think it's solved to the end. I think it's very hard to use. I think it's way over-designed, because it allows arbitrary mappings from any user table to any other user table.
30:41
I also think it's a problem that suddenly systems have much smaller user tables, right? Like, because you always have to slice up the 32-bit UID range you have into smaller bits. And, I don't know, I mean, we make use of it, like PrivateUsers=, the boolean, uses it,
31:01
but I don't think it's solved to the end, and I don't think there are many deployments, right? Maybe Docker has code for it right now, but I'm not entirely sure that people actually run it in the full mode it's intended to be used in. It appears to be very much something that is still in progress, has been in progress for the last five years or something, and will probably continue being in progress for the next five years or something,
31:21
until we get shiftfs or something in the kernel, which doesn't look very likely at this moment, as far as I know, at least. Any other questions? Okay, I think the time's over. Thank you very much for your time, and Chris will probably do more announcements now, so please stay.
31:40
Yeah, okay. Thank you. Thanks, Lennart.