
Revamping libcontainer's systemd driver


Formal Metadata

Title
Revamping libcontainer's systemd driver
Number of Parts
44
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
In this talk, I'll go through my efforts to revamp libcontainer's systemd driver, in particular to support the unified cgroup hierarchy. libcontainer is part of runc (opencontainers/runc in GitHub) and is used by the Docker and containerd ecosystem to spawn containers. This work is trying to bridge the gap between the Docker/containerd/Kubernetes ecosystem and cgroup2 through the unified hierarchy, using systemd as an authoritative container manager. I'll also touch on alternative approaches (such as crun and systemd-nspawn) and briefly talk about the OCI standard and the need for it to evolve to properly support cgroup2 semantics.
Transcript: English (auto-generated)
So we have Felipe, who's going to give us a talk about revamping libcontainer's systemd driver. He works for a small blue website, which some of you may be using. But Felipe, I think you're brand new there. So why don't you take it away? Thanks.
Hi, all. I'm Felipe. I'm part of the Facebook delegation. Let me tell you a little bit about me. I work for Facebook, and I care a lot about cgroup2 and about containers. I actually joined Facebook only recently; I started working there in June. Previously, I was at Google, working on Kubernetes, on the kubelet, and focusing on cgroup2. And I've been a systemd contributor since 2014. So what's this talk about? This is about containers in the Docker/Kubernetes world.
There are different approaches to containers. There's systemd-nspawn. There's LXC. And there's the Docker/Kubernetes world, and this talk is about that world. It's about cgroup2 as well. And it's about systemd. So, a little bit about the state of the cgroup world.
In this state of the world, systemd can be considered the main user-space API to cgroups. You can definitely talk directly to the kernel, like writing to cgroupfs, but systemd tries to expose this through a nice-to-use interface, through D-Bus or through units.
And there are kind of three-ish modes for the cgroup hierarchy. There is the legacy mode, which is cgroup1 only. Then there's the hybrid mode, which already mounts the cgroup2 tree but doesn't really use it for any of the controllers. And then there's the unified mode, which is cgroup2 only. With hybrid you can actually move some of the controllers into the cgroup2 tree, like enable them there, but that's not something anyone is doing anyway. Hybrid is where most people are right now, because it has been fairly stable since the systemd versions of two or three years ago. But that means people are still using cgroup1. And unified is where we would like to be, with the controllers using cgroup2.
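As a minimal sketch (not libcontainer's actual code), a program can tell which of those three setups it is running on by checking what is mounted at /sys/fs/cgroup; the "unified" subdirectory check for hybrid is an assumption about the usual mount layout:

```go
// Detect which cgroup setup the host runs: legacy, hybrid, or unified.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func cgroupMode() string {
	var st unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
		return "unknown"
	}
	if st.Type == unix.CGROUP2_SUPER_MAGIC {
		return "unified" // cgroup2 is the only mounted hierarchy
	}
	// /sys/fs/cgroup is a tmpfs of v1 controllers; if a cgroup2 mount also
	// exists at the conventional "unified" subdirectory, this is hybrid.
	if err := unix.Statfs("/sys/fs/cgroup/unified", &st); err == nil &&
		st.Type == unix.CGROUP2_SUPER_MAGIC {
		return "hybrid"
	}
	return "legacy"
}

func main() {
	fmt.Println("cgroup mode:", cgroupMode())
}
```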
So, motivations for why we want cgroup2: the hierarchy is better, delegation works better, and there are new controls. Dan was talking a little bit about memory.low, which is something fairly new to cgroup2, and more io controls are coming to cgroup2 as well.
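As a tiny illustration (not from the talk), memory.low is just a file in the unified hierarchy, so inside a delegated cgroup it can be set with a plain write; the cgroup path below is hypothetical:

```go
// Set a best-effort memory protection of 128 MiB on a cgroup2 group.
package main

import (
	"log"
	"os"
)

func main() {
	cg := "/sys/fs/cgroup/mycontainer" // hypothetical delegated cgroup
	if err := os.WriteFile(cg+"/memory.low", []byte("134217728"), 0o644); err != nil {
		log.Fatal(err)
	}
}
```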
There are a lot of improvements that are coming to cgroup2 only. All this eBPF goodness is coming to cgroup2 as well. Alban talked about traceloop earlier this morning; that's cgroup2. My colleague Julie is going to talk more about BPF later.
And all of this is connected to cgroup2. Fedora 31 is going to be using the unified hierarchy by default, so that's a strong motivation to solve this, and we would like to drive up adoption by other distros as well. cgroup2 also improves support for nested containers: a lot of the work we've been talking about with rootless containers, cgroup2 makes much of that easier too. So those are the motivations to have better support for cgroup2 in libcontainer. Let me give a small explanation of where the components sit; many of you might be familiar with this. There are quite a few container managers these days, for instance Podman, Docker, CRI-O, and containerd. Podman and Docker are mainly user container managers, so you run your own containers there, while CRI-O and containerd are backends, daemons that serve Kubernetes. The kubelet talks to either CRI-O or containerd through the CRI protocol. And all of those actually use runc to execute, to actually run the containers. And runc uses libcontainer; actually, both components are in the same project,
the same source tree. So which of those components support the unified hierarchy, support cgroup2 natively? Only one of them does at present, which is Podman. And it does mainly because it uses crun, which is a re-implementation of the runc CLI and has fixes to work with cgroup2. So does that mean we could simply have the others use crun, and that would solve our problem?
In fact, it doesn't, because everything else links to libcontainer, uses libcontainer. So just changing the runtime is not enough. So, digging a little into libcontainer: libcontainer has several components that create abstractions over Linux features for containers. Namespaces and capabilities are other things that libcontainer abstracts, and cgroups are one of them. So we're looking at the cgroup part of libcontainer.
And it supports two separate drivers. One of them is cgroupfs, which essentially writes directly to the cgroup file system. The other one is systemd, which uses D-Bus calls to talk to systemd. Now, if the systemd driver always went through systemd, this wouldn't be a problem, because systemd's D-Bus interface abstracts whether you're running unified or hybrid or legacy, and exposes a consistent API.
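As a rough illustration of that consistent API (the unit name and limit value are made up, and this uses the go-systemd and godbus libraries rather than libcontainer's own code), asking systemd to apply a memory limit looks the same regardless of which hierarchy the host runs:

```go
// Set a memory limit on a unit over D-Bus; systemd translates it to
// memory.max on cgroup2 or memory.limit_in_bytes on cgroup1.
package main

import (
	"context"
	"log"

	systemd "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

func main() {
	ctx := context.Background()
	conn, err := systemd.NewWithContext(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	props := []systemd.Property{
		{Name: "MemoryMax", Value: godbus.MakeVariant(uint64(256 * 1024 * 1024))},
	}
	if err := conn.SetUnitPropertiesContext(ctx, "mycontainer.scope", true, props...); err != nil {
		log.Fatal(err)
	}
}
```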
But the systemd cgroup driver in libcontainer actually uses systemd for some operations and then goes around it to change limits and settings in cgroupfs as well. So the first attempt at revamping this libcontainer driver was rewriting large parts of the systemd cgroup driver to go through systemd all the time, through the bus, and not write directly to cgroupfs anymore. I actually opened a PR for that. But this attempt failed.
It failed, but it was useful to learn something from it. One part is that touching legacy code that's been around for a while is hard; there are problems there. One thing is compatibility with versions of systemd: the reason this libcontainer driver writes directly to cgroupfs is that when it was written, systemd did not support many of the limit settings. So this first attempt was fine if I ran it against systemd 241 or 242 or 243, but not if I went back to something widely used like Ubuntu 18.04, or CentOS 7 with systemd v219. So that's definitely an issue that needs a fix here.
The other part is features that were missing from either systemd or the kernel. One example is freezing a cgroup, which was only made available recently in the kernel's cgroup2 implementation. So that was another part; there were some attempts to work around it, but in the end it's a limitation there. So the new proposal is, instead of rewriting the code of the systemd driver, to split it into two separate implementations: one for the legacy hierarchy, which covers legacy and hybrid (so basically cgroup v1) and is essentially the current code, and a second one that handles the unified case. The one for the unified case is going to be able to go through systemd essentially all the time. So yeah, a separate unified implementation. And the advantage of this approach is that we can enable that implementation only when we figure out that we're running on the unified tree, and so we don't need to worry about compatibility with other OSes like CentOS 7, RHEL 7, Ubuntu 18.04 or even 16.04.
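A rough sketch of that split, with hypothetical names rather than libcontainer's real API: the unified driver is only instantiated when the detection (for example, the statfs check sketched earlier) says the host is on the unified hierarchy.

```go
// Two driver implementations behind one interface, selected at runtime.
package main

import "fmt"

type cgroupManager interface {
	Apply(pid int) error             // place the container into its cgroup/unit
	Set(res map[string]string) error // apply resource limits
}

// legacySystemd stands in for today's driver: D-Bus plus direct cgroupfs writes.
type legacySystemd struct{}

func (legacySystemd) Apply(pid int) error             { return nil }
func (legacySystemd) Set(res map[string]string) error { return nil }

// unifiedSystemd stands in for the new driver: everything through systemd's D-Bus API.
type unifiedSystemd struct{}

func (unifiedSystemd) Apply(pid int) error             { return nil }
func (unifiedSystemd) Set(res map[string]string) error { return nil }

func newSystemdCgroupManager(unified bool) cgroupManager {
	if unified {
		return unifiedSystemd{}
	}
	return legacySystemd{}
}

func main() {
	m := newSystemdCgroupManager(true) // e.g. the result of the statfs check
	fmt.Printf("selected driver: %T\n", m)
}
```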
One of the issues I mentioned is compatibility of the D-Bus API. This is something that should be solved in systemd to allow for this kind of implementation; it's a general issue for clients of systemd's cgroup API. Right now, if you write a systemd unit with specific options and systemd doesn't recognize some of them, it simply ignores those. That's by design, and that's fine, because if you're using limits that only a newer version of systemd recognizes and you run the unit on an older version, that unit is still going to work; systemd simply ignores the directives it doesn't know. But that's not the case with the D-Bus interface. When I'm creating a unit over D-Bus and I ask it to set, say, some new IO directive that's coming and isn't even in the latest systemd yet, and it doesn't recognize it, that whole D-Bus call is going to fail. So this needs to be fixed in systemd, probably through a D-Bus protocol that can take optional directives and report back on the ones that weren't available in that version of systemd.
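Until something like that exists, a client has to cope on its own. A minimal sketch of one way to do it (the unit name and the required/optional split are made up, using the go-systemd and godbus libraries): try everything in one call, and on failure retry with only the essential properties.

```go
// Retry a unit-property update without the nice-to-have properties when an
// older systemd rejects the whole D-Bus call.
package main

import (
	"context"
	"log"

	systemd "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

func setWithFallback(ctx context.Context, conn *systemd.Conn, unit string,
	required, optional []systemd.Property) error {
	// First attempt: required and optional properties together.
	if err := conn.SetUnitPropertiesContext(ctx, unit, true,
		append(required, optional...)...); err == nil {
		return nil
	}
	// Older systemd rejected the call; fall back to the required set only.
	return conn.SetUnitPropertiesContext(ctx, unit, true, required...)
}

func main() {
	ctx := context.Background()
	conn, err := systemd.NewWithContext(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	required := []systemd.Property{
		{Name: "MemoryMax", Value: godbus.MakeVariant(uint64(1 << 30))},
	}
	optional := []systemd.Property{
		// Stands in for some newer directive an old systemd would not know.
		{Name: "MemoryLow", Value: godbus.MakeVariant(uint64(256 << 20))},
	}
	if err := setWithFallback(ctx, conn, "mycontainer.scope", required, optional); err != nil {
		log.Fatal(err)
	}
}
```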
Missing features in systemd and the kernel are something that has been worked on. For instance, freeze support was mentioned previously, and that's actually available now: cgroup.freeze is available in the kernel's cgroup2 implementation as of kernel 5.2.
That, combined with the need for a D-Bus API with backward compatibility of directives, means we probably need a fairly recent stack of kernel plus systemd to make this work. The good news is that we seem to be right on time for distros to actually start switching to the unified hierarchy, so we can assume most of those features are going to be in place when they switch, and we can solve this problem in a way where all the components are deployed together and everything works. So the vision for the future is one where the driver detection based on unified or legacy hierarchy is made by libcontainer, which can switch to the systemd implementation based on whether the unified hierarchy is mounted. And it's going to be fully functional starting from specific versions of systemd and the kernel: kernel 5.2 looks like it has most of the features needed, and perhaps the next version of systemd could have everything that we need as well. And hopefully that helps drive up adoption of the unified hierarchy by distributions other than Fedora. And I wanted to call out Giuseppe from Red Hat,
who has been working on this problem. His focus is slightly different from this one: he's been working on re-implementing the access to cgroupfs in libcontainer, and he has one PR merged, so he already split the legacy and unified drivers. That partly fixes the problem. It doesn't fix it for runc, but it fixes it for the other users of the library, like kubelet and CRI-O. It still writes into cgroupfs directly, and it doesn't really implement some of the controllers, for instance the device controller, which in cgroup2, as in the systemd implementation, is based on eBPF. So that would require writing eBPF support into libcontainer, and that's one of the reasons to go through systemd instead: you don't need to re-implement that. One alternative or additional approach to consider is using systemd's recommendations for delegation.
So instead of simply creating new cgroups under the root of the tree, it would essentially use the recommended approaches; systemd upstream published a document a while ago with recommendations on how to do this. You can just use systemd natively, so you get slice units and scope units. But you can also simply have your service delegate and then create your tree under your service, or create a new scope with delegation and then create your own tree underneath it, roughly as in the sketch below.
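A minimal sketch of that last variant (the unit name and cgroup path are made up; it uses the go-systemd and godbus libraries): ask systemd for a transient scope with Delegate=yes containing our PID, then build sub-cgroups underneath it with plain directory operations.

```go
// Create a delegated transient scope and a private sub-cgroup inside it.
package main

import (
	"context"
	"log"
	"os"

	systemd "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

func main() {
	ctx := context.Background()
	conn, err := systemd.NewWithContext(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	props := []systemd.Property{
		{Name: "Description", Value: godbus.MakeVariant("my container manager")},
		{Name: "Delegate", Value: godbus.MakeVariant(true)},
		{Name: "PIDs", Value: godbus.MakeVariant([]uint32{uint32(os.Getpid())})},
	}
	done := make(chan string, 1)
	if _, err := conn.StartTransientUnitContext(ctx, "my-manager.scope", "fail", props, done); err != nil {
		log.Fatal(err)
	}
	<-done // wait for the start job to complete

	// From systemd's point of view the scope is one opaque unit; inside it we
	// are free to create a per-container tree (this path is hypothetical).
	if err := os.MkdirAll("/sys/fs/cgroup/my-manager.scope/container1", 0o755); err != nil {
		log.Fatal(err)
	}
}
```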
That would mean the container manager manages that whole unit, and from systemd's point of view it's essentially a single cgroup. There are drawbacks to that approach: seeing this as a single thing means that whenever systemd needs to take action on that unit, it's going to see all the containers running on the machine as a single unit. One item for future work is evaluating the cgroup2 controllers against the OCI spec. OCI is the standard image and runtime format for Docker containers and Kubernetes containers, and it specifies, besides the image itself, which constraints you use, which limits you set, and so on. And it turns out the OCI attributes for resource control are very tied to the cgroup v1 model, which has changed a lot in cgroup v2, and cgroup v2 is still evolving, like we were talking earlier about the new controls that are coming.
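To make the "tied to the cgroup v1 model" point concrete, here is roughly what OCI runtime-spec resource limits look like in Go, using the opencontainers/runtime-spec specs-go types with arbitrary values: knobs like memory swappiness or CPU shares mirror cgroup v1 files, while later spec revisions bolt on a generic unified map for raw cgroup v2 keys.

```go
// Build and print an OCI LinuxResources block with cgroup1-flavored knobs
// plus the newer cgroup2 "unified" escape hatch.
package main

import (
	"encoding/json"
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	limit := int64(256 << 20)
	swappiness := uint64(60) // cgroup1-only concept
	shares := uint64(1024)   // cgroup1 cpu.shares; cgroup2 uses cpu.weight

	res := specs.LinuxResources{
		Memory: &specs.LinuxMemory{Limit: &limit, Swappiness: &swappiness},
		CPU:    &specs.LinuxCPU{Shares: &shares},
		// Added in later spec revisions for raw cgroup2 attributes.
		Unified: map[string]string{"memory.low": "134217728"},
	}
	out, _ := json.MarshalIndent(res, "", "  ")
	fmt.Println(string(out))
}
```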
So this probably needs some work in looking at the newer controls, perhaps higher-level controls instead of very low-level ones, and the ability to do extensions as well. And yeah, that was it. Thank you.
Hi. So I've been reading those pull requests for a while
and I've got the impression that runc is slowly moving to become a wrapper on top of systemd. Is my impression correct? I think, maybe. I think there's interest in supporting the case of not running on systemd anyway. Many use cases, for instance nested containers, don't necessarily have a systemd PID 1 in the first container. So you want to support the use case of being able to write directly to a delegated cgroup tree. Also, as I mentioned, libcontainer doesn't do only cgroups; this is actually a very small part of what libcontainer does, and it only needs to interact with systemd for those particular cases of managing cgroups. Namespaces are something you do on your process, capabilities are also something you can set. So in that sense, yeah. I'm not sure if that answers it.
Hi, regarding the fact that OCI basically embeds how cgroup v1 was designed into the API, I wonder whether or not this also affects systemd. I know D-Bus is obviously extensible, but how is that problem being dealt with for the D-Bus API? Sure, yeah, so there were some cases where some limits were mapped. I think, looking at the history of systemd, what happened is that cgroup2 and systemd's support for resource control were developed pretty much in parallel, so that systemd at some point even stopped trying to expose some of the cgroup v1 APIs, waiting for the cgroup v2 API to happen. There have been some cases where directives were mapped, like the memory limit mapped to MemoryMax, since the name on cgroup v2 is memory.max; the semantics don't match exactly, but they're close enough in some cases, yeah.
So in that sense, I think systemd bypassed this problem for the most part by at some point planning to implement mostly cgroup v2. Yeah, so when we were trying to move to cgroups v2, the Kubernetes community, and specifically Google, was somewhat hostile to changing to cgroups v2 because they thought cgroups v1 was good enough. Have you seen that attitude change? I was sad to see you leave Google to go to Facebook. No, I think essentially what happened is a question of prioritization, right?
It wasn't the top priority. And now people are thinking about things like eBPF and how much stuff is coming on eBPF, you see people talking about eBPF all the time, and that's the thing that I think is changing a lot of the attitude towards cgroup2: we want eBPF, so let's take cgroup2 for that reason. But yeah, there are a lot of enhancements in cgroup2, and actually tomorrow we'll have Anita and Daniel talking about oomd and Johannes talking about senpai. So there's a lot of stuff being developed on cgroup2. People saw the potential in cgroup2 and invested there, while cgroup1 was clearly hitting its limits. Hi, speaking about that too,
weren't those cgroup, I mean eBPF, features also enabled by choosing a hybrid approach? Why was the hybrid approach not pursued more? There are actually limits to that. You can get some BPF features, like you can run some BPF traces, on the hybrid approach, but it's fairly limited in what you can do. If you actually unlock the unified cgroup tree with the controllers, the information you get is much richer. So to a limited extent, yeah, you can get some of the eBPF features on the hybrid approach, but you get much better integration with cgroup2, yes.
Just to mention, regarding the hybrid approach: in retrospect, I think it was a mistake that we added it, that's just me. It's a stopgap that has no future and we should not have done that. So yeah, forget about the hybrid mode. I mean, if you waste your resources on that, then you waste them for nothing. It's where we are today, yeah. So I'm sorry if I touched that.
It's okay, I can come here and give a talk. I was just kind of curious about some of the stuff, like even what you've mentioned with crun. I know that a lot of people are using libcontainer in places, but I've been kind of curious, and I'm not just boosting it because Giuseppe works for Red Hat, but we're getting contributions to it from interesting places because it now supports this stuff natively and it runs lighter, on MIPS and all this other kind of stuff. Was it even much of a consideration there to just do that? Sorry, I didn't understand. To use crun instead of libcontainer or what? Sorry, to use crun instead of libcontainer? Oh, you mean, oh yeah, okay. To use crun instead of libcontainer, or instead of runc and so forth, yeah. Yeah, so crun solves some of the immediate problem of unblocking part of this, because crun is a whole implementation and has been done with a lot of cgroup2 support directly. But the fact is crun is a standalone container runner, while libcontainer is a library that's actually used by most of the other components, right? So that's why just switching to crun doesn't work for the general case. I mean, it would work for CRI-O, just not using libcontainer and using crun, but kubelet is also using libcontainer to create its slices, right, so.
Yeah, on the topic of mistakes we should never have made, I think telling people they should use libcontainer, or even suggesting it was a good idea, was a mistake in retrospect, especially since the libcontainer API makes absolutely no sense, so, yeah. I mean, it's a bit late to say this now, but it would have been nice to convince people four or five years ago not to do this. But yeah, we're stuck with it, unfortunately. But I do think that even if we do get cgroup v2 stuff in libcontainer, we should still convince people to stop touching it directly, at least until we redesign it or something. All right.
One last question. So, a question on that specific slide: most of it is red, and that's kind of blocking, let's say, adoption of cgroup v2 everywhere. Do you have more or less a gut estimate of a rough timeline for when that is going to get greener, at least reaching the top levels of green? So actually, I mentioned Giuseppe's PR, which was created and merged after this slide, so that actually unblocks a lot of this stuff, and I'm fairly sure it unblocks the kubelet connection, so kubelet via libcontainer can use cgroup2 already, and I assume CRI-O as well. Yeah, okay. Yeah, so that unblocks most of it, yeah. All right, thank you, Felipe.
Thank you.