
Cgroupv2: Linux's new unified control group hierarchy


Formal Metadata

Title
Cgroupv2: Linux's new unified control group hierarchy
Series Title
Number of Parts
47
Author
Contributors
License
CC Attribution 3.0 Unported:
You may use, adapt, copy, distribute and make the work or its content publicly available in unchanged or adapted form for any legal purpose, provided that you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Publication Year
Language
Producer

Content Metadata

Subject Area
Genre
Abstract
Cgroupv1 (or just "cgroups") has helped revolutionise the way that we manage and use containers over the past 8 years. A complete overhaul is coming -- cgroupv2. This talk will go into why a new control group system was needed, the changes from cgroupv1, and practical uses that you can apply to improve the level of control you have over the processes on your servers. We will go over:
- Design decisions and deviations for cgroupv2 compared to v1
- Pitfalls and caveats you may encounter when migrating to cgroupv2
- Discussion of the internals of cgroupv2
- Practical information about how we are using cgroupv2 inside Facebook
Transcript: English (automatically generated)
OK, so hi. My name is Chris Down. I work at Facebook London as a production engineer. I'm going to be giving kind of a whistle-stop tour of the new version of control groups added in Linux 4.5. Don't worry if you haven't the faintest idea what a control
group is yet or you only vaguely know. We'll go over some basics in the next couple of slides. So like I said in this talk, I'm going to be going over the next version of cgroups. I'll give a short introduction, kind of what they're for, where you may have seen them. If you already know something called cgroup, you've almost certainly been using version 1 of cgroups.
Version 1 has been out since 2008. And in many ways, it's kind of helped to kickstart our love of containerization and process management. It's kind of the backbone of a lot of systems like systemd and Docker and that kind of stuff. So we've been using it all over the place since then. So obviously, it has a bunch of good functionality. Unfortunately, it also has a ton of caveats and issues
and kind of usability shenanigans, which make it really difficult to use sensibly. Cgroup v2 is our attempt to fix these and improve it. And it's under a lot of active development now. Cgroup v1 is mostly in maintenance mode. So I want to go over kind of why we needed to introduce
a new major version of cgroups, why we couldn't just do more improvements to version 1. I also want to go over some of the fundamental design decisions in Control Group v2. And another thing is that Control Group v2 is being made also to enable a bunch of future improvements. So I want to go over what's ready for use in production
and what is still kind of in the pipeline. The general idea is that the core is ready, but we still have a whole bunch of more goodness yet to come. So first, a little bit about me. I've been working at Facebook for about three and a half years now. I work in this team called Web Foundation. Technically, Web Foundation, as you'd expect by the name, is responsible for the web servers at Facebook.
But web servers are generally not a super complicated thing. So we also kind of act as probably the closest thing that Facebook has to an SRE team. So we delve into the whole stack at Facebook. We deal with production issues, basically all kinds of issues across the stack. We own incident resolution in production at Facebook.
As you'd imagine then, if we own this very large piece of Facebook in general, we have a whole bunch of different types of people to support us in that. So we have system debuggers, that's mostly what I do. Most of my background is in system administration and system debugging. We also have domain experts in our cache architecture,
RPC, task scheduling. We also have experts in things like Hack and HHVM, which is our JIT compiler for PHP. And we're not working on cgroups. Most of my time is spent dealing with these kind of systemic issues across Facebook and dealing with services issues as they come up. So this brings me on to why I give a shit about cgroups. So we have many, many, many hundreds of thousands
of servers at Facebook. And we run a bunch of services on those servers. And I care a lot about limiting the failure domains that we have across Facebook. The reality at Facebook is that most outages are not a single service having problems. They tend to be failures across multiple services,
sometimes cascading failures. And we want to be able to restrict those. We don't want multiple services to be able to affect each other like that. There are a lot of things that you need to do to be able to get to the stage where you're comfortable saying cascading failures are somewhat mitigated. One thing you need to do is understand and isolate your dependencies and try and minimize the dependency chain of applications.
But another huge thing that you need to do is to be able to stop other processes on the machine from affecting the thing that you actually want to do, from stopping you from doing that. So if we look at kind of a typical server, so you have the core workload on pretty much every machine. It's the main thing that you want to do on that machine.
For example, on our web servers, that would be HHVM, which runs our PHP code. Or on our load balancer, it would be, for example, like HAProxy or Proxygen, which is our load balancer. This is the thing that you do on that machine. This is the thing that if you were to describe to somebody else what that machine does,
you would say it was this. There are also a bunch of system processes. Most large companies and even small companies nowadays have this kind of tax, which is essentially processes which you have to run to work inside your infrastructure. These typically help the core workload in some way. They might be a dependency for the core workload, or they might be run to keep the system working.
But they tend to be less important than the main workload. For example, Chef is really vital for an up-to-date machine. But if there was some bug in your cookbooks or it can't run successfully and it's constantly thrashing the machine, you don't want to stop web requests being run for that. You want to deal with that separately. Then you have these kind of rarer things
like ad hoc queries and debugging. These are things that you typically only know that you need reactively when you're dealing with an incident. So these can vary in importance. Some of them you want to run in the background. Some of them you actually actively want to interrupt the main workload for. And we want to kind of give people the power to dynamically determine the importance of these things.
So this is a kind of problem that is really a very good use case for cgroups. So in the previous slide, we talked about multiple processes fitting into each of these groups. A control group can consist of as many or as few processes as you like. And you can set limits and thresholds as tightly or as flexibly as you like for a service or application.
For example, you can have all processes which relate to a particular service in one cgroup, or do whatever you like. We don't impose a structure on you. That's the idea of cgroups. The idea is the framework should be flexible to your requirements and not impose a direct structure on you. So a cgroup is a control group. They're one and the same. As you've probably guessed by now,
they're a system for resource management on Linux where resource means something that processes share. CPU, memory, IO, that kind of stuff. Management is a bit more complicated. You're probably already thinking of things like the oom killer. That is one option you have. We also have some kind of more subtle things which you can do with cgroups like throttling. We also provide a bunch of accounting,
so every single part of the cgroup hierarchy exposes metrics about what's going on, which makes it much easier to debug. Cgroups are typically not a very complicated thing to manage because they are directories at /sys/fs/cgroup. We don't have a system call interface for the very core parts of cgroups. We do have one for some more esoteric parts.
But this means that creation, deletion, modification, that kind of stuff is all doable by your typical application. All you need to do is mkdir, rmdir, right? I hope whatever hip language people are using nowadays still supports those things. This makes it really, really trivial to interact with cgroups no matter what you're using.
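As a rough sketch of the kind of shell interaction being described here and in the next few sentences, assuming a v1-style memory controller mounted at /sys/fs/cgroup/memory (the cgroup name "demo" is made up):

    # A cgroup is just a directory under the controller's mount point.
    mkdir /sys/fs/cgroup/memory/demo

    # The kernel populates it with control and accounting files.
    ls /sys/fs/cgroup/memory/demo

    # Move the current shell into the cgroup by writing its PID.
    echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs

    # Set a limit by printing a string into a control file (512 MiB here).
    echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes

    # Read back accounting information.
    cat /sys/fs/cgroup/memory/demo/memory.stat

    # Move the shell back out and remove the (now empty) cgroup.
    echo $$ > /sys/fs/cgroup/memory/cgroup.procs
    rmdir /sys/fs/cgroup/memory/demo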
You can just go into your shell and cat some files or ls some directories, and you get information about the cgroup hierarchy. Each resource interface is provided by what's called a controller. This controller essentially provides files which you can manipulate, and it interacts with the kernel. So say the memory controller provides a file called memory.limit_in_bytes,
which allows you to set a memory limit. And when you set a value inside that file, just by printing a string to this particular kind of file, you change the way the kernel will behave. You tell the kernel, hey, I want you to do this for this particular cgroup. So as mentioned previously, workload isolation is a super large use case for cgroups. You might have one thing on your machine
which you want to run and a bunch of background services which you want to deprioritize. The same is true for things like asynchronous jobs. At Facebook, we have a large asynchronous tier which runs things which can be processed in the background in a different queue. And some jobs may have higher priority than others, or be longer running than others. But what priority means here
is typically very case specific. It might be that we give it more CPU, we allow it more access to the disk, we allow it more access to memory, or anything. Priority is generally discernible in terms of resources. Another third use case is these kind of shared environments like VPS providers which run containers
where you don't want to allow one particular customer to override the needs of another customer and start to effect or dip into their allocation. So you might be thinking at this point, hey, what the fuck is this guy talking about? My favorite product already has this functionality. Why do I need to talk about cgroups? Well, it might be true that your favorite product does have this functionality,
but if it's been made in the last eight or nine years, it almost certainly uses cgroups under the hood. Cgroups are the most mature interface that we have in the kernel for managing resource management and resource allocations. And these are generally the way forward. I think it's generally pretty accepted by now that despite cgroups being kind of the feature
which kernel developers love to hate, they are kind of the way forward. And even if your product doesn't use them, it almost certainly should be at this point. So let's take a look at how this works in version one. It's very important to understand how version one works to be able to understand how version one doesn't work. So like I mentioned, if you've had some interaction with cgroups in the past,
it's almost certainly been with version one. Version two has been in development for over five years now, six years I think now, and it's only become stable though quite recently in kernel 4.5. Even on recent kernels, you'll discover version one is typically used by default. And what I mean by that is the kernel boots supporting both,
but typically your init system only mounts the version one hierarchy. The kernel by default also only typically enables the controllers, like the memory controller or the CPU controller, for the version one hierarchy. These controllers can only exist in either version one or version two. You can run in this kind of mixed mode.
So in recent versions of systemd, we actually mount both the version one and version two hierarchy, but we don't actually use version two for resource control. We only use it for some systemd internal stuff. And what we really need is this resource control, which is what we're going towards. The reason that we still mount the version one hierarchy and still use it mostly is for backwards compatibility.
Most applications don't give a shit which cgroup hierarchy you use, they don't look at it. But applications like Docker, for example, or systemd as well, have to support cgroup v2 if they're going to actually try and use the hierarchy to do things. For most applications, it's completely transparent. But for those lower level applications,
it tends to be quite important. So this talk is also kind of a sell on why you should care about cgroup v2 and why you should work to support and understand it. Because understanding how version one works is really key to understanding the improvements that have been made in version two. So in version one, /sys/fs/cgroup contains controller names, or resources,
as directories at the top level. Resources like CPU, memory, PIDs, IO, that kind of stuff. Inside these directories are hierarchies for each resource. You can see inside here we have the PID controller, which contains a bunch of different slices, which is just systemd terminology, but these are all essentially cgroups. They are directories which are cgroups.
And each directory inside here will contain files which are related to the business of controlling process IDs. So each resource here has its own resource distribution hierarchy. Resource A here could be memory. Resource B could be CPU. And one thing to note is, even if cgroup three here in resource B had the same name as cgroup one in resource A,
say they're both called foo.slice, from the kernel's perspective, they have absolutely no relation to each other, even if they contain the same processes, which has some really interesting and somewhat negative implications, which I'll come back to later. You might also note that the cgroups are being nested inside each other in this example. For example, cgroup two is a child of cgroup one.
Generally what this means is that cgroup two inherits the properties of cgroup one and can set more restrictive limits inside its own cgroup. So one PID is in exactly one cgroup per resource in cgroup v1. So PID two here is explicitly assigned to resources A and C, but we didn't explicitly assign it in resource B,
so it's in what's called the root cgroup. The root cgroup is at the base directory for this resource controller. For memory it would be /sys/fs/cgroup/memory. The root cgroup is essentially limitless. It's not very useful. It's generally for things which we've not categorized at all. You still get some kind of accounting, but that's basically it. You don't really get anything.
These things are essentially unlimited. So here's a concrete look at how this looks in cgroup v1. Like I say, I really wanna reiterate this because otherwise the rest of this talk is gonna make no fucking sense. So the cgroup file system is typically mounted at /sys/fs/cgroup. Inside you have these resources, like memory, CPU, that kind of stuff. You can have a single PID in cgroup foo in one resource,
but cgroup bar in another. You don't have to have them in the same cgroup in different resources. And again, even though we have two cgroups here, it seems, one named adhoc and one named bg, there are actually four cgroups. From the kernel's perspective, even if they have the same name, they're completely unrelated, and this has a bunch of negative effects.
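A rough sketch of the v1 layout being described, reusing the adhoc and bg names from the example (PID 1234 is made up); note how the same process can sit in differently named cgroups under each resource:

    # In v1, each resource has its own hierarchy under /sys/fs/cgroup.
    ls /sys/fs/cgroup
    # blkio/  cpu/  memory/  pids/  ...

    # The same PID can live in a differently named cgroup per resource.
    echo 1234 > /sys/fs/cgroup/memory/adhoc/cgroup.procs
    echo 1234 > /sys/fs/cgroup/cpu/bg/cgroup.procs

    # Hierarchies where the PID was never assigned leave it in that
    # hierarchy's root cgroup, which is essentially unlimited.
    cat /proc/1234/cgroup
    # ...:memory:/adhoc
    # ...:cpu,cpuacct:/bg
    # ...:pids:/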
So let's take a look at how this works in cgroup v2, now that we've talked about cgroup v1. So in cgroup v2, you might notice now, at /sys/fs/cgroup, we no longer see the names of resources. We used to see memory, CPU, IO, that kind of stuff. Now we just see background.slice, workload.slice. We just see the cgroups themselves. So how does the cgroup know which resource it should apply to?
So the answer is it doesn't. The way this works is almost entirely inverted. So now cgroups are not created for a particular resource. Resources instead are enabled or disabled in a particular part of the cgroup hierarchy. This means that we have a single hierarchy to rule them all. We don't need to have disparate hierarchies
for every single resource, which has a bunch of positive effects, which I'll go into in a moment. This means that you explicitly opt into, say, having the CPU controller enabled in a particular subtree of the cgroup hierarchy, and once you've opted in for this, we give you files like how much CPU we should give this application compared to other applications.
So in cgroup v2, we have a similar hierarchy here, but note the differences. Instead of having four cgroups like this, we now have two like this. Instead of also having a cgroup per resource, we now have resources per cgroup, which allows us to opt in to resources that we care about on the fly. You don't have to build these things as you go along.
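A rough sketch of what this unified layout looks like from a shell, together with the cgroup.subtree_control mechanism described just below (assuming cgroup2 is mounted at /sys/fs/cgroup; the slice names are illustrative):

    # In v2 the top level contains cgroups, not resource names.
    ls /sys/fs/cgroup
    # background.slice/  workload.slice/  cgroup.procs  cgroup.subtree_control  ...

    # Resources are enabled per subtree by writing to cgroup.subtree_control.
    echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control

    # The children then gain the corresponding control files.
    ls /sys/fs/cgroup/workload.slice
    # cgroup.procs  io.max  memory.high  memory.max  memory.stat  ...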
As you can see here, in version one, we have a cgroup hierarchy per resource. That is, cgroups only exist in the unique context of a particular resource. They are not universal. And remember, again, that these cgroups, even though they have the same name, have no relation to each other. So the way that this works in cgroup v2
is you write to this file, this magical file called cgroup.subtree_control, and you write, say, plus memory, plus CPU, whatever, which particular resource you want to enable, and when you do this, files related to that resource appear in that cgroup's children for use. So what are the fundamental differences we're talking about here? So obviously the big one is this unified hierarchy where resources apply to cgroups now
instead of cgroups applying to resources. This is extremely important for some extremely common operations in Linux. A classic case is kind of a page cache writeback. These are operations which transcend a single resource, because for example, a page cache writeback is CPU, IO, and memory all at the same time, and it was previously really difficult to decide
what operations are sensible to perform to reduce pressure. It's also really difficult to account for these things since we have different hierarchies for each resource. In version one, we can't tie one cgroup's actions in one resource to another cgroup's actions in another resource because they're not required to contain the same processes.
So with this single hierarchy, we now have a single thing to rule them all, and we can make decisions with much better context across the system. We also now in version two have granularity at the TGID level, not the TID level. The reason for that is because without extensive cooperation, it generally doesn't make sense to have thread granularity for the cgroup control.
The reason for that is generally, you need a cgroup manager, a single thing in your system which does cgroup distribution across the system, and you need to expose your program intention somehow. You need to expose this thread does this, and you should put it in this cgroup, and this thread does that, and you should put it in this cgroup, and so forth.
There's no real standardized way to do that in Linux. You can, for example, set the comm of your thread and somehow set something to regex match on the comm of your thread, but this is all kind of sideways because the real problem here is also that a lot of resources don't make any sense at the TID level. Like in version one, there was a non-trivial amount of people who were setting different memory cgroups
from different threads of the same process, which doesn't make any fucking sense in the vast majority of cases. It is kind of vaguely deterministic, but it generally doesn't work and doesn't do what you would expect. So we do actually, in version two, have also some more restricted APIs
for thread control where possible. Tejun, who is one of the primary authors of cgroup v2, recently introduced this thing called rgroup, which is essentially a way to do thread control for some resources which makes sense, but these things are local to the process. So this is kind of limited to those use cases where it makes sense and has to be implemented per controller.
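To make the process-versus-thread distinction concrete, a small sketch of the two interfaces (the cgroup names and the TID/PID variables are made up):

    # v1: the per-cgroup "tasks" file takes TIDs, so individual threads of
    # one process could end up in different cgroups of the same resource.
    echo "$TID" > /sys/fs/cgroup/memory/foo/tasks

    # v2: cgroup.procs takes PIDs (thread group IDs); writing any thread of
    # a process moves the whole process.
    echo "$PID" > /sys/fs/cgroup/foo/cgroup.procs

    # (Later kernels add a restricted cgroup.threads interface for the
    # thread-aware mode being discussed here.)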
You can't just willy-nilly put them in a particular controller where it doesn't make any sense. We also have this major focus on simplicity and clarity over ultimate flexibility. In many places in version one, design followed the implementation because it wasn't clearly known like what the use cases were at the time.
Some flexibility in version one made implementation really, really difficult. For example, this per-thread control, like people putting threads in different memory cgroups and trying to account for things that cross multiple resource domains. And the idea here is that we should provide a framework that guides towards a correct solution by default. You shouldn't have to muck around
in the documentation forever to work out how your thing is even going to basically work. Another new feature in cgroup v2 is this thing called the no internal process constraint. So this means essentially that cgroups cannot create child cgroups if they have processes and they also have controllers enabled. To put it another way, the cgroups in red here
either have to be empty or they have to have no controllers enabled at all. They have to have no memory, no IO, that kind of stuff. This is for a number of reasons. One of the primary ones is that generally, child processes don't make sense to compete with their parent for resources. And generally, doing that can be kind of hard technically. Another reason is that we have to make
some implicit decisions about what this means. Say I put a bunch of processes in cgroup I here and then I also put a bunch of processes in cgroup J. Now we have to make a decision about how we're going to consider two different types of objects. One, a child cgroup which is J and two, a single process which is contained
within the I cgroup. We can do all sorts of things. One of the things we did in version one was implicit cgroup creation. So if you put some set of processes in I, there would be this implicitly created I prime cgroup which would contain those processes and they would share kind of cgroup contention. But it doesn't make any sense to do this implicitly
and it's usually not what anyone thought would happen. So this is why we've moved to, you have to kind of explicitly put things at the leaves. You might also notice that the root cgroup is not red. The reason for that is the root cgroup is a special case for general system consumption, for things which we have not categorized. How the root cgroup is handled is entirely up to the controller.
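A minimal sketch of how this constraint shows up in practice (directory names are invented; assumes a mounted cgroup2 hierarchy): the kernel refuses to mix member processes and delegated controllers in the same non-root cgroup.

    # (Assumes the memory controller is already enabled in the root's
    # cgroup.subtree_control.)
    mkdir /sys/fs/cgroup/parent
    echo $$ > /sys/fs/cgroup/parent/cgroup.procs       # parent now has a process

    # Trying to hand controllers down to children now fails...
    echo "+memory" > /sys/fs/cgroup/parent/cgroup.subtree_control
    # write error: Device or resource busy

    # ...until the processes are pushed down into a leaf cgroup.
    mkdir /sys/fs/cgroup/parent/leaf
    echo $$ > /sys/fs/cgroup/parent/leaf/cgroup.procs
    echo "+memory" > /sys/fs/cgroup/parent/cgroup.subtree_control   # now succeeds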
The controller has to make a decision about how to prioritize the things which have not been categorized at all from the things which have been categorized. So obviously breaking the API is kind of a big deal. This is a very major kernel API. So you need a good reason to do this. So the reason here is like, version one worked acceptably in some basic scenarios,
but it gets exponentially complicated and not very usable in complex use. As I mentioned before, in version one, design often followed implementation. And the problem with that is reworking kernel APIs after the fact is really, really hard. Like generally you cannot change kernel APIs
after you've defined them clearly. So we kind of needed the API break there. Even for stuff which was designed up front and had like explicit design goals, the use cases for cgroups in 2008 when it was invented were not really that well fleshed out yet. It was kind of hard to work out at the time how cgroups would eventually be used.
This led to a bunch of over flexibility in places that you don't want it. And it also led to a whole bunch of complexity in places which should be simple, even the basic building blocks of cgroups. So to fix these fundamental issues in cgroups, we kind of had to create cgroup v2 because it fundamentally changes the way we think about resource control.
So I'm hoping you're still with me because I've gone over a lot of what we've changed, but not a lot of why we've changed it. So it really is important to understand what we've changed because otherwise the next section is not gonna make any sense at all. So I wanna go not only into what we've done, but why we've done it. What does cgroup v2 bring us
that we didn't have in version one? So pop quiz, it's Q&A time. When you write to a file in Linux, what happens? Don't be scared. Kyle, would you like to give an answer? He would not like to give an answer. Over there.
User space buffers in the program and then things can get buffered at like in the kernel level and then usually after the buffers kind of trickle down, there's the block disk writes. Absolutely correct. So that's totally, did everyone get that?
So basically the basic principle is there are a lot of layers of caching and buffering. The main one we're looking at here is the page cache. So when you write to a file in Linux, you issue a write call or whatever, and your write call may return almost immediately. And that's because what you've actually done
is not write a file to the disk, you've written a dirty page or some dirty pages into the page cache, into memory. And at this point, your write call has returned with success. So hooray, your process can continue. But of course in the real world, it's not actually done. Your application can continue pretending it's done, but it's not actually done.
So eventually this dirty page needs to make its way back to the disk. It needs to make its way back to the storage device, which it's supposed to go to. But when does it get written to the disk? How does it get to the disk? Who writes it to disk? Well, the dirty pages here were made on behalf of your application, but the flush to disk could happen an indefinite amount of time after, depending on your particular sysctls.
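For reference, a few of the standard knobs and counters involved in the buffering being described (whatever your system has set is what you will see; these are not recommendations):

    # How long a dirty page may sit in the page cache before the flusher
    # threads write it back, and how often those threads wake up (centiseconds).
    sysctl vm.dirty_expire_centisecs vm.dirty_writeback_centisecs

    # How much memory may be dirty before background writeback kicks in,
    # or before writers are throttled.
    sysctl vm.dirty_background_ratio vm.dirty_ratio

    # Current amount of dirty and in-flight writeback memory, system wide.
    grep -E '^(Dirty|Writeback):' /proc/meminfo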
But the main point is these two actions are kind of disconnected. Like the eventual write to disk is completely disconnected from the write which you first made. So in cgroup v1, these page cache writebacks went to the root cgroup. They were essentially completely limitless. For some workloads, this can be a huge amount of IO and memory. Like a lot of IO in some workloads
can just be doing page cache writebacks, and we couldn't account for them. We couldn't even tie them back to your application. And not accounting for these means not only a bunch of IO is not available for accounting, but a bunch of memory is also not available for accounting. We can't account for these dirty pages. We can't tie them back to your application. So in cgroup v2, we actually track these actively
and map the request back to the original cgroup. So we're able to account these page cache writebacks back to your application and say this application was responsible for the pages which are now being written to disk and charge it say to your IO controller. So now we can also understand the relation of IO and memory for a writeback,
which we previously couldn't do since we had different hierarchies for every single resource. This also applies to some other kinds of things like imagine you're receiving a lot of packets from the network. That takes a non-trivial amount of kernel CPU in some circumstances. A lot of it can be offloaded, but in general, yes, it does take some amount of CPU. And it's also difficult to account for that
in cgroup v1 because we simply cannot say this action which occurred in the past is now related to your process. We couldn't tie those things together because we had no way of tagging those packets, or the CPU that was involved, as being related to your process eventually. So in cgroup v2, we can now do these things and perform some kind of reasonable methods
of reclaim or whatever you want to do to your process based on the limits. More things can be accounted towards your process limits. v2 is also generally better integrated with subsystems. So in version one, most of the actions we could take, for example, in the memory controller, were pretty violent. They were pretty crude, generally.
For example, pretty much the only sensible action you could take against the process if it violated some memory limit you set was to oom kill it, which is generally not what applications like. Generally, applications don't respond very well to being kill -9'd. That's not usually like the way applications like to be treated.
There is another way, which is we also had this thing called freezer. So what freezer would do is instead of oom killing a process, we would say, okay, we're gonna freeze it at this point in time. So we would essentially leave it there and some other system with some other context would come and make a decision whether we should unfreeze the process by raising the limits or whether we should do things like kill the process
or get a stack trace from the process and kill it. It was totally up to you. The problem was freezer in v1 literally more or less stopped you at whatever stack you were in. We could be in some very deep kernel stack and you would just be told stop. And the problem is a lot of these things are not resumable.
You can't just stop and expect it to go well after you start again. We also had a whole bunch of problems with one of the key things people wanted to do with freezer is go and grab a stack trace and then kill it. But in a lot of cases, these processes would just go into D state and would never be able to come out of it again. So when you try and attach GDB to that process, it would also go into D state,
which is not at all what you wanted at all. So it was kind of all sorts of fuckery and shenanigans going on with this freezer in v1. And yeah, it was just not workable really. So really the only option you had was to kill shit outright, which was not ideal. So we did have this,
there's a tiny note about it at the bottom. We did actually have a soft limit in version one. However, it doesn't work. It's like, it's very difficult to reason about how it will work at any point. It has a bunch of heuristics around cgroup-local memory pressure, global memory pressure, the phase of the moon, that kind of shit. It's like very, very hard to reason about.
It basically is impossible to reason about. So we can just pretend for the time being it doesn't really exist. So in cgroup v2, we have much clearer thresholds on these hard-limitable resources. For example, we have memory.low, memory.high, memory.max, where low and high are best effort.
And if we had min and max, then they would be kind of absolute thresholds. For example, on memory.high, we do direct reclaim. So direct reclaim is essentially where we try and scan the page table and find some pages to reclaim. We do this when you allocate some more memory. So say you malloc or sbrk or whatever,
and if you were above the memory.high threshold already, we will try and scan and reclaim pages from the working set. This works whether or not your application actually successfully reclaims pages. Because if you successfully reclaim pages, then good, we're back under the limit again and it doesn't matter. It's like nothing ever happened.
If you don't reclaim pages, we have to scan a whole bunch of the page table before we allow your application to continue. So it acts as this kind of primitive slowdown for your application, which is kind of agnostic to your application. This works well in some scenarios and doesn't work very well in some other scenarios, but it allows you to have kind of a more granular control of how you want to treat applications which behave not as you expect.
One way that you can use this is to deal with temporary spikes in resource usage by slowing down an application instead of just killing it outright. For example, if your application at a certain point of execution always spikes to a certain point of resource usage, instead of just killing it every time it gets there, you can slow it down for that short period and then continue running.
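A small sketch of setting these thresholds on a v2 cgroup (the cgroup name and the numbers are illustrative):

    # Best-effort protection: try not to reclaim below this amount.
    echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/workload.slice/memory.low

    # Soft ceiling: above this, allocations trigger direct reclaim and the
    # workload gets slowed down rather than killed.
    echo $((1024 * 1024 * 1024)) > /sys/fs/cgroup/workload.slice/memory.high

    # Hard ceiling: the OOM killer is the last resort above this.
    echo $((2 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/workload.slice/memory.max

    # (Later kernels also add memory.min as a hard guarantee.)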
We also have a new notification API. So notifications are essentially a way to tell something when a cgroup has changed state. So it could be that we have no more processes in the cgroup, which means all of the processes there have ended, or something oomed, or generally some action occurred in your cgroup.
systemd uses this under the hood to track processes and track process state. It uses it basically to manage which services are in a particular state and keep track of the system. So in version one, we actually do support this, but it can get really expensive. So in version one, to know when you have no more processes to run, you have to specify
what's called a release agent ahead of time. This release agent is literally a binary. It's just like you give a path to cgroups and it will exec that path every single time that this cgroup has no more processes in it. The problem is there are some asynchronous workloads which will create thousands and thousands
of cgroups a second legitimately. And that means you have to do thousands and thousands of clones a second as well, which is a non-trivial amount of resource usage just on cloning shit. So we also have other events which you can look at in cgroup v1, like say if something oomed in the cgroup. These are done through the poll interface, through the eventfd interface.
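A rough sketch of the two styles being discussed: the v1 release_agent mechanism just described, and the v2 cgroup.events file that the inotify-based approach mentioned next can simply watch (paths and the agent script are made up; inotifywait comes from inotify-tools):

    # v1: register a binary at the hierarchy root that the kernel execs
    # every time a cgroup with notify_on_release set becomes empty.
    echo /usr/local/bin/cgroup-release-agent > /sys/fs/cgroup/memory/release_agent
    echo 1 > /sys/fs/cgroup/memory/demo/notify_on_release

    # v2: state changes land in a plain file...
    cat /sys/fs/cgroup/workload.slice/cgroup.events
    # populated 1

    # ...so ordinary file-watching tools can pick them up.
    inotifywait -m /sys/fs/cgroup/workload.slice/cgroup.events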
And this generally works, but since these are files, it also makes sense to have inotify support. So now we also support inotify events, which makes sense since we're treating the cgroup hierarchy as a bunch of files and directories. And generally this is kind of a more intuitive API. We do still have the old ways of doing this,
but generally inotify is kind of a more sensible way of doing this overall. And this makes getting these notifications way less expensive than they were in v1. So utility controllers kind of also make sense now. Utility controllers are controllers that don't manage a resource directly, but for whatever reason want to have
their own cgroup hierarchy. Generally, they allow a user space utility to take some kind of actions based on the hierarchy. For example, in version one, the perf tool has a cgroup hierarchy called perf_event. And perf is this tool which does performance tracing in Linux, and the perf_event controller here has its own hierarchy to monitor and collect events
for processes in its cgroup hierarchy. The same goes for freezer, which also had its own cgroup hierarchy, and that encountered some problems because typically what you actually want to do is take the cgroup hierarchy from some other particular resource and mimic it in the perf_event hierarchy
or mimic it in the freezer hierarchy. So you would have to do all sorts of crazy things like copy over all the different processes. It was bound to a bunch of race conditions and generally didn't work very well. So in version two, this is not a thing anymore because we have one hierarchy. So perf and freezer and everything all share the same hierarchy.
You don't have to do any copying anymore. Whereas in version one, this was kind of prone to failure and esoteric bug reports on mailing lists. In version one, we also have a bunch of inconsistency between controllers. This kind of manifests in two typical forms. One is inconsistent APIs between controllers which do almost exactly the same thing.
For example, for CPU, we have this shares API and for block IO, we have this weight API and they're completely unrelated to each other even though they basically do exactly the same thing. So in version two, we've made an explicit effort to have APIs that have similar possibilities for implementation to be as similar as possible.
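A sketch of that inconsistency next to the unified v2 naming (cgroup paths invented; ranges and defaults as documented for each controller):

    # v1: two knobs that mean roughly the same thing, with different names,
    # different ranges and different defaults.
    cat /sys/fs/cgroup/cpu/foo/cpu.shares      # relative CPU weight, default 1024
    cat /sys/fs/cgroup/blkio/foo/blkio.weight  # relative IO weight, default 500

    # v2: matching names and a matching range (1-10000, default 100).
    cat /sys/fs/cgroup/foo/cpu.weight
    cat /sys/fs/cgroup/foo/io.weight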
This has both been an intentional goal and generally having a unified hierarchy makes this an obvious path to take. The second inconsistency between controllers is inconsistent semantics between different kinds of resources. So most cgroups, especially the core cgroups, like I mentioned, inherit their parents' limits.
If you have a child of a cgroup, it inherits its parents' limits and you can set more restrictive limits in the child. But some resources treated the cgroup hierarchy as almost like a dream or like something which you didn't even have to think about. The net controllers were kind of a classic example, which didn't really care about the cgroup hierarchy.
They just treated it as if it was one flat thing. So people were really confused when they tried to use these controllers. So the unified hierarchy kind of helps us towards avoiding these inconsistencies in controllers and we apply the same rules to controllers equally. They generally cannot deviate from the set of expectations that we have.
Another very severe problem is that some things in v1 were just simply impossible. For example, when memory limits were first made, we had this file called memory.limit_in_bytes and we went, whoopee, we have a memory.limit_in_bytes. But the problem is eventually it was known
that this covers a very limited set of memory types. And we couldn't really add other types of memory to be accounted for in memory.limit_in_bytes because again, it's a kernel API. It's a stable kernel API and you can't really change it. So you eventually ended up with a bunch of different memory types, each in their own file. So you didn't just have memory.limit_in_bytes. Now you have memory.kmem.limit_in_bytes,
memory.kmem.tcp.limit_in_bytes, one for swap, one for socket buffers. You had a different limit for every single type of memory. This poses an incredibly bad problem. So you have two choices now. Either you only set memory.limit_in_bytes and you accept the fact that your application is not actually bound by that limit
because it only accounts for a very small number of memory types. Or you set limits on every single memory type and you cry when you allocate one TCP buffer too many because you're going to get oom killed because you allocated one TCP buffer too many. I don't know about you, but when I'm writing an application, I don't usually think to myself,
ah yes, I have a very specific number of TCP buffers in mind for this application. Generally, this is not how people think. So in version two, this unintuitive behavior has resulted kind of in more unified limits. We just have this thing called memory.high, memory.max. We've tried to make these things encompass
all the types of memory that we possibly can. And this is kind of a trade-off between this flexibility and overall usability. From practical use and from talking to people, we know that merging these into a global memory limit generally makes the most sense for most workloads. This means you don't get those nasty surprises like, oops, I allocated one too many socket buffers and I got oom killed.
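A sketch of the proliferation being described next to the unified v2 knobs (cgroup names invented; exactly which v1 files exist depends on your kernel configuration):

    # v1: a separate limit file per memory type.
    ls /sys/fs/cgroup/memory/foo/ | grep limit_in_bytes
    # memory.limit_in_bytes
    # memory.kmem.limit_in_bytes
    # memory.kmem.tcp.limit_in_bytes
    # memory.memsw.limit_in_bytes
    # memory.soft_limit_in_bytes

    # v2: one high and one max threshold that also cover kernel memory,
    # socket buffers and so on; swap gets memory.swap.max, and process
    # counts moved into the separate pids controller mentioned next.
    cat /sys/fs/cgroup/foo/memory.high /sys/fs/cgroup/foo/memory.max
    cat /sys/fs/cgroup/foo/pids.max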
And if you really need separate limits, the proper way to do that is to have a new controller. You create a new controller which does this particular type of limiting. And that's what we did, for example, with the PID controller because it was originally thought you could limit the number of PIDs by limiting the amount of kernel memory in certain things. But it turns out that's really, really hard. So we have now a PID controller
which does that separately. So if you go to facebook.com now, you will touch a web server which has cgroup v2. We're running a cgroup v2 pool in the tens of thousands of machines. Easily the largest cgroup v2 pool in the world. We're investing heavily in cgroup v2
for a bunch of reasons. Like I said, my main concern is limiting the failure domain of applications and getting kind of this better handle on how system services are working across Facebook. Also being able to manage the resource allocations in your data center more efficiently is a big win, especially if you have a huge number of servers. We run cgroup v2 managed with systemd.
My friend Davide Cavalca over there gave a talk yesterday about that, which I'm sure if you didn't attend, then very, very sad. But you'll be able to find the video later. And we're a huge contributor to the core of cgroup v2 and systemd's cgroup support. And we're continuing to kind of drive innovation here.
We have a lot of open issues against systemd and a lot of development which is being done. So cgroup v2 has been stable for a little while now. That doesn't mean there isn't still work to be done here. The core APIs are stable, but there's still a bunch of functionality we're working on. When thinking about cgroups, most people think of three things, CPU, IO, and memory. The CPU controller is very important,
but unfortunately it's not merged until 4.15. The reason for that is the CPU controller folks had a number of reservations about some things which we were doing. Tejun especially has been working very, very hard to mitigate their concerns, which is one of the things which led to this rgroup API being made. So now we have it merged in 4.15,
which is not even stable. So eventually we will get there. As kind of a bonus, I do want to go over one thing we're using cgroup v2 for and one thing we want to provide as part of cgroup v2. So one thing we've never really had in Linux is a measure of memory pressure. We have a bunch of related metrics like memory usage and buffer usage,
and we can also look at the number of page scans, but with these metrics alone, it's hard to tell the difference between extremely efficient use of a system and overuse of a system. It's kind of hard to tell the difference. So one proposed measure here is to track page refaulting. The way it would essentially work is when you continually reclaim a page
and fault it back in again, we will account for this, we will measure this, we put it in a particular counter, and then we look for, did this page get refaulted X times in say 100 milliseconds or a second? And that's a good measure for things like, are we exceeding our limits?
Are we constantly reclaiming something because we consider it's not in use and then having to fault it in again a second later? So this is one place that we're exploring as a potential measure for memory pressure. So as for future work, like I said, we currently have IO and memory accounting for page cache writebacks,
but we don't have CPU accounting. If you spend CPU there, we currently can't account for that, and that's something we're working on. V2 also has a bunch of different improvements in what types of IO we can account for. One thing we still can't account for is some kinds of file system metadata. So if you are, cough, Apple, and you store all of your files in extended metadata,
it's probably not going to end very well for you. I also just talked about these refault metrics for detecting memory pressure. And another thing we're working on is this freezer for v2, which will use semantics which are much more similar to SIGSTOP instead of just freezing you where you stand and possibly never coming out again.
So I've talked a lot to try and sell you on cgroup v2. Hopefully you're interested in trying it out yourself. With systemd, these are the flags that you need. You essentially need to disable the cgroup v1 controllers and also tell systemd to mount the new hierarchy. You need a kernel above 4.5 to do this. Before that, we do have unstable support, but it basically, yeah,
I wouldn't recommend using it before 4.5. So typically having your init system do this is a good idea, but if you do want to play around, you can also mount it directly by using the file system type cgroup2. So if you're interested in hearing more about control groups, come talk to me. I'm happy to go over anything that I've been talking about. I think I have no time for questions,
but if you want to come talk about it, come talk to me. And if you've used version one in the past, which you almost certainly have, and you've encountered the kind of problems that I've been going over in this talk, please do come talk to me and let's see how cgroup v2 can work for you. Thank you.
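For reference, these are likely the kind of boot options the speaker means at the end, along with the direct mount he mentions; exact option names depend on your kernel and systemd versions, so treat this as a sketch:

    # Kernel command line: disable the v1 controllers and have systemd mount
    # and use the unified (v2) hierarchy (needs a reasonably recent kernel
    # and systemd).
    #   cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1

    # Or just play around by mounting the cgroup2 filesystem type by hand.
    mkdir -p /mnt/cgroup2
    mount -t cgroup2 none /mnt/cgroup2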