cgroupv2: Linux's new unified control group hierarchy

Video in TIB AV-Portal: cgroupv2: Linux's new unified control group hierarchy

Formal Metadata

CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

cgroupv1 (or just "cgroups") has helped revolutionise the way that we manage and use containers over the past 8 years. A complete overhaul is coming -- cgroupv2. This talk will go into why a new control group system was needed, the changes from cgroupv1, and practical uses that you can apply to improve the level of control you have over the processes on your servers. We will go over: * Design decisions and deviations for cgroupv2 compared to v1 * Pitfalls and caveats you may encounter when migrating to cgroupv2 * Discussion of the internals of cgroupv2 * Practical information about how we are using cgroupv2 inside Facebook
Okay, so hi, my name is Chris. As I already mentioned, I work as a production engineer at Facebook London, as a member of Web Foundation. I'm going to be giving a whistle-stop tour of the new version of control groups added in Linux 4.5. Don't worry if you haven't the faintest clue what control groups are yet — since we're in the containers devroom, some of you probably have some kind of idea — but I'll go into what they are, where you may have used them, and a comparison of the old and the new in the next few slides.
So, like I said, in this talk I'll be giving an introduction to control groups: what they are, and where you may have encountered them before. If you've already used something called control groups — if you've poked around with them previously — you've almost certainly been interacting with version 1. Control groups have existed in the kernel for a long time, since around 2008, and they're one of the building blocks of containers as we know them. There's definitely a whole bunch of good things in v1, but there's also a whole bunch of problems — usability foibles and other things which are not so great — so I want to go into what those are, when you might encounter them, and how we've tried to improve on them in cgroup v2. cgroup v1 is now mostly in maintenance mode, whereas cgroup v2 is in active development. They share the same core; it's mostly the user-facing API which is different. I'm also going to go over what's already been done and what's still to be done: the core of cgroup v2 has been stable since kernel 4.5, but there's a whole bunch of work we still want to do — a lot of the core work has been to enable future features in cgroups which we're already working on.

So, a little bit about me: I've been working at Facebook for about three years, in this team called Web Foundation. Technically, Web Foundation is the team responsible for the web servers at Facebook, but web servers are usually not extremely complicated — they're stateless, they serve requests, and that's about it — so we actually ended up becoming a team which deals with general reliability at Facebook. This means we get involved with production discussions and incident response.
We generally act as the guardians of reliability at Facebook. Like I said, I work at Facebook London; we also have an office in Dublin, among other places. And we have a whole bunch of different kinds of people in Web Foundation: people who are experts in our stack, which is basically my expertise, and people who are experts in our cache architecture, RPC, the web tier, and so on — essentially a cross-functional group of experts who all come together when things really hit the fan.

So that brings up the question: why do we as a team care about cgroups, and why has Facebook come to care about cgroups? We have many hundreds of thousands of servers, and we have a bunch of services which run on those servers — some of them co-located on the same machines, some of them spread across machines; it's all over the place. There are a few kinds of outages at Facebook, but one very common one is failure in multiple systems, and there are a few things you need to do to mitigate that. One, of course, is actively mapping out your dependencies and making sure you have an understanding of them. Another big one is making sure that, when two services run on a single machine, you don't end up in a situation where one service completely overrides the other and renders it useless. That's one of my main concerns about cgroup v2, and one of the reasons why I'm so interested in it.

A typical use case is this: on a normal web server we have three types of running processes — and the same goes for most other kinds of servers as well. First you have your core workload: the thing which, if you were to describe to somebody else what your server was actually doing, you would name — it serves web requests, it does load balancing.
But you usually don't end up with only that on your server, especially if your company has been around for a while or you have opinions about how to architect things: you end up with a whole bunch of non-core services, a term which can probably be used interchangeably with "system services". This might be stuff which just comes with the territory, like kernel workers, or Kerberos daemons, or things related to your business needs — say, metric collection, so we can work out what's going on with a server and whether we're managing to serve users correctly. But it's really, really bad if that metric collection decides it's going to take up all the available memory on your server and then you can't actually serve users. Sure, it can tell me it did that, but I don't much care if it takes down the whole web server in the process. The same goes for cron jobs and Chef: I care about my server being up to date, but I would rather have a system which acts in a reasonable manner. If Chef starts taking a bunch of memory, I would rather it got degraded service and still managed to run than that it took all of the memory, or all of the CPU, and everything else suffered — I would much rather have that outcome.

Then there's a third class: ad-hoc queries and debugging. These are typically things you don't know you need, and they don't get run on the majority of servers — things you only realise you need when an incident is already happening — and we want people to be able to dynamically determine the importance of those things as the incident is going on: whether they should take precedence over the core workload of the machine or not. So this is a very good use case for cgroups.
Like I mentioned, if you've had some interaction with cgroups, you've almost certainly been interacting with version 1. Version 2 has been in development for around five years now; it only recently became stable in the Linux kernel, and even on recent kernels version 1 is typically what's mounted by default. The reason for that is, as you'd imagine with a different version number, the changes are backwards incompatible — I'll go over why we made them backwards incompatible in a moment. The fact that we typically boot with the init system only mounting the v1 hierarchy is a testament to why I'm doing this talk: I know we have a whole room full of container experts here, and this is also a sell to you about why you should care about cgroup v2, and why you should take the time to invest in it in your products.

In the previous slide we talked about multiple processes fitting into each of these cgroups. A cgroup can be as tight or as flexible as you like: it can be all of the processes related to one service, or a single process — however many processes you like, from zero upwards. The idea here is that we don't impose a structure on you. One of the guiding principles behind v2 has been that we want you to be able to choose how to use it; we don't want a hierarchy which dictates how you should do things, and we want the easy way to be the correct way.

So: a cgroup is a control group. It's a system for resource management on Linux. A "resource" here means something like CPU, I/O, or memory, and "management" can mean, for example, accounting — knowing how much memory some process in some particular cgroup is using. It can also mean limiting — we hit a particular threshold and take some action to curb it — and it can also mean gentler remediation.
In v2 there's also been some work towards throttling, which I'll go into as well: generally, OOM-killing or whatever is quite violent, so we want to have some remediative actions instead of just straight-up killing things.

The way that cgroup v2 — and v1 — work is essentially that you have this hierarchy at /sys/fs/cgroup. It's just a bunch of files and directories; we don't have a system call interface. There may well be one in future, but we don't currently have one, and the reason this is a good thing is that it's really easy, from some userspace application — even if it's not written in C or C++ and you don't have easy access to C library functions or system calls — to go and find out the state of your system when it comes to cgroups: you just look at some file. I would hope that all of you are using languages which can open a file, make a directory, remove a directory — whatever the new hip kids are using, I hope it still supports those kinds of things.

Each resource interface is provided by a controller — you'll probably hear me use the words controller, resource, and domain somewhat interchangeably. The idea is that you write files that apply to some particular controller; the controller takes the values which you provided and hands them to the kernel, and the kernel makes decisions based on the values you gave it. It's essentially the backing store for everything you input into cgroups.

As mentioned previously, workload isolation is a really big use case for cgroups: you might have many background jobs on a machine, but you don't want them to override the main workload — that's the case we talked about first. There are also other cases. Say you have a tier which runs asynchronous jobs in the background — we have that at Facebook — and some jobs have higher priority than others. Priority is a very abstract concept.
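To make the file-based interface above concrete, here is a minimal sketch. The cgroup name is made up, and on a real system the root would be the cgroup2 mount point /sys/fs/cgroup (and these writes would need root); a scratch directory stands in for it so the commands are safe to try anywhere:

```shell
# Sketch of the cgroup file interface. On a real system CG would be
# /sys/fs/cgroup (the cgroup2 mount point) and you would need root;
# a scratch directory stands in for it here.
CG="$(mktemp -d)"
mkdir -p "$CG/mygroup"             # creating a cgroup is just mkdir
echo 100 > "$CG/mygroup/pids.max"  # configuring it is writing a file
cat "$CG/mygroup/pids.max"         # inspecting it is reading a file
```

Any language that can open a file and make a directory can do the same; there is no system-call interface to speak to.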
But priority usually has something to do with resources: it usually means you expect something to have a smaller memory footprint, or less CPU, or something like that — it's very much up to you what exactly that priority means. You also have shared environments: say you're a VPS provider, and you don't want some particular user to be able to steal all the resources from all your other customers, who are then going to go and leave you a bad review.

So there are a lot of use cases for cgroup v2, and you might ask: hey, my favourite application X already does this, why should I care about cgroups? Well, the answer is: if you've been using any kind of software like that in the last eight years, I would really hope it does it through cgroups. It probably does it transparently, and you never have to talk to cgroups directly, but the backing for all of this is cgroups — that's how they do resource limiting.
So let's go concretely over how this works in version 1. In v1, /sys/fs/cgroup contains the names of all the resources which have a controller — CPU, memory, pids, that kind of stuff — and inside these resource directories there's another set of directories, which are the cgroups themselves. These cgroups exist in the context of that resource, and you put processes into those cgroups. For example, we have the pids one at the bottom: because it's to do with PID resources, the files in those directories relate to things like how many PIDs you can have in that cgroup.

Each resource also has its own hierarchy for resource distribution: in cgroup v1 you have a resource, and then a cgroup hierarchy per resource. So even if "cgroup3" here were called the same as "cgroup1" — say they were both called foo — from the kernel's perspective they have absolutely no relation to each other. This is important because, if you look at how systemd lays out cgroups, for example, you often end up with quite similar-looking hierarchies in the different resources, and you might even be inclined to believe they have some relation to each other. From systemd's perspective I'm sure they do; from the kernel's perspective they do not, and that results in a whole slew of subtle issues which cause problems at scale, which I'll go into in a moment.

cgroups are nested inside each other in this example. When a cgroup is nested inside another, what it typically means is that it can control some limited amount, up to the maximum of its parent: if you have a memory cgroup, and then a child cgroup underneath it, the child can limit up to the maximum of its parent. The hierarchy a cgroup is in determines what kind of files it has — I already mentioned that if you're in the memory hierarchy you can only access files related to memory, like memory.limit_in_bytes, which you can read the limit from, or write to in order to set the limit. And one process isn't in exactly one cgroup in cgroup v1: P2 here is explicitly assigned, for resources A and C, to cgroup1 and cgroup5 respectively, but because we didn't assign it to anything in resource B, it goes to the root cgroup there. The root cgroup is a special concept in cgroups: it's essentially unmanaged territory.
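Before moving on, the v1 layout just described can be sketched as follows. The cgroup names are made up, and a scratch directory stands in for /sys/fs/cgroup (on a real v1 system the kernel creates the control files for you the moment you mkdir, and you would need root):

```shell
# Mimic the cgroup v1 layout: one hierarchy per resource.
R="$(mktemp -d)"                         # stands in for /sys/fs/cgroup
mkdir -p "$R/memory/foo" "$R/pids/foo"   # same name, two hierarchies
echo 104857600 > "$R/memory/foo/memory.limit_in_bytes"  # 100 MiB cap
echo 64        > "$R/pids/foo/pids.max"                 # at most 64 PIDs
# From the kernel's perspective, memory/foo and pids/foo are entirely
# unrelated cgroups that merely happen to share a name.
```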
How exactly it's managed is up to the controller, but the idea is that you don't really get the opportunity to set limits in the root cgroup, because it's just the starting point for distribution of that resource across your whole machine. You do get some kind of accounting there, but in terms of limiting it's basically useless.

So here's a concrete look at how this looks in cgroup v1: you have /sys/fs/cgroup, then the resources — blkio, memory, pids — then the cgroup names, with one cgroup nested inside "bg" for two resources and another nested inside "adhoc" for two resources. Once again, as a reiteration, because it's really important that you get this: from the kernel's perspective, naming has no meaning. If it's in a different resource, even with the same name, there's no relation, and that has all sorts of weird implications.

Here's how it looks in cgroup v2 by comparison. In v2 we actually don't see the resources any more: where version 1 had resources under /sys/fs/cgroup, we now have the cgroups themselves directly under /sys/fs/cgroup. So how do these cgroups know which resources they're supposed to apply to, if they're not inside a resource hierarchy? The answer is that cgroups are global now — there's essentially one global set of cgroups, and you enable resources inside the cgroups. This means we have one hierarchy to rule them all: you write to a special file, you say which particular controllers you want enabled, and we enable them for your cgroup. We don't require you to create different hierarchies each time; instead you create one hierarchy and enable cgroup controllers at will.

So this is how the previous example now looks in cgroup v2: we have the cgroups directly at the top level, and you write to this special file, cgroup.subtree_control, which enables those controllers in the children of that cgroup. Essentially, if you don't enable a controller at some level, but it was enabled the next level up, the children compete freely for that resource. Here's the version 1 hierarchy again for comparison — remember that in version 1, cgroups with the same name have no relation to each other — whereas in v2 we have this unified hierarchy, and you enable resources for a cgroup's children by writing "+memory +pids +cpu +io" and so on to cgroup.subtree_control. When you do this, the files appear in those directories instantaneously. Another thing to mention: in real life you also need to enable the memory, pids, and io controllers at the top level for this to work, but for the sake of simplicity I've left that out here.

So the fundamental differences are, obviously, the unified hierarchy: resources apply to cgroups now, instead of cgroups applying at some point in a per-resource hierarchy. This is very important for some kinds of common operations in Linux. For example, page cache writeback: writebacks transcend one particular resource — they happen across a whole bunch of different resources — and we need to be able to consider those actions together, in unison, to do any reasonable limiting or take other actions. We also have granularity at the thread group ID level, not the thread ID level. This is a contentious point, but it's important: in cgroup v1 you could put different threads from the same process into separate cgroups. That has a whole bunch of weird implications — for example, people would put different threads from the same process into different memory cgroups, and I don't know how that's supposed to work, since you have basically the entire memory shared between those two cgroups.
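Putting the v2 mechanics described above together in one sketch — names made up, with a scratch directory standing in for a real cgroup2 mount, where the kernel would materialise the memory.* and pids.* files in each child the moment you write to cgroup.subtree_control:

```shell
# Mimic the cgroup v2 flow: one hierarchy, controllers enabled per level.
R="$(mktemp -d)"                             # stands in for /sys/fs/cgroup
echo "+memory +pids" > "$R/cgroup.subtree_control"      # enable for children
mkdir -p "$R/foo"
echo "+memory +pids" > "$R/foo/cgroup.subtree_control"  # and for foo's children
mkdir -p "$R/foo/bar"
# On a real system, bar/ now contains memory.* and pids.* files, and a
# process joins it by writing a PID:  echo $$ > "$R/foo/bar/cgroup.procs"
```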
I know people have done insane things with cgroups, and the main point is that we want to guide people towards reasonable operation. It's not that these people are stupid — that's not the problem — it's that cgroup v1 was quite over-complicated, so it made people do insane things. Limiting at the process level gets us a much more reasonable approximation of what people generally want. Also, without extensive cooperation, even for resources where it would in theory make sense to have different threads from the same process under different limits, you have to have some way to communicate which thread of your process is doing what, and there's no standardised way to do that in Linux. Sure, you can set the comm of your thread to some value and then look at that value somewhere, but it's not standardised, it's really hard to reason about, and it usually doesn't act in any reasonable way.

In general there's been a focus on simplicity and clarity in v2 over ultimate flexibility. v1 was invented at the dawn of containerisation: people didn't know exactly what they wanted, they just knew that they wanted something, and they wanted it now. So v1 was a solution to that problem, and v2 is a more considered approach to the problems we now know for sure that we have.

Another new feature in v2 is the addition of the "no internal process" constraint. This means that cgroups which contain processes can't also have controllers enabled for child cgroups below them — in simpler words, these red cgroups either have to have no processes, or they have to have no controllers enabled, in that part of the hierarchy. This is for a few reasons; mostly, it's hard to reason about how the alternative should act. In v1 this was allowed, and the problem is that you then have two different types of objects competing against each other.
Say you have processes in cgroup I, and also some child cgroups underneath I: now you have processes in I competing against cgroups which are its children, and you have to make some kind of judgment about how to weigh processes against cgroups. Maybe you consider each process its own cgroup; maybe you consider them together as a separate implicit cgroup. It's quite hard to reason about, and for most cases the better solution is simply to create another cgroup — so this is another guide towards helping people create a sane hierarchy. The root cgroup is a special case: the controllers themselves have to decide how they're going to handle resources in the root.

Now, clearly, breaking the API is a big deal. The fact that we wanted to create v2, instead of just improving v1, obviously needs some good reasoning. v1 works okay for basic situations, but it gets exponentially complicated as your setup gets more and more complex. As I mentioned, in v1 design often followed implementation, and trying to rework kernel APIs after the fact is really, really hard: you can't change the fundamental nature of kernel contracts which people rely on day-to-day in production — that's just not something you can do. Even for the parts which were designed up front, the use cases for containers and cgroups were not really well fleshed out yet: cgroups originally started out covering only CPU, and then grew and grew quite organically, and it was generally hard to gauge at the time how they would be used. So this was an opportunity to redesign them and make them work how we actually think they should. To fix these kinds of fundamental issues you need an API break, and that's why v2 was created.

So I'm going to go over some of the actual practical improvements, because I've talked a lot about the theory of how we designed it, but I also want to go over why we've designed it in that way and what that actually means.
Okay, pop quiz: when you write to a file in Linux, what happens? It's not a trick question. You get a file descriptor — okay, not quite what I was after. Does it write directly to the disk? So where does your data go? I got about five different answers there. What happens is this: you write a dirty page into the page cache. You now have some set of dirty pages, and your write syscall has come back and said everything's great. From your application's perspective you can go on pretending it was written to disk — "my write succeeded, so it must be on disk" — and you can hold this wonderful class of beliefs, but ultimately it's not been written to disk yet. It's still sitting in memory somewhere, and if you shut the machine down right now, things are going to go very badly.

So there are multiple operations to consider here. First, your write syscall happens, dirties some pages, and returns to you; then, later, some kernel flusher thread comes along and says: okay, by some standard like the dirty ratio, I've decided now is the time to flush these out to disk. In v1 we don't have any tracking of this: if you wrote dirty pages, we don't know where they came from by the time we flush them to disk, so that I/O gets accounted to the root cgroup, and we can't account it to your process or your cgroup, simply because it wasn't tracked where the page came from. In v2 it is tracked, and we can actually count it towards your limits, and we can also make reasonable decisions like "you have I/O contention, and this is what I should do about it" or "you have memory contention, and this is what I should do about it" when you're doing page cache writeback.
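You can watch this happen on any Linux box: the write below returns long before the data is durable, and the kernel's Dirty counter in /proc/meminfo shows pages waiting for writeback (the exact numbers will vary, and may already be small if the flusher got there first):

```shell
# write() dirties pages in the page cache and returns; a kernel flusher
# thread writes them back to disk later (or `sync` forces it now).
f="$(mktemp)"
dd if=/dev/zero of="$f" bs=1M count=8 2>/dev/null  # returns almost instantly
grep '^Dirty:' /proc/meminfo                       # pages awaiting writeback
sync                                               # force writeback to disk
rm -f "$f"
```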
v2 is also generally better integrated with other subsystems. Most of the actions we could take based on thresholds in version 1 were crude, or — in the case of the memory subsystem — quite violent. You set a limit with memory.limit_in_bytes, your application has a momentary spike in memory usage, and the OOM killer comes along and says "I'm going to kill you". That was the standard method of dealing with things, and processes don't particularly enjoy being killed with SIGKILL — maybe there are some sadistic processes out there which like that, but ultimately it's not a very good way of limiting resource usage. What you really want is to tell the application to calm down and stop allocating memory, or, in the case where you can't tell it to calm down, to tell the operating system: hey, that one's gone rogue, I'd like you to take some action against it — but that action doesn't have to be slaying it where it stands. I'd like to think that in human society we've gone past the point where the penalty for any kind of failure is instantaneous death.

So in cgroup v2 we have generally better thresholds. We still have memory.max, which kills your process, but we also have a new thing called memory.high, and what memory.high does is: when you pass this threshold, we start to do throttling and reclaim on every single memory allocation. When you go over memory.high and you want to do another malloc — grab some more memory — we break into a separate path in the kernel and say: I'd like to dial this user back, so I'm going to go to the tail of the inactive list and start reclaiming pages.
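Configuring that behaviour is, again, just writing files. Here is a hedged sketch with a made-up cgroup name, using a scratch directory in place of a real cgroup2 mount (where these writes would need root and an enabled memory controller):

```shell
# v2 memory knobs: memory.max is the hard limit (exceeding it can invoke
# the OOM killer); memory.high is the soft one (past it, allocations are
# throttled into direct reclaim instead of the process being killed).
R="$(mktemp -d)"                      # stands in for /sys/fs/cgroup
mkdir -p "$R/webserver"
echo $((8 * 1024 * 1024 * 1024)) > "$R/webserver/memory.max"   # 8 GiB hard cap
echo $((6 * 1024 * 1024 * 1024)) > "$R/webserver/memory.high"  # throttle past 6 GiB
```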
If we fail to reclaim any pages, it's still kind of good, because you spent a while scanning the page cache, which slowed your application down in a way that's transparent to it. If we do manage to reclaim pages, we also win, because we got some memory free again. This is generally a much saner way of doing things, and using it on web servers was a big win when we see these spikes in resource usage. So, notifications. Notifications are one of the more contentious features of cgroups, since they usually end up being abused (people like to blame systemd for that). Notifications are essentially a way to say "something in my cgroup changed state". It could be "I have no more processes in my cgroup, they've all finished", or "one of my processes ran out of memory and I'm going to take some action based on that". Ultimately it's a way to get information about what is happening in your cgroup. systemd uses this, for example, to track which processes are running, and the state of your system and the services you're running. The problem is that in v1, for release notifications, the notifications sent when your cgroup no longer has any processes, which for example means they all exited, you have to designate what's called a release agent. The release agent is just a path to an executable: you tell the kernel "when this cgroup has no more processes, go and execute this thing with these arguments". The problem is, if you're using cgroups heavily, say you have cgroups expiring a thousand times a second, you're now also forking a thousand times a second, which is pretty bad. It's generally pretty expensive, and it doesn't make a whole
lot of sense, since the rest of the cgroup API uses slightly more sane methods, like eventfd. So now we have inotify support everywhere. Given how systemd tracks cgroups, it makes sense that this is supported via inotify, and we still have the eventfd support, so you can poll and find out what's happening. The idea now is that you can have one process monitoring everything; you don't have to fork a new process every time an event is created. This is just a straight upgrade, really. Utility controllers are another thing. Most controllers in cgroups are related to some particular resource: memory, CPU, I/O. There are others, however, like perf_event or freezer (which I'll come back to in a second), which are not related to some resource; you put processes in that cgroup based on some action you want to take on them as a group. The idea is that you group them together so some userspace utility can take action based on that group. perf is a tool for performance monitoring and tracing; I guess quite a few people have heard of it. The way it works is, when you say you want to monitor a certain set of cgroups, you give it a cgroup path, and it says: OK, here is a particular set of processes which I'm going to map into my own hierarchy. So now you have a path under /sys/fs/cgroup, and inside there is a completely separate cgroup hierarchy which only relates to perf. This usually doesn't make a whole lot of sense: usually what people want to do is monitor an existing cgroup hierarchy, not create a new one. So people ended up resorting to tons of hacks, like running a tool to copy one hierarchy over to another, and you end up with all these race conditions. It was horrible; it was really, really bad.
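The v2 notification mechanism described above boils down to watching one small file: cgroup.events is a flat key-value file (for example, the line "populated 1" means the cgroup or one of its descendants still has live processes), so a single watcher can poll or inotify-watch it instead of the kernel forking a release agent per event. A hedged sketch of the consuming side; the file contents here are synthetic:

```python
def parse_cgroup_events(text):
    """Parse a cgroupv2 cgroup.events file: one 'key value' pair per line."""
    return {key: int(value)
            for key, value in (line.split() for line in text.splitlines() if line)}

# On a real system you would read /sys/fs/cgroup/<group>/cgroup.events and
# re-read it whenever inotify (or poll) reports that it was modified.
events = parse_cgroup_events("populated 0\n")
empty = events["populated"] == 0  # True: every process in the group has exited
print(empty)
```

One long-lived watcher handling thousands of these events per second replaces thousands of release-agent forks per second.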
So now, having a unified hierarchy means we don't have to sync anything: you only have one hierarchy, so there's no way this could possibly go wrong, touch wood. There's also a lot of inconsistency between controllers. This usually comes in two forms. First, you have inconsistent APIs between controllers which do exactly the same thing. Both CPU and I/O are essentially weight based, or share based: you give out a certain amount of some resource relative to some other particular cgroup. But the APIs were completely different: you had to learn two APIs to do one thing, which is really not ideal. So there has been a lot of focus on unifying the APIs and unifying the naming, which was generally a bit of a crapshoot before. Now we have the opportunity to rethink those names and standardise them a bit more, so v2 is generally more intuitive up front. Another one is inconsistent cgroup semantics. For example, most cgroups inherit their parents' limits: when you create a child of some particular cgroup, it usually can only use up to its parent's limits. But some controllers didn't do that; some controllers did their own thing. For some controllers this whole idea of a hierarchy was an imaginary thing: they just created a new cgroup and didn't care where it was. It was all a bit of a crapshoot, and now, with one unified hierarchy, it's more difficult to mess that up, I guess. Excess flexibility also contributed to a whole bunch of API problems. For example, when memory limits were first created, they only limited a couple of types of memory, and they were in this
file, memory.limit_in_bytes. Then, as more and more memory types were added, they ended up getting their own files, one by one. So you ended up with memory.limit_in_bytes, memory.kmem.limit_in_bytes, memory.kmem.tcp.limit_in_bytes, memory.memsw.limit_in_bytes. The really bad thing about this is that, yes, you have very granular control, but it's not very useful. Say I want to set a limit on the maximum amount of TCP buffers, which is set with memory.kmem.tcp.limit_in_bytes, and say I have 10 gigabytes of page cache free. If I've said "you should only get X amount of TCP buffers", and you go one over, we come after you, even though you had 10 gigabytes of page cache free. That's not a reasonable way of operating. Most people don't care that you allocated one TCP buffer too many; they want to express some kind of idea about the overall memory use, and reasonable unified limits are the only sane way to approach that. So again, this is a trade-off in favour of usability over ultimate flexibility. If you do want to limit some particular kind of resource, the pids controller is a very good example. In the early days of cgroups it was considered that maybe we could limit the number of pids by limiting certain types of kernel memory, but that turns out to be really, really hard. The way that was fixed is that we now have a pids controller, and it specifically controls that resource. So if you do want to do a very specific kind of limiting, whether it's TCP buffers or something else, a new controller is the reasonable way to do it. If you go to facebook.com right now, there is a one in ten chance you're going to hit a server running cgroupv2. We are running cgroupv2 on tens of thousands of machines now.
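The pids controller mentioned above is also a good illustration of how small a dedicated controller's interface can be: a single file, pids.max, which takes either a number or the string "max" for no limit. A sketch, again with a scratch directory standing in for a real cgroup directory:

```python
import os
import tempfile

def set_pids_max(cgroup_dir, limit):
    """Cap the number of tasks in a v2 cgroup. limit is an int, or the
    string 'max' for no limit; forks beyond the cap fail."""
    with open(os.path.join(cgroup_dir, "pids.max"), "w") as f:
        f.write(str(limit))

d = tempfile.mkdtemp()
open(os.path.join(d, "pids.max"), "w").close()  # a real cgroup already has this
set_pids_max(d, 512)
pids_max = open(os.path.join(d, "pids.max")).read()
print(pids_max)
```

One resource, one controller, one obvious knob, rather than trying to express the limit indirectly through kernel memory accounting.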
We're investing heavily in cgroupv2 for a number of reasons. My main concern, as I mentioned at the beginning, is limiting the failure domains between services. I really care a lot about making sure that we don't have cascading failures or anything like that on a machine, and also about being able to manage the resource allocation in your data centre. At Facebook's scale this is really important: if we can squeeze out a little bit more resource efficiency at the data centre level, that's a really big win. We manage cgroupv2 with systemd; one of my teammates, Davide Cavalca, who's sitting back there, did a talk about this at systemd.conf last year, called, I believe, "Deploying systemd at scale" (correct me if I'm wrong; yes, it was). We're also a really big contributor to the core of cgroupv2: we have two of the core maintainers working at Facebook, and we will continue to drive innovation here; this is a big thing we work on right now. I already mentioned that cgroupv2 is kind of new. It has been usable for a little while now and the core is all very stable, but that doesn't mean there isn't still work to be done: a lot of this is building the building blocks for future work. The core APIs are stable, but there's definitely functionality to be worked on. When thinking about cgroups, most people think of three things: I/O, memory, and CPU. Those are pretty much the biggest ones, and two out of three of those are merged. The pids controller is also merged. The CPU controller is very important, but there have been some disagreements with the CPU subsystem maintainers about how to merge it: disagreements about the new internal process constraint, and also about having process granularity instead of per-thread granularity, and things like that. There's a very juicy, drama-filled thread at that
link; as old Linux kernel mailing list threads go, it's probably still better than usual. We also may end up with some thread-based API for the particular kinds of thread operations where that makes sense; that's in the works. A big bet we're making right now: one thing Linux has never really had is a good metric for memory pressure. We have many related metrics, like the amount of memory you have free, or the amount of memory used, and you can also look at things like certain kinds of page scans, but it's all very heuristic, and ultimately none of these metrics prove that you are actually encountering memory pressure, because they can also happen in a bunch of normal scenarios. Our proposed measure is to track page refaults: we track pages which are consistently refaulted back into the inactive queue. To explain how this works: you have the inactive set, which is the pages the kernel considers are probably not being actively used by any process; then you have the active set, which it considers more likely to be used by some process; and it's essentially one big list. When you have a page fault, the page goes to the head of the inactive list. If a page is accessed again, it gets moved to the head of the active list, which means it's protected from reclaim. When we do a reclaim, we go from the tail of the inactive list, taking the pages we consider least likely to be used. So what happens if we keep faulting pages in, and they get so far towards the tail that we keep reclaiming them, and then they fault straight back in again? That probably means we have too many pages in flight at once for our system to handle: we end up pushing them off the edge so fast that we simply don't have the resources to deal with this number of pages.
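The active/inactive behaviour described above can be modelled in a few lines. This is a toy simulation of the two-list scheme and refault counting, not the kernel's actual implementation: pages fault into the inactive list, a second access promotes them to the active list, reclaim takes the inactive tail, and a page that comes back after being reclaimed counts as a refault.

```python
from collections import OrderedDict

class LRUListPair:
    """Toy model of the kernel's active/inactive page lists with refault
    tracking. Evicted pages leave a 'shadow' entry so we can tell when a
    reclaimed page faults straight back in."""

    def __init__(self, inactive_size, active_size):
        self.inactive = OrderedDict()  # newest entries at the end
        self.active = OrderedDict()
        self.inactive_size = inactive_size
        self.active_size = active_size
        self.evicted = set()           # shadow entries for reclaimed pages
        self.refaults = 0

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)         # still hot: refresh position
        elif page in self.inactive:
            del self.inactive[page]               # second access: promote
            self.active[page] = True
            if len(self.active) > self.active_size:
                cold, _ = self.active.popitem(last=False)  # demote coldest
                self._fault_in(cold)
        else:
            if page in self.evicted:              # reclaimed, now back again
                self.refaults += 1
                self.evicted.discard(page)
            self._fault_in(page)

    def _fault_in(self, page):
        self.inactive[page] = True
        if len(self.inactive) > self.inactive_size:
            victim, _ = self.inactive.popitem(last=False)  # reclaim the tail
            self.evicted.add(victim)

# A working set that fits in the lists: no refaults after warm-up.
fits = LRUListPair(inactive_size=4, active_size=4)
for _ in range(3):
    for page in ("a", "b", "c"):
        fits.access(page)

# A working set that is too big: pages are reclaimed and fault right back in.
thrash = LRUListPair(inactive_size=4, active_size=4)
for _ in range(2):
    for page in range(12):
        thrash.access(page)

print(fits.refaults, thrash.refaults)
```

The refault counter stays at zero as long as the working set fits, and climbs as soon as the system is forced to reclaim pages it is about to need again, which is exactly what makes it a plausible pressure signal.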
That is probably not a bad metric for memory pressure, and it's one which is currently being worked on as part of the cgroupv2 effort, because we do want metrics around memory pressure, not just memory usage, which is only tangential to the thing you really want to know. (Sorry, was there a question? I thought I heard a question.) What the first point here means is that we have tracking of memory and I/O, but we don't have tracking for CPU yet. You spend some time on CPU waiting for network packets to come in, and we don't know who they're for yet, because they haven't been routed yet, so we can't account for that. We also can't yet account for the CPU we spend doing page cache writeback. That's something which is going to take quite a bit of effort, but it's something we'll definitely work on at some point. As for the second point, I already mentioned that the general idea is to have better metrics for memory pressure, because right now we only have something tangential. Those of you who used v1 will probably also know about the freezer. The freezer is an alternative to killing: you can freeze some set of processes in their current state, and then go and decide, say, "I want to raise the memory limit", or "I want to kill them", or "I want to start some new processes". It's essentially a way of freezing them in time and having some other process come and decide what to do about it. In cgroupv1 this didn't work well at all. One of the most common things you might want to do with a frozen cgroup is get a stack trace and work out what the processes were doing, say why they kept allocating memory, or whatever it was you froze them for. But it's a very common situation that if you tried to attach gdb, it would end up in D state, which is not really
the ideal result if you want to find out the stack of some process you froze; that's generally the complete opposite of what you want. The reason is that the freezer implementation in v1 doesn't guarantee that we stop anywhere reasonable: we often stop with a kernel stack that makes absolutely no sense. In v2 the idea is to have a more SIGSTOP-style mechanism. SIGSTOP is very well defined, and where it stops is very well defined too, so it's a more reasonable way to stop processes, and the v2 implementation of the freezer will follow those kinds of semantics. I've talked a lot here about trying to sell you cgroupv2, but I should probably tell you how to get it at some point during this talk, so hopefully you're interested in trying it out yourself. Here's what you need to get started with version 2. First, you need a kernel of about 4.5 or later. Before that we did have a developer flag, which you can go and find; I won't tell you what it is, because it'll eat you if you try to use it without knowing what you're doing. I really wouldn't recommend using it before 4.5, which was the first point where we had a stable API. Once that's done, there are more or less two things to do: you need to turn off all of the controllers for v1, and you need to mount the filesystem for v2. Typically you want your init system to do this; for systemd there's a flag for it, and you basically put both of these on the kernel command line. But if you're crazy, or you just want to try it yourself, you can also mount it manually, and cry when things break. So if you're interested in hearing more about cgroups, come talk to me. I'm happy to go over anything I've been talking about, on v1 or v2, and to go over any questions you might have. And if you've used v1 in the past, which I guess many of you have, and you found it lacking in some areas, please do try out v2 and let us know what you think. Thanks. [Applause]