7 years of cgroup v2: the future of Linux resource control
Formal metadata
Title: 7 years of cgroup v2: the future of Linux resource control
Number of parts: 542
License: CC Attribution 2.0 Belgium: You may use and modify the work or its content for any legal purpose and reproduce, distribute, and make it publicly available in unchanged or changed form, provided you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/61812 (DOI)
Transcript: English (automatically generated)
00:07
Okay. I think we're ready to start. Oh, excellent. This time it worked perfectly. Thank you so much. Chris is going to talk about cgroup v2, seven years of cgroup v2 in the kernel,
00:21
a very exciting time, and the future of Linux resource control. Take it away. Hello, everybody. Oh, yes, please go on. Thank you. That's it. I'm done. Goodbye. Hello. I'm Chris Down. I work as a kernel engineer at Meta. I work on the kernel's memory management subsystem.
00:42
In particular, I'm a contributor to cgroups, which are one of the things which underpin our modern love of containers. I'm also a maintainer of the systemd project, so there's two things on this slide which you can hate me for. Most of the time I'm thinking about how we can make Linux just a little bit more reliable, just a little bit more usable at scale. We have a million plus machines. We can't just buy more RAM.
01:00
It's not really a thing we can do. So we need to extract the absolute maximum from every single machine. Otherwise there's a huge loss of capacity that could result. So that's the kind of thing I want to talk to you about today: how, over the last seven years, we have done this at Meta, how we've improved reliability and capacity and extracted more efficiency. At Meta and in industry we are increasingly facing this kind of problem
01:23
where we can't effectively solve scaling problems just by throwing hardware at the problem. We can't construct data centers fast enough. We can't source clean power fast enough. We have hundreds of thousands of machines and we just can't afford to waste capacity because any small loss in capacity on a single machine translates to a very large amount at scale. Ultimately what we need to do is use resources more efficiently
01:43
and we need to build the kernel infrastructure in order to do that. Another challenge that we have is that many huge site incidents for companies like us and companies of our size are caused by lacking resource control. Not being able to control things like CPU, IO, memory and the like is one of the most pervasive causes of incidents and outages across our industry.
02:03
And we need to sustain an initiative industry-wide in order to fix this. So how does all of this relate to this cgroups thing in the title? So cgroups are a kind of mechanism to balance and control and isolate things like memory, CPU, IO, things that you share across a machine,
02:20
things that processes share. And I'm sure if you've operated containers before, which I'm going to assume that you have, judging by the fact you're in this room, otherwise you may be lost and looking for the AI room, you know, every single modern container runtime uses this. Docker uses it, CoreOS uses it, Kubernetes uses it, systemd uses it. The reason they use it is because it's the most mature platform to do this work
02:42
and it solves a lot of the long-standing problems which we had with kind of classic resource control in the form of ulimits and things like that. cgroups have existed for about 14 years now and they have changed a lot in that time. Most notably seven years ago in kernel 4.5 we released cgroup v2. I gave a whole talk around the time when that happened
03:01
on why we were moving to a totally new interface, why we weren't just iterating on the old interface. And if you're interested in a really in-depth look at that, then here's a talk which you can go and take a look at. But the most fundamental change really is that in cgroup v2 what happens is that you enable or disable resources in the context of a particular cgroup.
03:20
In cgroup v1 what you have is a hierarchy for memory, a hierarchy for CPU, and the two will never meet. Those two things are completely independent. systemd, when it creates things in cgroup v1, will name them the same. They get called something.slice or something.service, but they have no relation to each other across resources. But in cgroup v2 you have just a single cgroup
03:41
and you enable or disable resources in the context of that particular cgroup so you can enable, say, memory control and IO control together. That might seem like an aesthetic kind of concern, but it's really not. Without this major API change we simply cannot use cgroups to do complex resource control.
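To make that concrete, here is a minimal Python sketch of that single-hierarchy model, assuming a cgroup2 filesystem mounted at /sys/fs/cgroup, root privileges, and a kernel with the memory and io controllers available; the cgroup name "demo" and the values are only examples:

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup"  # assumes the cgroup2 hierarchy is mounted here

def write(path, value):
    # cgroup v2 control files are plain text; writing to them applies the setting
    with open(path, "w") as f:
        f.write(value)

# Let children of the root cgroup use the memory and io controllers.
write(os.path.join(CGROUP_ROOT, "cgroup.subtree_control"), "+memory +io")

# One cgroup, with both resources configured in the same place -- something
# cgroup v1's separate per-resource hierarchies could not express.
demo = os.path.join(CGROUP_ROOT, "demo")
os.makedirs(demo, exist_ok=True)
write(os.path.join(demo, "memory.high"), str(512 * 1024 * 1024))  # reclaim/throttle above 512 MiB
write(os.path.join(demo, "io.weight"), "default 50")              # proportional IO weight

# Move the current process (and anything it spawns from now on) into the cgroup.
write(os.path.join(demo, "cgroup.procs"), str(os.getpid()))
```

The point is simply that memory and IO control now live in the same directory, for the same group of processes, so pressure on one resource can be reasoned about together with the other.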
04:00
Take the following scenario. Memory starts to run out on your machine. So when we start to run out of memory on Linux or on pretty much any modern operating system, what do you do? Well, you try and go and free some up. So we start to reclaim some page caches. We start to reclaim maybe some anonymous pages if we have swap. And this results in disk IO.
04:20
And if we're particularly memory bound and it's really hard to free pages and we're having to walk the pages over and over and over to try and find stuff to free, then it's going to cost a non-trivial amount of CPU cycles to do so. Looking through available memory to find pages which can be free can be extremely expensive on memory bound workloads. On some highly loaded or memory bound systems
04:40
it can take a double digit amount of CPU from the machine just to do this walking. It's a highly expensive process. And without having the single resource hierarchy we cannot take into account these transfers between the different resources, how one leads to another, because they're all completely independent. If you've been in the containers dev room before, you're probably thinking,
05:00
I've seen this guy before and I think he's given this exact talk about three years ago. I'm sure some of you are thinking that already. Well, the company name isn't the only thing which has changed since 2020. Also, some cgroups things have changed since 2020. And obviously I don't want to rehash the same things over and over. I don't want to bore you. So this talk will mostly be about the changes since the last time I was here in 2020 with just a little bit of context setting, just a little bit.
05:21
This talk is really about the process of getting resource isolation working at scale. It's what it needs to happen in production, not just in a theoretical concern. The elephant in the room, of course, is COVID. The last three years have seen pretty significant changes in behavior due to COVID, especially for a platform like Facebook,
05:42
which we own, of course. Usage was up by about 27% over what you would usually expect. And this came at a time where not only are you seeing increased demand, but you literally can't go out and buy memory. You can't go out and buy more CPUs. You can't go out and buy more disks because there's a shortage, because there's COVID. So what we really needed was to make more efficient use
06:01
of the existing resources on the machine. We needed an acceleration of existing efforts around resource control in order to do that, to make things more efficient. Now, almost every single time that I give... This sounds like a personal point of concern. Every time I give this talk, somebody on Hacker News comments, why don't you just get some more memory?
06:21
Now, I don't know how trivial people in this room think that is when you've got several million servers, but it is slightly difficult sometimes. For example, there's a huge amount of cost involved there, and not just in money, which is indeed substantial, and I'm very glad it's not coming out of my bank account, but also in things like power draw, in things like thermals, in things like hardware design trade-offs. Not to mention during COVID, you know,
06:41
you just couldn't get these kind of... You couldn't get a hard drive. You couldn't get some memory. You could go down to your local Best Buy and get one, but that's about it. So, not really an option. So, here's a simple little proposition for you for anyone in the room who wants to be brave. How do you view memory usage for a process in Linux?
07:03
Oh, come on. Free! My man said free. Oh, Lord. This was a trap. So, I appreciate it though. Big up, my man. So, yeah, so free and the like really only measure like one type of memory. They do have caches and buffers on the side,
07:22
but the thing is... Okay, so for free or for ps, which was shouted at the back, you know, you do see something like the resident set size and you see some other details, and you might be thinking, hey, you know, that's fine. Like, I don't really care about some of the other things. That's the bit which my application's really using. For example, we don't necessarily think that our programs rely on caches and buffers to operate in any sustainable way,
07:42
but the problem is the answer for any sufficiently complex system is almost certainly that a lot of those caches and buffers are not optional. They are basically essential. Let's take Chrome just as a facile example. The Chrome binary's code segment is over 130 megs. He's a chunky boy. He is. He's a big boy.
08:01
We load this code into memory. We do it gradually. We're not maniacs. We do it gradually. But, you know, we do it as part of the page cache. But what if we want to execute some particular part of Chrome? You know, this cache isn't just nice to have, the cache that has the code in it that runs this particular part of Chrome. We literally cannot make any forward progress without that part of the cache.
08:20
And the same goes for caches for the files you're loading, especially for something like Chrome, which probably does have a lot of caches. So eventually those pages are going to have to make their way into the working set. They're going to have to make their way into main memory. In another particularly egregious case, we have a daemon at Meta, and this daemon aggregates metrics across a machine and it sends them to centralized storage.
08:41
And as part of this, what it does is it runs a whole bunch of janky scripts, and these janky scripts go and collect things across the machine. I mean, we've all got one. We've all got this kind of daemon, where you collect all kinds of janky stuff, and you don't really know what it does, but it sends some nice metrics and it looks nice. And one of the things we were able to demonstrate is, while the team that had this daemon thought that it took about 100 to 150 megabytes to run,
09:04
using the things that we'll talk about in this talk, it actually was more like two gigabytes. So the difference is quite substantial on some things. You could be quite underestimating what is taking memory on your machine. So in cgroup v2, we have this file called memory.current
09:21
that measures the current memory usage for the cgroup, including everything, like caches, buffers, kernel objects, so on. So, job done, right? Well, no. The problem is here that whenever somebody comes to these talks and I say something like, don't use RSS to measure your application, they go and say, oh, we've added a new thing called memory.current
09:42
and it measures everything, great. I'm just going to put some metrics based on that. But it's quite important to understand what that actually means to have everything here, right? The very fact that we are not talking about just the resident set size anymore means the ramifications are fundamentally different. We have caches, buffers, socket memory, TCP memory, kernel objects,
10:04
all kinds of stuff in here, and that's exactly how it should be because we need that to prevent abuse of these resources, which are valid resources across the system. They are things we actually need to run. So, understanding why reasoning about memory.current might be more complicated than it seems comes down to why, as an industry,
10:21
we tended to gravitate towards measuring RSS in the first place. We don't measure RSS because it measures anything useful. We measure it because it's really fucking easy to measure. That's the reason we measure RSS. There's no other reason. It doesn't measure anything very useful. It kind of tells you vaguely maybe what your application might be doing kind of, but it doesn't tell you anything of any of the actually interesting parts of your application,
10:43
only the bits you pretty much already knew. So, memory.current suffers from pretty much exactly the opposite problem, which is it tells you the truth, and don't really know how to deal with that. Don't really know how to deal with being told how much memory application is using. For example, if you set an eight gigabyte memory limit in CRuby 2,
11:03
how big is memory.current going to be on a machine which has no other thing running on it? It's probably going to be eight gigabytes, because we've decided that we're going to fill it with all kind of nice stuff, nice-to-have stuff. There's no reason we should evict that. There's no reason we should take away these nice, you know, kmem caches. No, there's no reason we should take away these slabs, because we have free memory, so why not?
11:21
Why not keep them around? So, if there was no pressure for this to shrink from any outside scope, then the slack is just going to expand until it reaches your limit. So, what should we do? How should we know what the real needed amount of memory is at a given time? So, let's take an example Linux kernel build, for example, which with no limits has a peak memory.current of just over 800 megabytes.
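For reference, a number like that peak comes straight out of the cgroup's own accounting files; here is a minimal Python sketch of reading them, where /sys/fs/cgroup/demo is just a hypothetical cgroup path:

```python
import os

cg = "/sys/fs/cgroup/demo"  # hypothetical cgroup of interest

def read(name):
    with open(os.path.join(cg, name)) as f:
        return f.read()

# Total memory charged to the cgroup: anon, page cache, kernel objects, sockets, ...
current = int(read("memory.current"))
print(f"memory.current: {current / 2**20:.1f} MiB")

# memory.stat breaks that total down, which is far more useful than a bare RSS number.
stat = dict(line.split() for line in read("memory.stat").splitlines())
for key in ("anon", "file", "slab_reclaimable", "slab_unreclaimable", "sock"):
    if key in stat:
        print(f"{key:>20}: {int(stat[key]) / 2**20:.1f} MiB")
```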
11:44
In cgroup v2, we have this tunable called memory.high. This tunable reclaims memory from the cgroup until it goes back under some threshold. It just keeps on reclaiming and reclaiming and reclaiming and throttling until you reach back under. So, right now, things take about four minutes with no limits.
12:00
This is about how long it takes to build the kernel. And when I apply, you know, a throttling, like a reclaim threshold of 600 megabytes, actually, you know, the job finishes roughly about the same amount of time, maybe a second more, with about 25% less available memory at peak. And the same even happens when we go down to 400 megabytes. Now we're using half the memory that we originally used, with only a few seconds more wall time.
12:21
It's a pretty good trade-off. However, if we just go just a little bit further, then things just never even complete. We have to control-C the build, right? And this is nine minutes in, it still ain't done. So we know that the process needs somewhere between 300 and 400 megabytes of memory, but it's pretty error-prone to try and work out what the exact value is.
12:40
So to get an accurate number for services at scale, which are even more difficult than this, because they dynamically shrink and expand depending on load, we need a better automated way to do that. So, determining the exact amount of memory required by an application is a really, really difficult and error-prone task, right? So, senpai is this kind of simple, self-contained tool
13:01
to continually poll what's called pressure stall information, or PSI. Pressure stall information is essentially a new thing we've added in cgroup v2 to determine whether a particular resource is oversaturated. And we've never really had a metric like this in the Linux kernel before. We've had many related metrics. For example, for memory we have things like, you know,
13:21
page caches and buffer usage and so on. But we don't really know how to tell pressure or oversubscription from an efficient use of the system. Those two are very difficult to tell apart, even with using things like page scans or so on. It's pretty difficult. So in senpai, what we do is we use these PSI, pressure stall metrics,
13:41
to measure the amount of time which threads in a particular cgroup were stuck doing, in this case, memory work. So this pressure equals 0.16 thing, kind of halfway down the slide, means that, you know, 0.16% of the time I could have been doing more productive work, but I've been stuck doing memory work. This could be things like, you know, waiting for a kernel memory lock.
14:02
It could be things like being throttled. It could be waiting for a reclaim to finish. Even more than that, it could be memory-related IO, which can also dominate, to be honest, things like refaulting file content into the page cache or swapping in. And pressure is essentially saying, you know, if I had a bit more memory, I would be able to run so much faster, 0.16% faster.
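The pressure numbers being described have a simple text format; a small sketch of reading them on a kernel with PSI enabled, using the system-wide /proc/pressure/memory file (each cgroup exposes the same format in its own memory.pressure file):

```python
def read_memory_pressure(path="/proc/pressure/memory"):
    """Parse PSI output; the same format appears in each cgroup's memory.pressure file."""
    pressure = {}
    with open(path) as f:
        for line in f:
            kind, *fields = line.split()  # "some" or "full", then avg10=/avg60=/avg300=/total=
            pressure[kind] = {k: float(v) for k, v in (fld.split("=") for fld in fields)}
    return pressure

psi = read_memory_pressure()
# avg10 is the percentage of wall time, over the last 10 seconds, that tasks
# were stalled waiting on memory.
print(f"some={psi['some']['avg10']}%  full={psi.get('full', {}).get('avg10')}%")
```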
14:24
So using PSI and memory.high, what senpai does is apply just enough memory pressure on a cgroup to evict cold memory pages that aren't essential for workload performance. It's an integral controller which dynamically adapts to these memory peaks and troughs. An example case being something like a web server, which is somewhere where we have used it.
14:41
When more requests come, we see that the pressure is growing and we expand the memory.high limit. When fewer requests are coming, we see that and we start to decrease the amount of working set which we give it. So it can be used to answer the question, you know, how much memory does my application actually use over time? And in this case, we find for the compile job, the answer is about, like, 340 megabytes or so, and that's fine.
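The real senpai tool is considerably more careful than this, but the core feedback loop it implements can be sketched roughly like so; the cgroup path, target pressure, step size, and floor are all made-up illustration values:

```python
import time

cg = "/sys/fs/cgroup/workload"       # hypothetical cgroup under control
TARGET = 0.1                         # aim for ~0.1% "some" memory pressure (avg10)
STEP = 16 * 1024 * 1024              # adjust memory.high in 16 MiB increments

def read_some_avg10():
    with open(f"{cg}/memory.pressure") as f:
        some = f.readline().split()          # "some avg10=... avg60=... avg300=... total=..."
        return float(some[1].split("=")[1])

def set_high(value):
    with open(f"{cg}/memory.high", "w") as f:
        f.write(str(max(value, 64 * 1024 * 1024)))   # never squeeze below a safety floor

with open(f"{cg}/memory.current") as f:
    high = int(f.read())                     # start from current usage

while True:
    pressure = read_some_avg10()
    # Too little pressure: we are probably holding cold pages, so squeeze a bit more.
    # Too much pressure: we went too far, so back off and give memory back.
    high += STEP if pressure > TARGET else -STEP
    set_high(high)
    time.sleep(10)
```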
15:03
You might be asking yourself, what are the benefits of this shrinking? Like, why does this even matter, to be honest? Surely, like, when you are starting to run out of memory, Linux is going to do it anyway. And you're not wrong. Like, that's true. But the thing is, what we kind of need here is to get ahead of memory shortages, which could be bad, and amortize the work ahead of time.
15:23
When your machine is already highly contended, it's already being driven into the ground and going towards the OOM killer, it's pretty hard to say, hey, bro, could you just, like, give me some pages right now? Like, it's not exactly like what's on its mind. It's probably desperately trying to keep the atomic pool going. So there's another thing as well which is, you know,
15:41
it's pretty good for determining regressions, which is what a lot of people use RSS for, right? Like, this is the way we found out that that daemon was using two gigabytes of memory instead of 150 megabytes of memory. So it's pretty good for finding out, hey, how much does my application actually need to run? So the combination of these things means that senpai is an essential part of how we do workload stacking at Meta.
16:03
And it not only gives us an accurate read on what the demand is right now, but allows us to adjust stacking expectations depending on what the workload is doing. This feeds into another one of our efforts around efficiency, which is improving memory offloading. So traditionally on most operating systems, you have only one real memory offloading location,
16:21
which is your disk. Even if you don't have swap, that's true, because you do things like demand paging, right? You page things in gradually, and you also have to evict and get things in the file cache. So we're talking also here about, like, a lot of granular intermediate areas that could be considered for some page offloading for infrequently accessed pages,
16:40
but they're not really so frequently used. Getting this data into main memory again, though, can be very different in terms of how difficult it is, depending on how far up the triangle you go, right? For example, it's much easier to do it on an SSD than a hard drive, because hard drives don't, well, they're slow, and they also don't tolerate random, like, head-seeking very well.
17:02
But there are more gradual things that we can do as well. For example, one thing we can do is to start to look at strategies outside of hardware. One of the problems with the duality of either being in RAM or on the disk is that even your disk, even if it's quite fast, even if it's flash, it tends to be quite a few orders of magnitude slower than your main memory is.
17:24
So one area which we have been heavily invested in is looking at what we might term warm pages. In Linux, we have talked a lot about hot pages and cold pages if you look in the memory management code, but there is, like, this kind of part of the working set which, yes, I do need it relatively frequently, but I don't need it to make forward progress all the time.
17:42
So zswap is one of these things we can use for that. It's essentially a feature of the Linux kernel which compresses pages which look like they will compress well and are not too hot into a separate pool in main memory. We do have to page fault them back into main memory again if we actually want to use them, of course, but it's several orders of magnitude faster
18:02
than trying to get it off the disk. We still do have disk swap for infrequently accessed pages. There tends to be quite a bit of cold working set as well, but, you know, this is kind of like this tiered hierarchy where we want to have hot pages in main memory, warm pages in zswap, and kind of cold pages in disk swap.
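On the configuration side, zswap is controlled through module parameters; a small sketch of inspecting and enabling it, assuming a kernel built with zswap (the pool size and compressor here are just example choices, and which compressors are available depends on your kernel):

```python
import os

ZSWAP = "/sys/module/zswap/parameters"   # present when the kernel is built with zswap

def show():
    # Dump the current zswap settings, e.g. enabled, compressor, max_pool_percent.
    for param in sorted(os.listdir(ZSWAP)):
        with open(os.path.join(ZSWAP, param)) as f:
            print(f"{param} = {f.read().strip()}")

def enable(max_pool_percent=20, compressor="lzo"):
    # Cap the compressed pool at a fraction of RAM and pick a compressor.
    with open(os.path.join(ZSWAP, "enabled"), "w") as f:
        f.write("1")
    with open(os.path.join(ZSWAP, "max_pool_percent"), "w") as f:
        f.write(str(max_pool_percent))
    with open(os.path.join(ZSWAP, "compressor"), "w") as f:
        f.write(compressor)

show()
```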
18:21
One problem we had here was that even when we configured the kernel to swap as aggressively as possible, it still wouldn't do it. If you've actually looked at the swap code (and I've had the unfortunate misery of working on it), you'll learn that the swap code was implemented a very long time ago by the people who knew what swap did and how things worked, but none of them are around to tell us what the hell
18:42
anything means anymore and it's very confusing. So I can't even describe to you how the old algorithm works because it has about 500 heuristics and I don't know why any of them are there. So for this reason, you know, we try to think how can we make this a little bit more efficient? We are using non-rotational disks now. We have Zswap, we have flash disks, we have SSDs.
19:00
We want to make an algorithm which can handle this better. So from kernel 5.8, we have been working on a new algorithm which has already landed. So first we have code to track all swap-ins and cache misses across the system. So if, for some cache page, we're having to page fault and evict and page fault and evict and page fault and evict over and over again, what we want to do is try and page out a heap page instead.
19:23
If we're unlucky and this heap page actually turns out to be hot, then you know no biggie, like we've made a mistake, but we'll try a different one next time. We do have some heuristics to try and work out which one is hot and which one is not, but they are kind of expensive, so we don't use a lot of them. However, you know, if we are lucky
19:41
and the heap page does stay swapped out, then that's one more page which we can use for file caches and we can use it for other processes. And this means that we can engage swap a lot more readily in most scenarios. Importantly, though, we are not adding IO load. This doesn't increase IO load or decrease endurance of the disk. We are just more intentional in choosing how to apply the IO.
20:03
It doesn't double up. We only trade one type of paging for another, and our goal here is to reach an optimal state where the optimal state is doing the minimum amount of IO in order to sustain workload performance. So ideally what we do is have this tiered model of, you know, like I said, main memory, zswap, and swap on disk.
20:21
This is a super simple idea compared to the old model. The old algorithm has a lot of kind of weird heuristics, as I mentioned, a lot of penalties, a lot of kind of strange things. In general, it was not really written for an era where SSDs exist or where zswap exists, so it's understandable that it needed some care and attention.
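One way to sanity-check what swapping decisions look like on your own machine is to watch the kernel's swap-in and refault counters over time; a rough sketch, noting that the workingset_refault_anon/_file split only exists on newer kernels (older ones expose a single workingset_refault counter):

```python
import time

# Counters of interest: swap traffic, major faults, and refaults of previously
# evicted pages (i.e. evictions we got wrong and had to pay for again).
KEYS = ("pswpin", "pswpout", "pgmajfault",
        "workingset_refault_anon", "workingset_refault_file", "workingset_refault")

def vmstat():
    with open("/proc/vmstat") as f:
        stats = dict(line.split() for line in f)
    return {k: int(stats[k]) for k in KEYS if k in stats}

before = vmstat()
time.sleep(60)
after = vmstat()
for key in after:
    print(f"{key}: {after[key] - before[key]} events/min")
```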
20:40
So what were the effects of this change in prod? Like, what actually happened? So on web servers, we not only noticed an increase in performance, but we also noticed a decrease in heap memory by about two gigabytes or so out of about 16 gigabytes total. The cache grew to fill this newly freed space, and it grew by about two gigabytes, from about two gigabytes of cache to four gigabytes of cache.
21:02
We also observed a measurable increase in web server performance from this change, which is deeply encouraging. And these are all indications that, you know, we are now starting to reclaim the right things. Actually, we are making better decisions because things are looking pretty positive here. So not only that, but you see a decrease in disk IO because we are actually doing things correctly. We are making the correct decisions.
21:21
And it's not really that often that you get a benefit in performance, disk IO, and memory usage all at once instead of having to trade off between them, right? So it probably indicates that this is the better solution for this kind of era. This also meant that on some workloads, we now had opportunities to stack where we did not have opportunities to stack before,
21:41
like running, say, multiple kinds of ads jobs or multiple kinds of web servers on top of each other. Many machines don't use up all of their resources, but they use up just enough that it's pretty hard to stack something else on top of it, because you're using just enough that it's not actually enough to sustainably run two workloads side by side. So this is another thing where we've managed to kind of push the needle
22:01
just a little bit so that you can make quite a bit more use and efficiency out of the servers that exist. The combination of changing the swap algorithm, using zswap, and squeezing workloads using senpai was a huge part of our operation during COVID. All of these things acting together, we termed TMO, which stands for Transparent Memory Offloading, and you can see some of the results we've had in production here.
22:22
In some cases, we were able to save up to 20% of critical fleet-wide workloads' memory with either neutral or, in some cases, even positive effects on workload performance. So this opens up a lot of opportunities, obviously, in terms of reliability, stacking, and future growth. This whole topic has a huge amount to cover. I really could just do an entire talk on this.
22:41
If you want to learn more, I do recommend the post, which is linked at the bottom. My colleagues, Johannes and Dan, wrote an article with a lot more depth on how we achieved what we achieved and on things like CXL memory as well. So let's come back to this slide from earlier. We briefly touched on the fact that if bounded, one resource can just turn into another,
23:00
a particularly egregious case being memory turning into IO when it gets bounded. For this reason, it might seem counterintuitive, but we always need controls on IO when we have controls on memory. Otherwise, memory pressure will always just directly translate to disk IO. Probably the most intuitive way to solve this is to try to limit disk bandwidth or disk IOPS.
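In cgroup v2 terms, that style of control is the io.max file; a minimal sketch, where the cgroup path and the device number 8:0 are just examples (check your own device numbers with lsblk):

```python
# Hard-cap a (hypothetical) batch cgroup: 100 MiB/s of reads and 1000 write IOPS
# on device 8:0, with the other dimensions left unlimited ("max").
with open("/sys/fs/cgroup/batch/io.max", "w") as f:
    f.write("8:0 rbps=104857600 wbps=max riops=max wiops=1000")
```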
23:23
However, this doesn't usually work out very well in reality. If you think about any modern storage device, they tend to be quite complex. They're queued devices. You can throw a lot of commands at them in parallel, and when you do that, you often find that, hey, magically it can do more things. The same reason we have IO schedulers, because we can optimize what we do inside the disk.
23:40
Also, the mixture of IO really matters, like reads versus writes, sequential versus random. Even on SSDs, these things tend to matter. And it's really hard to determine a single metric for loadedness for a storage device, because the cost of one IO operation or one block of data is extremely variable depending on the wider context.
24:01
So it's also really punitive to just have a limit on, you know, how much can I write, how many IOPS can I do, because even if nobody else is using the disk, you're still slowed down to this level. There's no opportunity to make the most of the disk when nobody else is doing anything, right? So it's not really good for this kind of best effort, bursty work on a machine, which we would like to do.
24:22
So the first way that we try to avoid this problem is by using latency as a metric for workload health. So what we might try and do is apply a maximal target latency for IO completions on the main workload. And if we exceed that, we start dialing back other cgroups with looser latency requirements back to their own configured thresholds. What this does is it prevents an application from thrashing on memory
24:41
so much that it just kills IO across the system. This actually works really well for systems where there's only one workload, but the problem comes when you have a multi-workload stacked case like this. Here we have two high-priority workloads which are stacked on a single machine. One has an io.latency of 10 milliseconds, the other has 30 milliseconds. But the problem here is as soon as workload one gets into trouble,
25:03
everyone else is going to suffer, and there's no way around that. We're just going to penalize them, and there's no way to say, you know, how bad is the situation really, and is it really then causing the problem? This is fine if the thing you're throttling is just best effort, but here we have two important workloads, right? So how can we solve this?
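For reference, the io.latency setup being described is just a per-cgroup latency target on a device; a minimal sketch before we look at the solution, with hypothetical cgroup names, an example device, and targets expressed in microseconds:

```python
# Protect workload one to roughly a 10ms IO completion latency on device 8:0;
# give workload two a much looser 30ms target, so it gets throttled first
# whenever the disk is contended.
with open("/sys/fs/cgroup/workload-one/io.latency", "w") as f:
    f.write("8:0 target=10000")
with open("/sys/fs/cgroup/workload-two/io.latency", "w") as f:
    f.write("8:0 target=30000")
```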
25:20
So our solution is this thing called io.cost, which might look very similar at first, but notice the omission of the units. These are not units in milliseconds. These are weights in a similar way to how we do CPU scheduling. So how do we know what 40, 60, or 100 mean in this context? Well, they add up to 200. So the idea is if you are saturating your disk, you know, best effort will get 20% of the work,
25:44
workload one will get 50, and workload two will get 30. So it balances out based on this kind of shares or weights-like model. How do we know when we reach this 100% of saturation, though? So what io.cost does is build a linear model of your disk over time. It sees how the disk responds to these variable loads passively,
26:03
and it works based on things like, you know, read or write IO, whether it's random or sequential, the size of the IO. So it boils down this quite complex operation of, you know, how much can my disk actually do into a linear model, which it handles itself. It has kind of a QoS model you can implement, but there's also a basic on-the-fly model using queue depth.
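A sketch of what that weights-based setup might look like, matching the 100/60/40 split described above; the device, the QoS latency targets, and the cgroup names are all examples, and the io.cost.qos step is optional (without it, on kernels with the io.cost controller, it falls back to its built-in queue-depth based estimate):

```python
CG = "/sys/fs/cgroup"

# Per-cgroup proportional weights; they only have meaning relative to each other.
for name, weight in (("workload-one", 100), ("workload-two", 60), ("best-effort", 40)):
    with open(f"{CG}/{name}/io.weight", "w") as f:
        f.write(f"default {weight}")

# Optionally tell io.cost what "saturated" means for a device by giving it QoS
# latency targets (here: 95th percentile read/write completion latency of 5ms).
with open(f"{CG}/io.cost.qos", "w") as f:
    f.write("8:0 enable=1 ctrl=user rpct=95.00 rlat=5000 wpct=95.00 wlat=5000 min=50.00 max=150.00")
```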
26:22
So you can read more about it in the links at the bottom. I won't waffle on too much, but it is something which you can use to do kind of effective IO control. In the old days, I came to this room and talked about cgroupv2, and the historical response was basically, that's nice. Docker doesn't support it, though, so please leave. I've had a nice chat with some Docker lads.
26:41
No, the Docker people are very nice, and so are all the other container people, and what's happened is we have it almost everywhere. Almost everywhere, cgroup v2 is a thing. We have quite a diversity of container runtimes, and it's basically supported everywhere. So even if nothing changes from your side, moving to cgroup v2 means that, you know, you get significantly more reliable accounting for free. We spent quite a while working with Docker and systemd folks
27:02
and so on and so forth to get things working, and we're also really thankful to Fedora for making cgroup v2 the default since Fedora 32, as well as making things more reliable behind the scenes for users. This also, you know, got some people's ass into gear when they had an issue on their GitHub that says it doesn't work in Fedora, so cheers, Fedora people.
27:21
It was kind of a good signal that, you know, this is what we are actually doing. This is what we as an industry, as a technology community, are actually doing, and that was quite helpful. The KDE and GNOME folks have also been busy using cgroups to give better management of the kind of desktop handling. David Edmundson and Henri Chain from KDE in particular gave this talk at Akademy.
27:41
The title of the talk was Using cgroups to make everything amazing. Now I'm not brazen enough to title my talk that, but I'll just let it speak for itself for that one. It basically goes over the use of cgroups and cgroupv2 for resource control and for interactive responsiveness on the desktop. So this is definitely kind of a developing space. Obviously there's been a lot of work on the server side here,
28:01
but if you're interested in that, I definitely recommend, you know, giving the talk a watch. It really goes into the challenges they had and any unique features cgroupv2 has to solve those. Finally, Android is also using the metrics exported by the PSI project in order to detect and prevent memory pressure events which affect the user experience. As you can imagine on Android, interactive latency is extremely important.
28:22
It would really suck if you're about to click a button and then you click it, and that requires allocating memory and the whole phone freezes. I mean, it does still happen sometimes, but obviously this is something which they're trying to work on, and we've been working quite closely with them to integrate the PSI project into Android. Hopefully this talk gave you some ideas about things you'd like to try out for yourself.
28:42
We're still very actively improving kernel resource control. It might have been seven years since we started, but, you know, we still have plenty of things we want to do, and what we really need is your feedback. What we really need is more examples of how the community is using cgroupv2 and problems and issues you've encountered. Obviously everyone's needs are quite different, and I and others are quite eager to know
29:01
what we could be doing to help you, what we could be doing to make things better, what we could be doing to make things more intuitive, because there's definitely work to be done there. And I'll be around after the talk if you want to chat, but feel free to drop me an email, message me on Mastodon. Always happy to hear feedback or suggestions. I've been Chris Down, and this has been seven years of cgroupv2, future of Linux resource control. Thank you very much.