Senpai - Automatic memory sizing for containers
Formal Metadata

Title: Senpai - Automatic memory sizing for containers
Series: All Systems Go! 2019 (talk 4 of 44)
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46137 (DOI)
Transcript: English (auto-generated)
00:05
So now Johannes Weiner will present Senpai, automatic memory sizing for containers, a tool to find the right cgroup memory limits. Yeah, so let's give him a hand.
00:27
Thank you. Yeah, my name's Johannes. I work on the Facebook kernel team, mostly on memory management and cgroups, so I'm usually in kernel space; Senpai is a user space tool,
00:41
although it's fairly low level, so that's something I've been working on. So this is a tool to automatically configure the memory limits, protections, and so on for cgroups and the workloads running in them. And the background for this is that
01:03
it's a fairly simple premise. We have large data centers and RAM is really expensive. And so we wanna pack our workloads, all the stuff that's running in the Facebook fleet, as tightly as possible to maximize our resources. And if we over-provision them,
01:22
we're wasting them, obviously. If we under-provision them, we get stability problems or issues during peak loads. So yeah, the main thing is we wanna pack as tightly as possible. And in order to do that, we have to know exactly how much memory, how many resources, a workload needs before we fire it up.
01:43
And we have a lot of workloads, so it's kind of hard to cover them all manually. But even if you wanted to do it manually, one problem is that it's actually really hard for people to estimate the size of their memory requirements.
02:03
So we have a lot of people that write high-level applications, and if you ask them, how much memory do you need to execute this, they don't really know. But even for somebody who works on the lower part of the stack, it's actually quite tricky to estimate the exact memory requirements of a workload.
02:22
I'm gonna show this in an example. So here is a simple kernel compile job, because I'm a kernel developer. I put it into a cgroup, not for control, just for accounting, just for tracking what it allocates. And then I let it run, and while it runs,
02:42
I'm just sampling the memory.current file of the cgroup, which gives you the total memory consumption, everything that's allocated to that cgroup. And after four minutes, it's done. And if you look at the peak consumption in our log file,
03:01
it shows around 800 megabytes. That includes everything: the compiler, the source tree, everything the job runs; at the end, the peak consumption is 800 megabytes. Now, I have a suspicion that this is not exactly the amount of memory that I do need. So I set a limit
03:21
of 600 megabytes and let it run again, and it takes the exact same amount of time, right? So the workload would allocate 800 megabytes, but it clearly doesn't need that much, so what's going on there? To understand, you have to look at the memory access distribution of a workload.
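For illustration, an experiment like this can be sketched in a few lines of Python against the cgroup2 filesystem; the cgroup path, the use of memory.max for the limit, and the sampling interval here are assumptions, not details from the talk:

```python
# Sketch: cap a cgroup at 600M and sample memory.current while the job runs.
# Assumes cgroup2 mounted at /sys/fs/cgroup and an existing "kbuild" cgroup
# that the compile job has been started in; both names are hypothetical.
import time
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/kbuild")

def set_limit(nbytes):
    # memory.max is the hard limit; the talk later uses memory.high,
    # which throttles instead of invoking the OOM killer.
    (CGROUP / "memory.max").write_text(str(nbytes))

def sample_peak(interval=1.0):
    peak = 0
    # cgroup.procs lists member PIDs; it reads empty once the job exits.
    while (CGROUP / "cgroup.procs").read_text().strip():
        peak = max(peak, int((CGROUP / "memory.current").read_text()))
        time.sleep(interval)
    return peak

set_limit(600 * 1024 * 1024)
print(f"peak consumption: {sample_peak() / (1 << 20):.0f} MiB")
```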
03:42
If you look at the graph on the bottom, I think it should be readable: on the X axis is the unique data that a workload accesses during its lifetime, and on the Y axis you see the access frequency. Not everything that is allocated is used at the same frequency.
04:01
So if you look to the left, where the access frequency is high, for the compile job that will be things like GCC, glibc, all the stuff that runs on every single source file, right? That memory is pretty hot; basically every instruction is touching it to execute the next line.
04:21
And then as you move to the right, you get things like make's startup, or, in the case of the kernel, the configuration system: it gets parsed first when you start the make job, and then once make figures out which source files it needs to compile,
04:41
it doesn't touch that memory anymore. And then of course there are the source files themselves: as the compiler walks through the tree, it builds one C file into an object file and never looks back, right? So what happened when I set the limit to 600 megabytes? All that changed is that when
05:04
the compiler moves on to the next source file, instead of allocating more memory to cache it, the kernel goes, okay, I'm hitting the 600 megabyte limit, I gotta reclaim something, and it just reclaims the memory that was holding the previous source file, which isn't being used anymore, right? So even with less memory,
05:25
it can basically time-share a smaller amount of memory and use it sequentially. And so, when you see that, obviously the question is, how much can you reduce it, and can you do this with multiple workloads?
05:41
Like, how far can you reduce the limit before you hit that knee and start cutting into memory that's really frequently used? So I ran it again, set to 400 megs this time, and it still completed in about the same amount of time. Then the question is, how far can we go? At 300 megabytes, I eventually aborted the job
06:03
because it didn't look like it was gonna finish, and it was pretty IO-bound the whole time, so after 10 minutes I was like, okay, this is not gonna finish. Yeah, so the takeaway from this is that it needs somewhere between 300 and 400 megabytes to complete normally,
06:22
which is a lot less than the 800 megabytes that we initially thought. And obviously, this is a piece of data we would like to have for basically all Facebook jobs. If we look at this, the question is, how much memory are we actually wasting, right?
06:42
So, the tricky bit is to do something like this at scale. One problem is that a trial-and-error process like this is really tedious at scale, but the other problem is you can't really do this with a constantly changing software implementation
07:03
and also variable user activity, right? So the kernel job, it's the same files it's compiling every single time. I can run it as many times as I want, it's the same input over and over and I can just modify that one parameter and see what it does. But if we have a long-running service like a web server at Facebook
07:22
that is completely driven by user activity, you really can't do trial and error there. So, this is where Senpai comes in. And the basic idea behind Senpai is
07:43
you create artificial memory pressure on a workload and then you monitor the memory health as it's running to identify where you are on that graph that I showed earlier. Are you pretty much to the right? Are you just cutting off memory that is rarely used or not really reused, or are you cutting into that hot set on the left?
08:04
And now the question is, how do you identify the memory health of millions of different applications? This is based on something I was talking about last year, a kernel feature called PSI,
08:21
which is the pressure stall information metrics. The way they work is they record the time that a process that is trying to run has to wait for resources that are congested. So for example, if you have a cold start
08:42
of an application that's never run before, you'll encounter a bunch of page faults, right? But those page faults would happen whether you have infinite memory or not; the data has just never been accessed, never been cached. But if you wait on a page fault for something that was very recently kicked out of the cache,
09:01
it's called a refault, and something like that would not happen if you had infinite amounts of memory. So when a task enters a page fault and we can identify that the page was only recently evicted from the cache, we can measure the time it takes to get that page back and record it as a stall event.
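For reference, the cgroup2 memory.pressure file that PSI exposes can be read with a few lines of Python; the file format shown is the actual kernel interface, while the parsing helper and the cgroup path are a sketch:

```python
# Sketch: parse a cgroup2 memory.pressure file, which looks like:
#   some avg10=0.22 avg60=0.17 avg300=0.05 total=4658294
#   full avg10=0.16 avg60=0.12 avg300=0.04 total=3320312
# "some" is time when at least one task stalled on memory, "full" when
# all tasks stalled; avgN are percentages over the last N seconds.
def read_pressure(cgroup="/sys/fs/cgroup/workload"):  # path is hypothetical
    pressure = {}
    with open(f"{cgroup}/memory.pressure") as f:
        for line in f:
            kind, *fields = line.split()
            pressure[kind] = dict(
                (k, float(v)) for k, v in (field.split("=") for field in fields)
            )
    return pressure

# e.g. read_pressure()["some"]["avg10"] gives the percent of the last
# 10 seconds that some task spent stalled waiting on memory.
```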
09:22
We can say this is time the process is only spending because there aren't enough resources. And by doing this, we can basically profile the productivity of any given task in the system. We can say, this task is spending X percent of its time waiting for resources, or it's running really fast
09:42
and it's fine. And the reason we originally developed this was to root-cause regressions. We have machines where many things change during the day; different parts of the entire software stack get updated, and sometimes things run slower
10:01
and it's actually really hard to say why they're running slower. And there are some indications, you can look at the page fault rate, things like that, but you're not exactly sure what the exact root cause is. And so PSI was kind of developed to go like, you're waiting for IO, you're waiting for memory. For example, if the memory access pattern changed,
10:20
you're now waiting for memory, and you can tell exactly how much: you're waiting 10%, 20% of your total run time. So yeah, quickly identifying where the time is going in a regression was one reason. And the other was to detect problems with total overcommit and automatically remedy them.
10:46
And this is something that, for example, oomd does when memory pressure gets too high and we're spending double-digit percentages of the entire time just waiting on memory. Then we go, okay, this is extreme,
11:00
just kill the workload. Now, this is good at the high end of pressure, but at the very low end, PSI is actually fairly sensitive: it can record events that take microseconds. And this is where Senpai makes use of it,
11:22
because once we have something like PSI in place, what we can do is modify the cgroup memory allowance continuously and monitor the PSI pressure in a feedback loop. And this is how we can tell when we're approaching that knee: when we see pressure kick up,
11:43
we can back off instantly. We know, okay, this is the line. And the idea is to apply enough pressure for PSI and Senpai to detect, but within the tolerance of the workload, before latencies go up too far or throughput drops.
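A minimal sketch of such a feedback loop, assuming cgroup2; the thresholds, step sizes, and cgroup path are made up for illustration, and this is not the actual Senpai implementation (which is linked at the end of the talk):

```python
# Sketch of a Senpai-style loop: probe the memory allowance downward while
# PSI pressure stays below a tolerance, back off quickly when it kicks up.
import time

CGROUP = "/sys/fs/cgroup/workload"   # hypothetical cgroup
TOLERANCE = 0.1   # acceptable "some" avg10 stall, in percent
STEP = 0.02       # adjust the allowance by 2% per interval
INTERVAL = 6      # seconds between adjustments (the default named in the Q&A)

def stall_pct():
    with open(f"{CGROUP}/memory.pressure") as f:
        some = f.readline().split()          # "some avg10=... avg60=... ..."
    return float(some[1].split("=", 1)[1])   # the avg10 value

def read_current():
    with open(f"{CGROUP}/memory.current") as f:
        return int(f.read())

def set_high(nbytes):
    # memory.high throttles allocations via reclaim but never OOM-kills,
    # which, per the Q&A, is why Senpai uses it rather than memory.max.
    with open(f"{CGROUP}/memory.high", "w") as f:
        f.write(str(int(nbytes)))

limit = read_current()
while True:
    time.sleep(INTERVAL)
    if stall_pct() < TOLERANCE:
        limit *= 1 - STEP        # no complaints: probe lower
    else:
        limit *= 1 + 2 * STEP    # pressure kicked up: back off faster
    set_high(limit)
```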
12:06
So this is the same kernel job run with Senpai and you can see the time is still around four minutes. I set it kind of aggressively so there's a couple extra seconds but for most batch workloads you probably wouldn't care.
12:21
And as you can see from recording memory.current, it takes about 335 to 340 megabytes of memory. And obviously memory consumption is not a single value. In the graph, the blue line
12:40
is the memory.current log of a completely unconstrained kernel build. You can see at the very beginning it reads a bunch of data into the cache that it just never ends up using again. And with the red line you see Senpai putting pressure on it, cutting away a whole bunch of that memory
13:02
that is seemingly not needed for the entire duration of the workload. And so we put this on some web servers at Facebook. The blue line, whose value doesn't really matter all that much, mostly indicates the requests per second
13:21
coming into those machines on average. And as you can see, the yellow line, indicating the memory consumption of the web server software, drops from 15 gigabytes to below 10. And the requests per second are unaffected.
13:43
So the load balancer doesn't see the machines struggling to handle requests; it just keeps giving them the same amount of work. But also interesting is not just the memory reduction, the seeming reduction in what we think it's using; you can also see, when you look
14:01
at the yellow line to the left, that it's kind of noisy. And when you look to the right, where Senpai kicks in, the memory footprint follows the load that the machine is experiencing. So it's not just the reduction, it's also giving much more accuracy. And that relates to another project
14:24
we've been working on. Dan Schatzberg, who was talking about resource control yesterday, was also working on this. We have a whole bunch of widely deployed binaries that run on every single machine at Facebook.
14:41
And because they're relatively small compared to the host size, their exact footprint can vary and it doesn't affect the workload all that much, but obviously for development reasons they wanna know if they suddenly need more memory than before.
15:01
And so they were interested in using Senpai to get an exact measure of how much they're actually consuming, how much they're actually taking out of the resource pool. And we had one binary that runs periodically to collect a bunch of statistics and put them into nice graphs.
15:20
It logs memory consumption, logs CPU utilization, all of that. And what they were doing was looking at the RSS of the main process to estimate how they're doing memory-wise: are we regressing, are we using more or less? And their own estimate was that they were using about 200 megabytes.
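For illustration, an RSS-based estimate like that boils down to something like the following sketch, which reads VmRSS from /proc and sees only the main process's resident pages:

```python
# Sketch: the RSS-based estimate reads VmRSS from /proc/<pid>/status.
# It misses page cache the process populates, forked child processes,
# and network buffers, all memory that cgroup accounting does include.
def rss_bytes(pid):
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024  # VmRSS is given in kB
    return 0
```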
15:41
And we put all of this into a cgroup, put Senpai on it, and it showed that their actual footprint was about seven times larger; I think it was one and a half gigabytes or so. And this was all the memory they were missing: they were touching files in the file system, so they were allocating file system cache
16:03
that's not included in that RSS-based estimation; they were forking off collectors; they were using the network. All of that memory is tracked by cgroups, but they weren't tracking it. And yeah, so the current state of Senpai:
16:25
it's more or less, well it's kind of a proof of concept that's growing into a production piece of software. And so right now there's a Python implementation
16:41
and Dan's been working on an oomd plug-in to make it much easier to deploy. And there are a couple of plans that are more medium to long term. One part is that the loop between PSI sampling and making adjustments follows
17:01
a fairly short window right now. The idea is to be able to learn from longer-term trends: if there's a bad pressure event indicating we're way too low on memory right now, Senpai shouldn't forget about it two or three sampling periods down the line; it should have long-term trend tracking,
17:23
which it doesn't do yet. Then there's compressed RAM, instead of having to go to secondary storage when we're running too low on memory. It would allow us to be more aggressive with tuning the memory limit, because if we tune a memory limit
17:41
too aggressively right now, it means we have to go to disk before we detect the error, and once you go to disk, the minimum time you're waiting is a secondary storage IO, which is pretty costly. So we have to converge fairly slowly and move slowly,
18:00
and with compressed backing storage, we could aggressively shrink memory, and if it goes wrong, it wouldn't be that costly but would still be detectable. Then there's a bunch of stuff we could do on the kernel side, like PSI annotations. For example, if you're causing memory pressure, that causes more paging, which takes away IO bandwidth,
18:22
so unrelated IO that is not memory related could also be slowed down. That's also something that's not currently being tracked. Right now the way we're using it is completely fine, because we're applying pressure at a scale where the IO impact is almost negligible,
18:40
but all these things would allow us to move more aggressively and converge on the actual memory consumption faster. But yeah, this is the GitHub repo where the current Python implementation sits,
19:01
if you wanna go check it out. And yeah, this is it. Questions?
19:23
Is it dependent on Tupperware, or does it look directly at cgroups, and does it use cgroups v1 or v2? Oh, thanks, that's a good question. So I try to keep the dependencies very low, especially for the Python thing. The Python implementation works directly on the cgroup2 interface.
19:41
And the reason it's cgroup2 is that there's no PSI in cgroup1. And another feature it's using in the cgroup interface is something called memory.high, which is a memory limit that only throttles but doesn't OOM-kill, right? Because we would never want Senpai to cause kills.
20:02
We want it to be an undetected observer as much as possible. And so we only use memory.high, which exists in cgroup2. And other than that, it's the Python standard library. There's not really anything else, yeah.
20:33
A question: basically, how tight does the loop have to be for it to apply memory pressure
20:42
on processes that are only running for microseconds? Oh, so the current sampling period, the default anyway, is six seconds. It reads pressure and monitors every second, but it doesn't make adjustments more often than every six seconds,
21:02
because when you take away memory, it's completely dependent on the workload when it will notice, right? You can take something away, and it might not access that cache until a minute later; you just don't know. So right now it defaults to six seconds, which seems to work pretty well in practice.
21:23
And yeah, that's something that could be sped up if we have compressed RAM as backing or secondary storage, where you could just move more aggressively, and if we make mistakes it's more forgiving.
21:41
Yeah, so kind of a related question. Can you describe the refaulting behavior? And the related part is, if we're looking at something every six seconds or whatever the period is, like if the process was restarting every seven seconds perhaps, would that mean that all the...
22:04
Thanks. I've forgotten my question now. Refaulting, refaulting. Yes, so the refaulting mechanism is that the kernel remembers when it kicks entries out of the cache,
22:21
and when they come back, we can detect them: this has been kicked out very recently and somebody is reading it back immediately, so we can tell there's an event that means the cache is thrashing. And then PSI can measure how long it takes, and we can conclude that this is taking time
22:42
out of the productivity of the task. Independent of the process? Yes, so the refault technically is a process-independent thing; it's something the cache is experiencing. But we can detect one individual task
23:01
waiting for a specific cache entry to come back. So you can have one refault, and you can have multiple tasks waiting for that thing at the same time, each experiencing their own memory pressure. What was the second part of your question? I don't remember anymore. Okay, so that's it.
23:31
Thanks a lot.