oomd
Formal Metadata
Title: oomd
Title of Series: All Systems Go! 2018
Number of Parts: 50
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/43114 (DOI)
All Systems Go! 2018, part 24 of 50
Transcript: English (auto-generated)
00:06
Hi, everyone. Thanks for showing up to listen to my talk. My name is Daniel. I work at Facebook on a team called kernel applications. The charter of our team is to pretty much make the Linux kernel more usable from the user space side of things. One of the projects
00:23
I work on is oomd. I want to apologize in advance if I sound weird. My throat feels a little funny. So if I need to cough, that's probably why. So what is oomd? oomd is a userspace out-of-memory killer. And I'll go into a little bit more detail about what an oom
00:41
killer is for those of you who are not familiar with that. It's mostly dependency free. And when I say dependency free, I mean if you have a static binary and it's running, you don't really need any other system services to be available. All you really need is the latest Linux kernel. And when I say latest, I mean really latest and some special
01:02
patches that Facebook's Johannes has created. But I think the patches are going to be upstreamed pretty soon. So in a couple of months, you should be able to pull the latest upstream kernel and things should be fine. oomd, I posit, is deterministic, faster, and more flexible than the kernel oom killer. And I'll talk about that in later slides as well.
01:20
oomd is also open source. It's licensed under GPL 2. You can view the code at the first link, and then you can read the very nice documentation that a guy named Thomas from Facebook wrote, at the second link. So the agenda for this talk is fairly short, actually. I'm going to go over the motivation, mechanism, and then results behind oomd. And
01:42
then I'll leave some time at the end for questions. And if no one has questions, you can get some of your day back. So, motivation. Why create oomd? First, I think we need to back up a little bit and go over why oom scenarios actually happen. On most Linux hosts, you typically overcommit memory. And overcommitting memory pretty much means
02:02
that memory allocations do not fail. Sometimes they can fail, depending on how you have configured memory overcommit. The thinking behind memory overcommit is that most applications that allocate memory don't necessarily use all of the memory that they allocate. The classic example is sparse arrays. An application might allocate a bunch of memory for a big
02:22
array and then not actually fill up all the entries in that array. However, just because the kernel returns you a pointer, and not a null pointer, doesn't mean the memory is always available.
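A minimal sketch of what that looks like in practice, assuming a Linux host with the usual overcommit behavior (an oversized request can still be refused depending on vm.overcommit_memory and available RAM plus swap):

```cpp
// Sketch: allocate a large "sparse array" and touch only a few entries.
// Under typical Linux overcommit the allocation succeeds, but physical
// pages are only faulted in for the entries we actually write.
#include <sys/mman.h>
#include <cstdio>

int main() {
    const size_t kSize = 8ULL << 30;  // ask for 8 GiB of anonymous memory
    char *arr = static_cast<char *>(mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (arr == MAP_FAILED) {
        perror("mmap");  // may happen if overcommit settings refuse the request
        return 1;
    }
    // Touch one byte per GiB: only a handful of pages become resident.
    for (size_t off = 0; off < kSize; off += 1ULL << 30) {
        arr[off] = 1;
    }
    std::puts("asked for 8 GiB, but the resident set stays tiny (check RSS)");
    munmap(arr, kSize);
    return 0;
}
```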
02:40
So if the system actually runs out of physical memory, something has to happen. The kernel will typically try to free up some pages where it can, for example, flush any dirty pages in the page cache, or swap some anonymous memory out to swap. But if that fails, the kernel has to come in and do something. And that usually means it will go pick a process, usually the biggest one, and just kill it. There's a couple of ways you can configure kernel oom killing; I've listed a few of them. There's a bunch of knobs that you can turn to tell the kernel what
03:03
to kill in the event of an oom. Personally, I think it's a little bit confusing. The knobs are all pretty much numbers. I think some of them go from negative 16 to positive 15. Some of them go from 0 to 1,000. In my opinion, it's not very intuitive, but if you are reading the oom_kill.c file, I think it makes a good
03:23
deal of sense, but that's only if you know the implementation. So from the user side of things, I don't think it's very ergonomic, and I think there could be a better way to do it.
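For reference, this is roughly what tuning one of those procfs knobs looks like. The sketch below writes /proc/self/oom_score_adj, which accepts values from -1000 (never kill) to 1000 (kill first); lowering the value typically requires privilege, so the example only raises it. It is an illustration of the kernel interface, not something oomd does.

```cpp
// Sketch: make this process more likely to be picked by the kernel oom killer
// by raising /proc/self/oom_score_adj (range -1000..1000, default 0).
// Note: lowering the value below its current setting generally needs
// CAP_SYS_RESOURCE, so we only raise it here.
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ofstream knob("/proc/self/oom_score_adj");
    if (!knob) {
        std::cerr << "could not open oom_score_adj\n";
        return 1;
    }
    knob << 500;  // volunteer this process as an early oom victim
    knob.close();

    std::ifstream check("/proc/self/oom_score_adj");
    std::string value;
    std::getline(check, value);
    std::cout << "oom_score_adj is now " << value << "\n";
    return 0;
}
```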
03:41
The kernel oom killer is also pretty slow to act. If you were here for the resource control at Facebook talk, they alluded to this somewhat briefly. By the time the kernel oom killer actually kicks in, things are already probably too late for user space. That's usually because the kernel oom killer tries to protect kernel health. So it doesn't really care too much about what user space is doing, so long as the kernel is making forward progress. One thing that we have seen at Facebook is that, under heavy memory pressure, when the kernel executes an instruction it will try to fault in
04:02
a memory page to access memory, and then after that instruction is done, it will fault out that memory page and fault the code page back in, and it kind of just repeats over and over. So an operation that usually doesn't take much time starts taking like five to ten minutes, and all the while user space is livelocked and your application is effectively dead. The kernel
04:22
also doesn't have any good context into the logical composition of a system. So, for example, there could be two processes on a system that you really want killed at the same time: if one dies, the other really should be dead too, because it doesn't do anything useful without the first one. Or there might be another case where two
04:40
processes should never be killed at the same time, so at least one of them should always be alive. It's hard to tell the kernel to do this. There's also not a really great way to customize kill actions. You can't really say, hey, I don't actually want you to SIGKILL or SIGTERM something, I want you to send an RPC or some kind of notification. For example, take dockerd. You don't
05:01
actually want to kill dockerd; you want dockerd to start reaping containers. There's no great way to do that other than using something like eventfd. But again, that falls into the first thing I mentioned, where the kernel is pretty slow in reacting to these kinds of scenarios. The kernel oom killer is also somewhat non-deterministic. You have to tune all the knobs in
05:20
procfs. It's not to say it's impossible to get it right, but if you have a service that forks a lot of processes off, you're kind of racing against the system to set the correct oom adjust knobs or whatnot. So, ooms at Facebook. Facebook runs into a bunch of out-of-memory problems. I've listed some of them off here. One of the platforms that suffers out-of-memory
05:43
issues is our build and test platform that we call Sandcastle. Essentially, every time a Facebook developer uploads some code to be reviewed, Sandcastle builds and tests that code. And Sandcastle typically co-locates these build jobs onto a small group of shared hosts. Building arbitrary, well, not really arbitrary, but building code can sometimes lead
06:03
to issues, because linking takes a lot of memory. Plus, if you build everything in tmpfs, which Sandcastle does happen to do, it can eat up a lot of memory. And so bugs and accidents do happen, and when they do, you can oom a box. And ideally you don't want to take down the whole box for an extended period of time. We also have a container and service platform
06:22
called Tupperware, where developers and operators of services can run containers, much like Kubernetes. Bugs do happen, memory leaks happen. And if you use a shared pool of hardware, you don't really want to take out your neighbors because one developer from one service had a somewhat nasty bug. We also have a somewhat more esoteric environment: commodity top-of-rack switches that we call
06:43
FBOSS. It's a very resource-constrained environment. The boxes only have 8 gigs of RAM, and so it's really easy to oom the box, actually. So, for example, if Chef comes along and runs an update, it'll do a bunch of I/O and use a bunch of memory. And then maybe another package update happens at the same time asynchronously. And then maybe the rack switch is serving a lot of traffic. It's really
07:02
easy for the box to run out of memory. And in these cases, you don't want the host to lock up or freeze, because you'll take down a whole rack, the networking for an entire rack. So you'd like to gracefully shut down some things, such as Chef or the package update. And pretty much most multi-tenant platforms suffer from out-of-memory issues, because bugs and mistakes do happen.
07:23
A lot of these platforms choose to turn on panic-on-oom, because in these scenarios you don't want something non-deterministic. For example, to go back to the example of Docker, you don't want to accidentally kill the management daemon and let the tasks or containers run without any management oversight. That could lead to some pretty nasty
07:41
bugs. So some of these services will panic on oom. If the host runs out of memory, it'll shut down the entire box, and then these containers will get reassigned to another box somewhere else. While this is logically pretty correct, it's suboptimal in that it wastes resources. There are just servers in the data center spinning, rebooting, and not really doing much else. So there could be a better
08:00
solution. oomd is also used for fbtax2. Tejun and Johannes talked about it briefly in their earlier talk, if you weren't here to see it. The summary is that they want fully work-conserving OS resource isolation across applications. In short, it means two workloads should be able to coexist on a machine. If one starts doing bad things, the
08:22
other one shouldn't really be affected. And oomd plays a part in rectifying the pathological cases where the kernel isn't able to protect everything. There's a bunch of links at the bottom if you want to check it out later. A lot of cool stuff going on there. So, moving on to mechanisms. How does oomd actually work?
08:40
So oomd heavily leverages a new kernel feature called PSI that Johannes wrote. And essentially what PSI tells you is it gives you a number between 0 and 100 that tells you how much wall clock time you have lost due to resource shortages. If it says 0, it means you have not lost any wall clock time. That means your workload should be theoretically healthy, barring any bugs you may have introduced yourself. 100 means you're not making any forward
09:00
progress and something is terribly wrong with resources on your system. So oomd monitors, or keeps a time series of, the memory pressure and IO pressure, and if it's trending upwards or trending really high, it'll start performing corrective actions.
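As a rough illustration of what that kind of monitoring can look like (a sketch, not oomd's actual code), the snippet below reads the memory pressure file that PSI exposes, /proc/pressure/memory as the interface landed upstream, and pulls out the short-term "some" average:

```cpp
// Sketch (not oomd's code): read the PSI memory pressure file and extract the
// 10-second "some" average, i.e. the share of wall clock time in which at
// least one task was stalled waiting on memory.
#include <fstream>
#include <iostream>
#include <string>

// Parses avg10 from a line like:
//   some avg10=1.23 avg60=0.50 avg300=0.10 total=12345
double parseAvg10(const std::string &line) {
    auto pos = line.find("avg10=");
    return pos == std::string::npos ? 0.0 : std::stod(line.substr(pos + 6));
}

int main() {
    std::ifstream psi("/proc/pressure/memory");  // needs a PSI-enabled kernel
    if (!psi) {
        std::cerr << "no PSI support on this kernel\n";
        return 1;
    }
    std::string line;
    while (std::getline(psi, line)) {
        if (line.rfind("some", 0) == 0) {
            std::cout << "memory pressure (some, avg10): "
                      << parseAvg10(line) << "%\n";
        }
    }
    return 0;
}
```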
09:23
At the core of oomd is the plugin system. That means it's designed so that people can customize detection and kill actions. We provide a default oomd detector and oomd killer plugin that is pretty sensible and works across a variety of platforms. We deployed it to a bunch of tiers and it worked really great out of the box. If you want to change it, you can subclass these plugins and override the methods you want. Pretty standard behavior.
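To make the subclass-and-override idea concrete, here is a purely hypothetical sketch of the pattern; the base class and method names are invented for illustration and are not oomd's real plugin interface:

```cpp
// Hypothetical sketch of the plugin pattern described above; the class and
// method names here are invented and are NOT oomd's real API.
#include <iostream>

// Imagined base class: a detector decides whether corrective action is needed.
class DetectorPlugin {
 public:
  virtual ~DetectorPlugin() = default;
  // Return true if the given memory pressure reading warrants a kill.
  virtual bool shouldAct(double memPressureAvg10) = 0;
};

// A custom detector that only reacts to fairly high pressure.
class ConservativeDetector : public DetectorPlugin {
 public:
  bool shouldAct(double memPressureAvg10) override {
    return memPressureAvg10 > 60.0;  // stricter than an imagined default
  }
};

int main() {
  ConservativeDetector det;
  std::cout << std::boolalpha << det.shouldAct(75.0) << "\n";  // prints: true
  return 0;
}
```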
09:41
oomd doesn't just monitor memory. It monitors IO pressure as well, because PSI actually covers IO pressure too. oomd also monitors swap, because swap is pretty essential for oomd to have enough runway to detect memory pressure. If you don't have swap, it tends to be
10:02
a lot easier to oom the box before oomd has a chance to react. So this is the original oomd config. As you can see, it's mostly just JSON. It's pretty straightforward. We're monitoring system.slice, which holds the pretty non-essential system services. And we have a kill list, which
10:20
is in order. So what it says is that if the host ooms, if the host runs out of memory, you can never kill sshd, because you don't want to lose SSH access. And then we're using the default oomd detector and oomd killer plugins. If you have custom plugins, you would put their names in there.
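As a rough sketch of the semantics just described (illustrative only, not oomd's implementation): walk an ordered kill list and skip anything marked never-kill, such as sshd.

```cpp
// Illustrative sketch (not oomd's implementation) of an ordered kill list in
// which some entries, like sshd.service, are marked as never-kill.
#include <iostream>
#include <string>
#include <vector>

struct KillListEntry {
  std::string cgroup;  // e.g. "system.slice/chef.service"
  bool neverKill;      // protected entries are skipped entirely
};

// Returns the first killable entry, in list order, or an empty string.
std::string pickVictim(const std::vector<KillListEntry> &killList) {
  for (const auto &entry : killList) {
    if (!entry.neverKill) {
      return entry.cgroup;
    }
  }
  return {};
}

int main() {
  std::vector<KillListEntry> killList = {
      {"system.slice/sshd.service", true},   // never lose SSH access
      {"system.slice/chef.service", false},  // fair game under memory pressure
  };
  std::cout << "victim: " << pickVictim(killList) << "\n";
  return 0;
}
```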
10:43
oomd works pretty well. We have it in a bunch of places, and it helps prevent a lot of really bad pathological cases when a host runs out of memory. However, as we've onboarded more and more users and experienced different use cases, it's become apparent that we need to iterate on oomd. We're changing the config file language. We're changing how the thing works
11:02
in the backend. We're still iterating and playing around with the details, but I think we're on to something really nice here, and it works really great in helping protect hosts from ooms. What I have here is the oomd2 config. This is the next iteration on the config. It's mostly pseudocode here; I've circled the pseudocode in a yellow box. What this is essentially saying is that if
11:23
the workload slows down by more than 5%, or if system.slice slows down by more than 40%, please kill something that hogs a lot of memory in system.slice. In other words, if your workload experiences a little bit of slowdown, please do something about it. If the non-essential stuff experiences a good amount of slowdown, it doesn't really matter to us as long as the workload is healthy.
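Expressed as code rather than config, the rule on that slide amounts to something like the following sketch (illustrative only; the real oomd evaluates its own config, and the cgroup names here are placeholders):

```cpp
// Illustrative sketch of the detection rule described above, not oomd's code:
// act if the workload is losing more than 5% of its time to memory pressure,
// or the non-essential system.slice is losing more than 40% of its time.
#include <iostream>

struct PressureReading {
  double workloadAvg10;     // PSI memory pressure of the workload's cgroup, percent
  double systemSliceAvg10;  // PSI memory pressure of system.slice, percent
};

bool shouldKillMemoryHogInSystemSlice(const PressureReading &p) {
  return p.workloadAvg10 > 5.0 || p.systemSliceAvg10 > 40.0;
}

int main() {
  PressureReading p{6.2, 12.0};
  if (shouldKillMemoryHogInSystemSlice(p)) {
    std::cout << "would pick the biggest memory consumer in system.slice\n";
  }
  return 0;
}
```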
11:42
I'm going to flash by the actual config. This is the actual config that would work. I'm not going to leave it up, because then you'll just squint at it. It's not really important, because what it essentially says is what I have outlined here. You might have noticed that the actual config is pretty long and verbose. That's because this oomd2 config isn't
12:00
necessarily meant to be written by end users frequently. It's designed with two use cases in mind. The first use case is that a workload-aware application, such as the orchestration layer or the control plane of a container platform, would dynamically generate these oomd2 configs such that it protects the workload as best it can. And the second use case is, say, you're not running a
12:22
shared multi-tenant service. You're running a single-platform thing with your custom software on bare metal. And one operator might sit down for a couple of days and write a config that works well across all these machines, so you really don't need people tweaking it every day. It's not really meant to be used by, like, desktop Linux users. For example, I
12:41
wouldn't really put oomd on my personal machine, as I don't really do things that oom my box, other than sometimes building things that take too much memory. One interesting implementation detail, not that it's super important because it's just details, is that there's an intermediate representation layer in oomd. We're not fully locked into JSON here. We could
13:00
theoretically spend a couple of hours and add in a YAML interface, or maybe an iptables-like interface where you can have a config all in one line. Super concise. So, results. How well does oomd actually work? Here we have a graph of memory usage over time on a single host. This is a host from one of our build
13:21
and test fleet for Sandcastle. You can see that at some point a build starts and the memory spikes really high. And at another point, the memory dips really fast. The dip is because oomd came in to kill something, because it detected that memory pressure was too high, and prevented the box from being locked up for an extended period of time and essentially not being
13:42
utilized. You might notice, for those who are very perceptive, that the Y axis is missing labels and units. That's because the lawyers said I couldn't have numbers. Yeah. But I'm sure you can figure out what this means. This is another graph, of the panic-on-oom rate before and after an oomd rollout. You'll notice the Y axis doesn't have numbers. This makes more sense. I'm
14:03
not allowed to expose how many hosts we have running this kind of stuff. But you can see that the rate at which hosts panic on oom dips pretty significantly at a certain point in time. And that's when an oomd rollout occurred. And it was 8 a.m. on a Friday, so you have a full day to figure it out if there are bugs.
14:22
So, yeah, there's time for questions if anyone has any. Otherwise, you get some of your day back. Yeah. Do we need the mic? In one of your first slides, you mentioned Btrfs. Does it require Btrfs, or can you use it without it? You do not need Btrfs, no.
14:43
Yeah, it should be file system agnostic, barring priority inversions that we've hit. Yeah, one of the interesting bugs was the mmap_sem thing. It puts processes into uninterruptible sleep under high memory pressure, because a process holds mmap_sem and tries to do
15:00
the readahead thing. And even though oomd tries to kill it, it won't die, because it's stuck doing IO. Which is kind of nasty. But I think it's been fixed, yeah. All right. Awesome.