
Formal Metadata

Title: oomd
Subtitle: A userspace OOM killer
License: CC Attribution 3.0 Unported
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Abstract
Running out of memory on a host is a particularly nasty scenario. In the Linux kernel, if memory is being overcommitted, it results in the kernel out-of-memory (OOM) killer kicking in. In this talk, Daniel Xu will cover why the Linux kernel OOM killer is surprisingly ineffective and how oomd, a newly open-sourced userspace OOM killer, does a more effective and reliable job. Not only does the switch from kernel space to userspace result in a more flexible solution, but it also directly translates to better resource utilization. His talk will also do a deep dive into the Linux kernel changes and improvements necessary for oomd to operate. Out-of-memory killing has historically happened inside kernel space. On a memory-overcommitted Linux system, malloc(3) and friends will never fail. However, if an application dereferences the returned pointer and the system has run out of physical memory, the Linux kernel is forced to OOM kill one or more processes. This is typically a slow and painful process because the kernel spends an unbounded amount of time swapping pages in and out and evicting the page cache. Furthermore, configuring policy is not very flexible while being somewhat complicated. oomd aims to solve this problem in userspace. oomd leverages cgroup v2 and newly exposed counters and statistics to monitor a system holistically. oomd takes corrective action in userspace before an OOM occurs in kernel space. Corrective action is configured via a flexible plugin system, in which custom code can be written. By default, this involves killing offending processes. This enables an unparalleled level of flexibility where each workload can have custom protection rules. Furthermore, time spent churning pages in kernel space is minimized. In practice at Facebook, we've regularly seen 30-minute host lockups go away entirely.
Transcript: English (auto-generated)
Hi, everyone. Thanks for showing up to listen to my talk. My name is Daniel. I work at Facebook on a team called Kernel Applications. The charter of our team is pretty much to make the Linux kernel more usable from the user-space side of things. One of the projects
I work on is oomd. I want to apologize in advance if I sound weird; my throat feels a little funny, so if I need to cough, that's probably why. So what is oomd? oomd is a userspace out-of-memory killer, and I'll go into a little bit more detail about what an OOM
killer is for those of you who are not familiar with that. It's mostly dependency-free. And when I say dependency-free, I mean if you have a static binary and it's running, you don't really need any other system services to be available. All you really need is the latest Linux kernel. And when I say latest, I mean really latest, plus some special
patches that Johannes at Facebook has created. But I think the patches are going to be upstream pretty soon, so in a couple of months you should be able to pull the latest upstream kernel and things should be fine. oomd, I posit, is deterministic, faster, and more flexible than the kernel OOM killer, and I'll talk about that in later slides as well.
oomd is also open source. It's licensed under GPL 2. You can view the code at the first link, and you can read the very nice documentation that a guy named Thomas from Facebook wrote at the second link. So the agenda for this talk is fairly short, actually. I'm going to go over the motivation, mechanism, and then results behind oomd. And
then I'll leave some time at the end for questions. And if no one has questions, you can get some of your day back. So, motivation. Why create oomd? First, I think we need to back up a little bit and go over why OOM scenarios actually happen. On most Linux hosts, you typically overcommit memory. And overcommitting memory pretty much means
that memory allocations do not fail. Sometimes they can fail, depending on how you have configured memory overcommit. The thinking behind memory overcommit is that most applications that allocate memory don't necessarily use all of the memory they allocate. The classic example is sparse arrays. An application might allocate a bunch of memory for a big
array and then not actually fill up all the entries in that array. However, just because the kernel returns you a pointer and not a null pointer doesn't mean the memory is always available. So if the system actually runs out of physical memory, something has to happen. The kernel will typically try to free up whatever pages it can; for example,
it will try to flush any dirty pages in the page cache, or try to swap some anonymous memory out to swap. But if that fails, the kernel has to come in and do something. And that usually means it will go pick a process, usually the biggest one, and just kill it. There are a couple of ways you can configure kernel OOM killing, and I've listed a few of them here. There's a bunch of knobs you can turn to tell the kernel what
to kill in the event of an OOM. Personally, I think it's a little bit confusing. The knobs are all pretty much numbers. I think some of them go from negative 17 to positive 15, and some of them go from 0 to 1,000. In my opinion, it's not very intuitive, but if you are reading the oom_kill.c file, I think it makes a good
deal of sense; but that's only if you know the implementation. So from the user side of things, I don't think it's very ergonomic, and I think there could be a better way to do it. The kernel OOM killer is also pretty slow to act. If you were here for the resource control at Facebook talk, they alluded to this somewhat briefly. By the time
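The knobs mentioned above live in procfs. As a rough sketch, assuming the ranges documented in proc(5) and using the current shell as a stand-in process:

```shell
# Inspect the kernel OOM killer's per-process tuning knobs.
# oom_score_adj ranges from -1000 (never kill) to +1000 (prefer to kill);
# the legacy oom_adj knob ranges from -17 to +15.
cat /proc/self/oom_score_adj   # adjustment for the current shell, usually 0
cat /proc/self/oom_score       # "badness" score the kernel would rank by
# Exempting a critical daemon would look like this (root required,
# $pid is a hypothetical process ID):
# echo -1000 > /proc/$pid/oom_score_adj
```

Note how the two files use different scales for the same purpose, which is part of the ergonomics complaint in the talk.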
the kernel OOM killer actually kicks in, things are probably already too late for user space. That's usually because the kernel OOM killer tries to protect kernel health, so it doesn't really care too much about what user space is doing, as long as the kernel is making forward progress. One thing that we have seen at Facebook is that under heavy memory pressure, as the kernel executes some instructions, it will fault in
a memory page to access memory, and then after that instruction is done, it will fault out that memory page and fault the code page back in, and then it just repeats that over and over. So an operation that usually doesn't take much time starts taking like five to ten minutes, and all the while user space is livelocked and your application is effectively dead. The kernel
also doesn't have any good context into the logical composition of a system. For example, there could be two processes on a system that you really want killed at the same time: if one dies, the other really should be dead too, because it doesn't do anything useful without the first one. Or there might be another case where two
processes should never be killed at the same time, so at least one of them should always stay alive. It's hard to tell the kernel to do this. There's also not a really great way to customize kill actions. You can't really say, hey, I don't actually want you to SIGKILL or SIGTERM something; I want you to send an RPC, or I want you to send some other kind of notification. Take dockerd, for example. You don't
actually want to kill dockerd; you want dockerd to start reaping containers. There's no great way to do that other than using something like eventfd. But again, that falls into the first thing I mentioned: the kernel is pretty slow in reacting to these kinds of scenarios. The kernel OOM killer is also somewhat non-deterministic. You have to turn all the knobs in
procfs. It's not to say it's impossible to get it right, but if you have a service that forks a lot of processes off, you're kind of racing against the system to set the correct oom_score_adj knobs or whatnot. So, OOMs at Facebook. Facebook runs into a bunch of out-of-memory problems; I've listed some of them here. One of the platforms that suffers out-of-memory
issues is our build-and-test platform that we call Sandcastle. Essentially, every time a Facebook developer uploads some code to be reviewed, Sandcastle builds and tests that code. And Sandcastle typically co-locates these build jobs onto a small group of shared hosts. Building arbitrary, well, not really arbitrary, code can sometimes lead
to issues, because linking takes a lot of memory. Plus, if you build everything in tmpfs, which Sandcastle does happen to do, it can eat up a lot of memory. And so bugs and accidents do happen, and when they do, you can OOM a box. And ideally you don't want to take down the whole box for an extended period of time. We also have a container and service platform
called Tupperware, where developers and service operators can run containers, much like Kubernetes. Bugs do happen, memory leaks happen. And if you use a shared pool of hardware, you don't really want to take out your neighbors because one developer from one service had a somewhat nasty bug. We also have a somewhat more esoteric environment: commodity top-of-rack switches that we call
FBOSS. It's a very resource-constrained environment; the boxes only have 8 gigs of RAM, so it's really easy to OOM the box, actually. For example, if Chef comes along and runs an update, it'll do a bunch of I/O and use a bunch of memory. And then maybe another package update happens at the same time asynchronously. And then maybe the rack switch is serving a lot of traffic. It's really
easy for the box to run out of memory. And in these cases, you don't want the host to lock up or freeze, because you'll take down a whole rack; there's no networking for an entire rack. So you'd like to gracefully shut down some things, such as Chef or the package update. Pretty much all multi-tenant platforms suffer from out-of-memory issues, because bugs and mistakes do happen.
A lot of these platforms choose to turn on panic-on-OOM, because in these scenarios you don't want something non-deterministic. For example, going back to the example of Docker, you don't want to accidentally kill the management daemon and let the tasks or containers run without any management oversight. That could lead to some pretty nasty
bugs. So some of these services will panic on OOM: if the host runs out of memory, it'll shut down the entire box, and the containers will get reassigned to another box somewhere else. While this is logically pretty correct, it's suboptimal in that it wastes resources. There are just servers in the data center spinning, rebooting, and not really doing much else. So there could be a better
solution. oomd is also used for fbtax2. Tejun and Johannes talked about it briefly in their earlier talk, if you weren't here to see it. The summary is that they want fully work-conserving OS resource isolation across applications. In short, it means two workloads should be able to coexist on a machine, and if one starts doing bad things, the
other one shouldn't really be affected. And oomd plays a part in rectifying those pathological cases where the kernel isn't able to protect everything. There's a bunch of links at the bottom if you want to check it out later; a lot of cool stuff going on there. So, moving into mechanisms. How does oomd actually work?
So oomd heavily leverages a new kernel feature called PSI (pressure stall information) that Johannes wrote. Essentially, what PSI gives you is a number between 0 and 100 that tells you how much wall-clock time you have lost due to resource shortages. If it says 0, it means you have not lost any wall-clock time; your workload should be theoretically healthy, barring any bugs you may have introduced yourself. 100 means you're not making any forward
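Concretely, PSI is exposed under /proc/pressure/ as lines of `avg10`/`avg60`/`avg300` percentages of wall-clock time stalled. A minimal sketch of pulling out the 10-second average; the sample text here is fabricated but follows the /proc/pressure/memory format:

```shell
# Parse the "some" avg10 value out of a PSI sample (made-up numbers).
psi_sample='some avg10=12.34 avg60=3.21 avg300=0.50 total=123456789
full avg10=4.56 avg60=1.10 avg300=0.20 total=98765432'
avg10=$(printf '%s\n' "$psi_sample" | awk '/^some/ { split($2, kv, "="); print kv[2] }')
echo "stalled ${avg10}% of wall-clock time over the last 10s"
# prints: stalled 12.34% of wall-clock time over the last 10s
# On a real host with CONFIG_PSI you would read /proc/pressure/memory instead.
```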
progress and something is terribly wrong with resources on your system. So oomd monitors, i.e. keeps a time series of, the memory pressure and I/O pressure, and if it's trending upwards or really high, it'll start performing corrective actions. At the core of oomd is the plugin system, which means it's designed so that people
can customize detection and kill actions. We provide a default OOM detector and OOM killer plugin that are pretty sensible and work across a variety of platforms. We deployed them to a bunch of tiers and they worked really great out of the box. If you want to change them, you can subclass these plugins and override the methods you
want. Pretty standard behavior. oomd doesn't just monitor memory; it also monitors I/O pressure, because PSI covers I/O as well. oomd also monitors swap, because swap is pretty essential for oomd to have enough runway to detect memory pressure. If you don't have swap, it tends to be a lot harder to detect rising memory pressure in time.
So this is the original oomd config. As you can see, it's mostly just JSON, and it's pretty straightforward. We're monitoring system.slice, i.e. the pretty nonessential system services. And we have a kill list, which
is in order. So what it says is: if the host OOMs, if the host runs out of memory, you can never kill sshd, because you don't want to lose SSH access. And then we're using the default oomd detector and OOM killer plugins. If you have custom plugins, you would put their names in there.
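The slide itself isn't reproduced in this transcript, but based on the description, an oomd v1 config of that shape might look roughly like this. The field names here are illustrative guesses, not the exact schema:

```json
{
  "cgroups": [
    {
      "target": "system.slice",
      "oomdetector": "default",
      "oomkiller": "default",
      "kill_list": [
        { "sshd.service": { "kill": "never" } }
      ]
    }
  ]
}
```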
oomd works pretty well. We run it in a bunch of places, and it helps prevent a lot of really bad pathological cases when a host runs out of memory. However, as we've onboarded more and more users and experienced different use cases, it's become apparent that we need to iterate on oomd. We're changing the config file language; we're changing how the thing works
on the backend. We're still iterating and playing around with the details, but I think we're onto something really nice here, and it works really great at helping protect hosts from OOMs. What I have here is the oomd2 config. This is the next iteration on the config. It's mostly pseudocode here; I've circled the pseudocode in a yellow box. What it is essentially saying is that if
the workload slows down by more than 5%, or if system.slice slows down by more than 40%, please kill something that hogs a lot of memory in system.slice. In other words, if your workload experiences a little bit of slowdown, please do something about it. If the nonessential stuff experiences a good amount of slowdown, it doesn't really matter to us as long as the
workload is healthy. I'll flash by the actual config. This is the actual config that would work. I'm not going to leave it up, because then you'll just squint at it, and it's not really important: what it essentially says is what I have outlined here. You might have noticed that the actual config is pretty long and verbose. That's because this oomd2 config isn't
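The rule described above could be sketched as an oomd2-style ruleset like this. This is pseudocode in the spirit of the slide; the plugin names and cgroup paths are illustrative, not the exact syntax:

```
[RULESET]
  [DETECTOR]
    # fire if the workload stalls > 5%, or nonessential services stall > 40%
    pressure_rising_beyond  cgroup=workload.slice  resource=memory  threshold=5
    OR
    pressure_rising_beyond  cgroup=system.slice    resource=memory  threshold=40
  [ACTION]
    # reclaim by killing the biggest memory hog among nonessential services
    kill_by_memory_size     cgroup=system.slice
```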
necessarily meant to be written by end users frequently. It's designed with two use cases in mind. The first use case is that a workload-aware application, such as the orchestration layer or the control plane of a container platform, would dynamically generate these oomd2 configs such that it protects the workload as best it can. And the second use case is, say you're not running a
shared multi-tenant service; you're running a single-platform thing with your custom software on bare metal. One operator might sit down for a couple of days and write a config that works well across all these machines, so you really don't need people tweaking it every day. It's not really meant to be used by, like, desktop Linux users. For example, I
wouldn't really put oomd on my personal machine, as I don't really do things that OOM my box, other than sometimes building things that take too much memory. One interesting implementation detail, not that it's super important because it's just details, is that there's an intermediate representation layer in oomd. We're not fully locked into JSON here. We could
theoretically spend a couple of hours and add a YAML interface, or maybe an iptables-like interface where you can have a config all on one line. Super concise. So, results. How well does oomd actually work? Here we have a graph of memory usage over time on a single host; this is one of the hosts in our build-and-test fleet for Sandcastle. You can see that at some point a build starts and memory spikes really high. And at another point, memory dips really fast. The dip is because oomd came in to kill something, because it detected that memory pressure was too high, and that prevented the box from being locked up for an extended period of time and essentially not being
utilized. Those of you who are very perceptive might notice the Y axis is missing labels and units. That's because the lawyers said I couldn't have numbers. Yeah. But I'm sure you can figure out what this means. This is another graph, of the panic-on-OOM rate before and after an oomd rollout. You'll notice the Y axis doesn't have numbers here either; I'm
not allowed to expose how many hosts we have running this kind of stuff. But you can see that the rate at which hosts panic on OOM dips pretty significantly at a certain point in time, and that's when the oomd rollout occurred. It was 8 a.m. on a Friday, so you have a full day to figure it out if there are bugs.
So, yeah, there's time for questions if anyone has any. Otherwise, you get some of your day back. Yeah. Do we need the mic? In one of your first slides, you mentioned Btrfs. Does it require Btrfs, or can you use it without it? You do not need Btrfs, no.
Yeah, it should be filesystem agnostic, barring priority inversions that we've hit. Yeah, one of the interesting bugs was the mmap_sem thing. It puts processes into uninterruptible sleep under high memory pressure, because they hold mmap_sem and try to do
the readahead thing. And even though oomd tries to kill them, they won't die, because they're stuck doing I/O. Which is kind of nasty. But I think it's been fixed, yeah. All right. Awesome.