State of systemd @ Facebook
Formal Metadata

Title: State of systemd @ Facebook
Title of Series: All Systems Go! 2018 (13 / 50)
Number of Parts: 50
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/43125 (DOI)
Language: English
Transcript: English (auto-generated)
00:07
Hello everybody, my name is Davide, I'm a production engineer at Facebook, and we'll be talking about what we've been doing with systemd in the past year or so. So, to begin with, this doesn't work. The agenda for today, we'll start with a quick recap of the story so far,
00:24
we'll talk a bit about our progress in tracking upstream internally, we'll discuss a few instances where we were able to leverage systemd to do something especially interesting, and finally we'll close with some case studies and interesting stories. So, without further ado, we have a lot of machines,
00:42
we have hundreds of thousands of machines and a seven-digit number of containers, all of these machines run CentOS, and I'm on the operating systems team; my team is responsible for maintaining CentOS on this infrastructure and, in general, everything that's related to that, so we maintain packaging, the configuration management system,
01:01
and in general, we maintain the bare metal experience; we provide a platform on which other services run, either directly on bare metal or on top of the container platform, which itself runs on bare metal. So, we've been on CentOS 7 for a while now; when I came here last year, we still had a few containers running on CentOS 6, and now everything is on 7, both hosts and containers,
01:22
which also means that by now pretty much everybody has had exposure to systemd for the better part of two years, if not more. And I mean not just people on my team or working closely with the systems, but pretty much every engineer that is deploying a service or dealing with a service directly. And the other thing we did, myself and my team,
01:41
was to travel to pretty much every engineering office, giving trainings, giving talks, trying to make sure people were aware of what systemd was, how to interact with it, where the documentation was, and how to ask questions. In turn, this has led to a lot of people reaching out to us, trying to leverage new features they found in systemd that they thought were interesting, or things they read about online
02:01
that they thought they could use, and wanting in general to integrate more tightly. I'm going to try and cover a few of these later. But before I do that, let's talk quickly about our progress so far. So, we run CentOS 7, but we don't run systemd from CentOS 7, because CentOS 7 ships with 219.
02:20
We backport systemd from Fedora. Last year we were on a mix of 234 and 235, and I'm happy to say that we managed to more or less stay on track with upstream: we went through 235, 236, 237, 238, and these days we run 239 on the vast majority of the fleet. Generally speaking, we do these upgrades roughly in sync with upstream,
02:42
so we would be running whatever is the latest stable, or the stable minus one, depending on whether there are pending issues. We don't backport just systemd; we also backport things related to systemd, notably util-linux, which happens to be a runtime dependency, and the build stack, because we do need to be able to build systemd.
03:01
For these backports, we take the source RPMs straight from Fedora, massage them a bit, and then build them internally. We publish the backports on GitHub in that repo, which has been around for a while now, so if you happen to need to run this stack on your CentOS 7 system, you're welcome to use it. Having done this for a while, we've become fairly comfortable with some workflows.
03:24
From a development standpoint, we found that by far what works best is following the same playbook we follow for the kernel. So engineers, if they need to develop patches on systemd or develop features, they will do so in master, they will send a PR, they will go through the normal PR review process, and then we can backport it internally.
03:41
in the meantime, to test it; and then once it's released, we can backport just that commit. And this works a lot better than the opposite process, which would be developing things internally and then upstreaming them, because the upstreaming part isn't an add-on, it's just something that comes naturally, that you do right at the beginning. And by far I think this is the best approach when dealing with open source,
04:01
with open source projects and contributions. From a release standpoint, we control the systemd version we roll out using Chef. We found that the process that works best here, when we're rolling out a new release, is to start with a small set of hand-picked canary machines from various teams, so these teams can get first exposure and give us feedback,
04:21
especially if there are new features that might affect them. Then later on we start rolling from 1%, 2%, fairly quickly up to 50%, and we found that most of the time we would find issues either in the initial canary stage or when we're at about the midpoint. It's not that common to find issues between 1% and the midpoint,
04:41
because the midpoint is when people actually start noticing and the change gets significant exposure. And the way the rollout works with Chef, it's pretty easy to move back and forth, so we have the ability to roll back if the need arises. Now, I mentioned we are almost everywhere on 239, and as always there's a long tail. So I got these numbers fairly recently,
05:02
but we have about 94% of the fleet on 239, 4% is between 238 and 235, and a lovely 2% on 234 and 233. This is kind of annoying, especially when someone sends a bug report and the answer ends up being: oh, you're on 233? Yeah, we're not going to fix this. You need to upgrade.
05:21
The main reason for this long tail is kernel upgrades. Now, kernel upgrades wouldn't normally be a blocker for systemd, but unfortunately last year we ran into a fairly entertaining issue where both systemd and our container manager were poking the TTY subsystem in the kernel in a way that made it not work very well,
05:41
and it resulted in PID 1 completely hanging and being useless until we rebooted the machine. Now, this has been fixed, and is the microphone working? Oh, okay. This has been fixed in 4.16; that's the commit that fixed it. Tejun actually wrote the patch and upstreamed it.
06:00
But unfortunately, if you're not running that kernel or a kernel with that patch included, you're out of luck, and you need to update the kernel first. And a lot of these systems that are on the long tail are systems where updating the kernel requires a reboot, and a reboot means downtime, and if you think about things like, say, network switches, if you reboot a network switch, the whole rack loses connectivity for a while,
06:20
so this needs to be planned. So that's why we still have a long tail. I'm hoping we can get rid of this in the short term. There's generally a lot of effort in automating kernel upgrades and being able to do this more quickly. The kernel team actually gave talks in the past about this process, if you're interested. Another kernel-related thing we hit
06:40
was a bug in the networking stack related to PrivateNetwork=yes. Basically, when you had PrivateNetwork=yes enabled on some services, you would hit a refcounting bug in the network layer that led to the process using PrivateNetwork=yes ending up stuck in this state and just sticking around forever.
07:01
Not always, but unfortunately, systemd enables PrivateNetwork=yes by default for things like systemd-hostnamed. hostnamed is spawned every time you run hostnamectl, and we have a plugin that runs hostnamectl at the beginning of the Chef run. It's a plugin in Ohai, actually, not in Chef itself. So every time we run Chef, which is every 15 minutes, we would run this.
07:20
Sometimes we would end up with processes stuck in this state. These tend to pile up, and it's not great when you have a lot of them. So we also fixed this: we mitigated it at first by just disabling PrivateNetwork=yes on the affected services, and then we found that this was actually fixed upstream already, so we just backported the patch. We had a couple of other minor issues related to upgrades.
07:41
A fun one was when we accidentally rolled out a new version of systemd on a bunch of machines, because yum was resolving an upgrade transaction for lz4 in a way that also upgraded systemd. That was interesting, especially because it happened on machines that were also affected by the TTY bug, so we suddenly had to reboot a bunch of boxes that we would rather not have.
08:02
Finally, when we built systemd in mock, we had to disable a couple of the tests because they were failing. Actually, before writing the talk, we set about fixing these, and we found there was already a PR upstream fixing the tests we cared about, so that's nice because we don't have to worry about that. Now, that was mostly about systemd running on machines.
08:24
The other side of the story is that if you write software that wants to integrate with systemd, you need to link to libsystemd itself, if you want to use, say, the sd_notify or the sd-bus APIs. And at Facebook, software that doesn't run as part of the operating system itself (not system software, but application or Facebook-specific software) is built using our own internal toolchain.
08:40
or Facebook-specific software, this is built using our own internal toolchain. So we have our own GCC, nglibc, and friends, which also meant integrating systemd and libsystemd inside this toolchain. And we already had a version of libsystemd there, which was 2.3. But that became untenable when people wanted to actually use things like sdbus,
09:00
which were not available in 2.3. Updating this was quite a bit of work because at the time, 2.3 was still using auto tools, so we had to port this to mason. We had to also make mason work in this system. And then, for reasons I won't go into detail, our system relies heavily on static libraries because everything is statically linked together.
09:21
And the Meson build in systemd did not produce static libraries at all. So we ended up fixing this and sent a PR, and after a bit of work, this ended up working. The benefit, though, was that once all this work was done, going from that older version to 239 was trivial. It was literally five minutes of work and running a build, and it was done.
09:41
And I expect in the future it will be a similar story. A bonus side effect of this was also that we were able to get rid of a bunch of hacks we had around NSS. So, on a system, /etc/nsswitch.conf is what tells you which NSS modules to use for given operations. Now, if you build things using a separate toolchain,
10:03
it also fetches the NSS modules from that place, and some of these modules were not available there. So we had instances where, in nsswitch.conf, we were setting something to use, say, nss-myhostname or nss-mymachines, and we would get errors when running some random Python programs because we would try to load modules that didn't exist. We had workarounds for it. It wasn't a big deal,
10:20
but it's best to get rid of these hacks when you can. All right. Now, let's talk a bit about some cool stuff we've found. I want to start with this because it's a feature that's not very well known, I found, and not very well documented, but it's really awesome. So this is how to do zero-downtime restarts.
10:40
If you have a daemon that you want to be highly available, even during updates, you want to be able to restart or update it in a way that doesn't affect ongoing connections. So here's how you do this with systemd. You have your old process and your new process. The old process double-forks and starts the new one. It then uses sd_notify to tell systemd to update the main PID.
11:01
The main PID is what systemd will supervise and treat as the main process of the service that it should keep alive. Once you've done this, the two processes can just figure out on their own how to handle the transition; they could use signals, for example, and one could keep handling the existing connections and then self-terminate. This is up to you.
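To make the handoff concrete, here is a minimal sketch of the notification side in Python, talking to NOTIFY_SOCKET directly rather than going through libsystemd. The helper, the binary path and the way the two processes coordinate are assumptions for illustration, not the code we actually run.

```python
import os
import socket
import subprocess

def sd_notify(message: bytes) -> None:
    """Send a notification message to systemd, if we are supervised by it."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return  # not running under systemd
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(addr)
        sock.sendall(message)

def hand_off() -> None:
    # Old (current main) process: start the new daemon binary, then tell
    # systemd that the new process is now the main PID to supervise.
    # In the real pattern the new process is double-forked so it is not a
    # plain child of the old one; Popen keeps this sketch short.
    new = subprocess.Popen(["/usr/local/bin/mydaemon", "--takeover"])
    sd_notify(b"MAINPID=%d" % new.pid)
    # From here the two processes coordinate however they like, e.g. via a
    # signal: the old one drains its connections and then exits.

# In the new process, once it can actually accept traffic:
#   sd_notify(b"READY=1")   # this is what makes Type=notify worthwhile
```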
11:23
Something else you can do is use the file descriptor store. If all you need is to store some file descriptors and pass them along, you don't even need this in-flight communication: you can just push the file descriptors up into systemd using the FDSTORE facility, systemd will keep them safe for you, and then you can fetch them back later, and they will just be available for you.
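As a rough sketch of the FD store side (again illustrative, not Facebook's code): pushing file descriptors into systemd and getting them back after a restart looks roughly like this, assuming the unit sets FileDescriptorStoreMax= high enough; the NOTIFY_SOCKET handling mirrors the previous snippet.

```python
import array
import os
import socket

def push_fds_to_store(fds, name=b"mysockets"):
    """Hand file descriptors to systemd's FD store over NOTIFY_SOCKET."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return
    if addr.startswith("@"):
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(addr)
        sock.sendmsg(
            [b"FDSTORE=1\nFDNAME=" + name],
            [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", fds))],
        )

def fetch_fds_from_store():
    """After a restart, systemd passes stored FDs back, starting at FD 3."""
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return []
    count = int(os.environ.get("LISTEN_FDS", "0"))
    return list(range(3, 3 + count))  # 3 == SD_LISTEN_FDS_START
```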
11:42
This is a really cool feature that I found not a lot of people know about, and it works really well. We've used this internally in several cases, and it's been pretty great. A nice side effect is that it pushes you to make your service into a Type=notify service, which is also something that people
12:01
don't necessarily want to do on its own because it requires linking to libsystemd, but it is nice because then you have the guarantee that when your service starts and systemd marks it as started, it is actually started: you can use the sd_notify API again to tell systemd "I'm starting, I'm starting, now I'm ready to take connections." All in all, this gives us a much better way
12:22
to write resilient daemons. Shifting gears a bit, let's talk about resource management. I won't spend a lot of time talking about this because Tejun and Ioannis had an awesome talk yesterday about all the work we're doing with cgroup2. If you didn't see it, you should watch it. You should also check out Daniel's talk on oomd.
12:40
All of the features they talked about are either already released in systemd or have been upstreamed and will be released. Notably, in 240 we will land support for memory.min and io.latency. Roman is also working on support for the device controller for cgroup2. The device controller was a cgroup1-specific thing.
13:00
There's no real device controller in cgroup2, but Roman is working on a BPF-based implementation of it that will provide the same API from a systemd point of view. There's a PR up for this that is currently in review. Finally, on this subject, if you happen to deal with container managers or write container managers, I highly recommend you read
13:21
the cgroup delegation document that was merged into systemd a while ago. This document codifies a lot of the conversations we've had over the years on the best way to do things, and it makes it a lot easier to understand all the tricky points and interactions that you might have to deal with if you're writing a container manager or anything really relying on cgroup delegation
13:42
with systemd itself. Something else I talked about in the past, but that is now finally open source, is pystemd. pystemd is a Python library that wraps the sd-bus API, and with it you can talk to systemd
14:00
and interact with the systemd D-Bus object model from Python. This is something that Alvaro from Instagram wrote; he's actually going to give a talk about it later today, so you should attend that if this sounds interesting. We've been using it internally quite a bit: a lot of our infrastructure code is written in Python, and because pystemd only uses libsystemd and sd-bus, it's very easy to use and very reliable.
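For a flavor of what that looks like, here is a small example along the lines of the usage shown in the pystemd README; treat the exact class and property names as approximate and check the project's documentation before relying on them.

```python
# Query and control a unit over D-Bus via pystemd (README-style usage).
from pystemd.systemd1 import Unit

unit = Unit(b"sshd.service")
unit.load()

print(unit.Unit.ActiveState)   # e.g. b'active'
print(unit.Service.MainPID)    # the main PID as systemd sees it

# Starting and stopping go through the same object model:
# unit.Unit.Start(b"replace")
```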
14:21
Something we are working on that uses this directly is a small daemon that fetches service metrics from systemd and feeds them to various monitoring systems. This is something we've been working on internally for a while; I'm hoping to get it open sourced sometime this year
14:42
after it gets a bit wider deployment. On the containers front, Lindsey had a talk yesterday on containers, so I also won't go into detail here, but the short story is that we are trying to leverage systemd more and more in containers, both within the containers and outside the containers.
15:02
Running systemd as PID 1 inside the containers gives us the ability to do proper supervision for services there. It also gives us a PID 1 that is better than BusyBox and in general can deal more reliably with things like dying children. Using nspawn, both as a container engine and for building container images, gives us a solid platform
15:21
for dealing with all of the tricky bits of interacting with the system, one that we don't have to maintain ourselves and that is in line with what everybody else in the community is doing. Finally, we started looking at using portable services, because portable services give us a facility
15:40
for composing services together. If you think about it, we can have things like a single Facebook service, say a daemon for doing service discovery, bundled up in an image, and then the same image can run both on a bare metal host and in a container as-is, using portable services. This is something that
16:01
we are starting to explore now. We are hoping to have more done in the future. Finally, a few words about logging. I said in the past that we don't leverage the journal that much yet; that is mostly still the case. On most of our fleet we run journald, but we kind of neuter it and mostly feed everything to syslog.
16:21
There's a lot of discussion about ways we can make the journal work better for us. The main thing that is missing right now is being able to have per-unit settings, so we can control limits and rates and things like that. We've already sent a PR to control some compression settings for the journal, because the other concern
16:41
is the I/O usage there. For services not using the journal, one thing we found a few months ago was that StandardOutput= writing to a file would truncate the file by default, and a team that was using this needed append support. So they ended up just fixing it themselves and sending a PR upstream, which was nice.
17:01
We're hoping to do more work in this space and to have more to talk about in the future. All right. Now let's talk about some horror stories, or case studies. The first one is something fun that happened on our database fleet. The database fleet is a bit special compared to other machines at Facebook: it still runs on cgroup1,
17:21
and it runs with vm.swappiness set to zero, so they don't want any swap; if there is swap usage on these machines, it's bad. And what happened there was that they pinged me showing me that graph. I can't put axes on that thing, but you can figure out that one is time and the other is swap usage. That is bad. And it happened to correlate exactly
17:41
with when we rolled out systemd 238 on their machines. If you read the release notes, 238 enables memory accounting by default, and memory accounting means that every service gets its own little cgroup with its own specific memory settings. Now, with cgroup1, one of those settings is swappiness.
18:01
This is something that's only in cgroup1, it's not in cgroup2. In cgroup1, every cgroup has this memory.swappiness setting, which is like vm.swappiness, but for a cgroup. And in this case, we found that on these machines all of the slices had memory.swappiness set to 60, which is the kernel default, instead of zero, which is what we wanted. So that's why we were getting that nice graph going up.
18:22
This took a bit of digging to find out, but it turns out the setting is, of course, inherited down the whole hierarchy, and it's inherited from system.slice. system.slice gets created very early on, and what changes the setting is a sysctl. And sysctls are applied by systemd-sysctl.service,
18:41
which runs after system.slice is created. So we have system.slice created with swappiness 60, because that was the default; then we changed the sysctl, but by then everything had already inherited from there. So we filed an issue about this. We also worked around it with a service override for the sysctl service,
19:02
which is basically a find over the cgroup hierarchy to override memory.swappiness. Now, the first version of this code didn't actually work, because you have to walk the hierarchy in the right order. So this is the actual code that works. I do not recommend doing this, but if you hit this kind of problem, you can maybe use the same solution.
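The gist of the workaround, as described, is a walk over the cgroup v1 memory hierarchy that rewrites memory.swappiness everywhere, hooked in as an override on the sysctl service. A rough Python equivalent is sketched below; the paths, ordering and error handling are assumptions for illustration, not the actual cookbook code.

```python
import os

MEMCG_ROOT = "/sys/fs/cgroup/memory"  # cgroup v1 memory controller mount point

def set_swappiness_everywhere(value: int = 0) -> None:
    """Rewrite memory.swappiness for every existing memory cgroup.

    os.walk() yields parents before their children, so the override is
    applied top-down; cgroups created later inherit their parent's value.
    """
    for dirpath, _dirnames, filenames in os.walk(MEMCG_ROOT):
        if "memory.swappiness" not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, "memory.swappiness"), "w") as f:
                f.write(str(value))
        except OSError:
            # Some cgroups may reject the write (e.g. they are being removed);
            # skip them rather than failing the whole run.
            pass

if __name__ == "__main__":
    set_swappiness_everywhere(0)
```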
19:22
Another fun issue we had with cgroup1 was an explosion of zombies on our Git master. The Git master was also running cgroup1 and was also not using pam_systemd, which means that under sshd.service there were a ton of different processes. And it is a Git server, so it has a lot of short-lived processes.
19:41
When we found the box at a very high load, we saw that PID 1 was not reaping children at all, so we had thousands and thousands of zombies on these machines. And it turns out that when that happens, even stuff like ps hangs, because ps these days links to libsystemd and talks to systemd to get information about user sessions. So this was great. It took a fair bit of poking.
20:01
We initially suspected D-Bus problems. It turns out, in the end, it was systemd itself: the way it does waitpid and SIGCHLD processing, before it would process the SIGCHLD, it would run a liveness check on every process in the slice by calling kill with signal 0.
20:20
And it would do this for every process, wait for the results, and then process the SIGCHLD. And if you don't do this fast enough, you keep getting more and more zombies, and you never recover ever again. So one of our engineers came up with an awesome and terrifying way of fixing this, which was using ptrace, with something called ptrace-do, to inject waitpid into systemd and trick it into calling waitpid every
20:41
second or something, which fixed the problem. Then Lennart actually fixed the algorithm here. And this is code that only runs for cgroup1, so it's actually a pretty good example of a code path that we were not expecting to trigger, that we hadn't been looking at, and that we ended up triggering just because we were running cgroup1.
21:01
Finally, a much simpler case, but still fairly entertaining. We had machines with filers from Luster, and for reasons I won't go into in detail here, we do this using NFS, but NFS in user space; it's an open source thing. And NFS in user space uses FUSE, and the way FUSE works,
21:21
when you mount something, after the mount command exits, you end up with a lingering process to manage the mount, because, well, it's in user space. If you do this from Chef, and from Chef you call /bin/mount, well, Chef calls mount and then that process sticks around. But Chef, in our environment, runs as a systemd service. So Chef runs, completes the run, and then the cgroup
21:40
stays there with this process. Unfortunately, the service has TimeoutStopSec= set to 15 minutes, because we want to run Chef every 15 minutes. So after 15 minutes the mounts would go away, which is not exactly what you want. This also took a fair bit of poking. It turns out there's a very simple fix, which is just starting the mount unit instead of calling /bin/mount, because then the mount will run in its own cgroup.
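A sketch of what "start the mount unit instead of calling /bin/mount" can look like, written here in Python for illustration (the real fix lives in the Chef cookbook, and the mount point below is made up): systemd-escape turns the mount point into the unit name, and systemctl start makes systemd own the mount in its own cgroup, assuming a matching .mount unit or fstab entry exists.

```python
import subprocess

def start_mount_unit(mount_point: str) -> None:
    """Ask systemd to start the .mount unit for a path instead of exec'ing mount.

    The mount (and any FUSE helper it spawns) then lives in the mount unit's
    own cgroup, so it is not cleaned up together with the Chef service.
    """
    # e.g. /mnt/shared -> mnt-shared.mount
    unit = subprocess.run(
        ["systemd-escape", "--path", "--suffix=mount", mount_point],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    subprocess.run(["systemctl", "start", unit], check=True)

start_mount_unit("/mnt/shared")  # hypothetical mount point
```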
22:00
You can see the code there; this happened to be in one of our open source cookbooks. But yeah, I hope this gave you an idea of a few entertaining corner cases we hit, and I always like hearing about these kinds of stories, so if you have stories like that, please do share them. And with this,
22:22
I will go with questions, if there's still time. Yeah, one minute. Questions? Yes. I can repeat it, it's fine.
23:08
So the question was about settings like the sysctls. Another possible solution is doing this in the initramfs, and the question was whether we
23:21
considered doing this. We thought about it, but in this case, on one side we needed a fairly quick mitigation for the problem, and on the other, all of these things in our case are configured in Chef, and we really didn't want to have two places where they would be configured, because there's already a fairly
23:40
detailed API, and people are fairly used to setting these things in Chef using the sysctl cookbook. So while we could also add this to the initramfs, that would mean people would then need to change it in yet another place. So I think it's something we could do if the need arose, but I would rather not do it if we can avoid it. But yeah, that is definitely a possible way to mitigate this.
24:10
One more.
24:28
Yeah, can you repeat the question? No, I don't have to. So, when you were talking about the switchover to a new instance of a daemon for the internal
24:41
restart: why not, I mean, I understand that this works, but why not serialize the state to a memfd, pass the memfd to the FD store, and then stop the service and restart it from there? Yep, that's definitely an option. It depends
25:01
on the type of service. Sometimes, if you have, say, in-flight connections that you need to keep handling, it's easier to have the old process finish those and then have the new process take new connections. But yes, that is totally an option too.
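For completeness, here is a sketch of the approach the questioner describes, reusing the hypothetical push_fds_to_store()/fetch_fds_from_store() helpers from the FD store snippet earlier: serialize the state into a memfd, park it in the FD store, exit, and read it back on the next start. os.memfd_create() needs Python 3.8+ on Linux.

```python
import json
import os

def stash_state(state: dict) -> None:
    """Serialize state into a memfd and park it in systemd's FD store."""
    fd = os.memfd_create("daemon-state")      # anonymous in-memory file
    os.write(fd, json.dumps(state).encode())
    push_fds_to_store([fd], name=b"state")    # helper sketched earlier
    # The daemon can now exit; systemd keeps the memfd alive across the restart.

def restore_state() -> dict:
    """On startup, read the state back from the FDs systemd handed us."""
    for fd in fetch_fds_from_store():         # helper sketched earlier
        os.lseek(fd, 0, os.SEEK_SET)
        data = b""
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:
                break
            data += chunk
        return json.loads(data)
    return {}
```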
25:20
Alright, it looks like I'm out of time. Thank you very much.