systemd @ Facebook in 2019
Formal Metadata
Title: systemd @ Facebook in 2019
Title of Series: All Systems Go! 2019
Number of Parts: 44
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46136 (DOI)
Transcript: English (auto-generated)
00:05
Hello everybody, my name is David, I'm a production engineer at Facebook, and I'll be talking for a bit about what we've been doing with systemd for the past year or so. I've given versions of this talk before, and hopefully every year there's something new that's interesting. I'll start with a quick recap of the story so far, I'll talk for a bit about
00:22
what we're doing for deployment and how that ties into the development workflow we use for systemd. I'll quickly go through a few new features, and I'll try to close with some case studies if there's time. So without further ado, as I said, I'm a production engineer, I work on the operating systems team, my team is responsible for maintaining CentOS on the Facebook fleet.
00:42
We have a lot of machines, as you might imagine, we have a lot of physical machines, all of these machines run CentOS, all of these machines run systemd. We run CentOS 7 on the fleet, we're starting to prep for CentOS 8, but right now everything is on CentOS 7. And by now we've been doing this for a while, we've been running systemd for at least
01:00
3 years on a wide scale, and it's gotten to the point where it's pretty much everywhere. It's been quite interesting seeing internally how things moved and how people reacted to it. When we started doing this, people were fairly skittish, and we had to do a lot of work explaining to people why we were making the effort to move to systemd when we were doing CentOS 7.
01:21
And now we're at the point where it's the opposite, we have people reaching out to our team and to other teams fairly frequently with ideas they have for new features they want to build that might tie into systemd, or how they might leverage new systemd features for what they're doing. And at the same time we've also started doing a lot of development ourselves around systemd and its ecosystem, contributing both to systemd proper and to tools around it, and I'll go over some of these.
01:46
So how do we get systemd on the fleet? We deploy systemd with Chef from RPMs. We don't run the systemd that ships in CentOS; we build it from GitHub, because we want to be able to track what upstream is doing. So at the end of last year we were on 239; when 240 was released, 240 was
02:03
a pretty big release, it took us quite some time to qualify, so we ended up skipping it for deployment. We went from 239 to 241, and then 242, which is running now as of today on 98-ish percent of the fleet. We started playing with 243, it's not in wide deployment yet but we have it running in
02:24
some places; that's probably what I'm going to start working on when I get back from this conference. The backport we use is based on the Fedora packaging, and you can find it there if you're interested. In general this process works pretty well, we've been doing it for a while, and we don't have any major issues with it.
02:41
The main pain point here is that the long tail is annoying. The long tail is pretty small, it's 2% of machines, but when you have a lot of machines, 2% is still quite a bit. And there are a number of reasons for the long tail. One reason is that sometimes you just have broken machines, and broken machines sometimes don't run Chef, sometimes their RPM database is corrupt, sometimes things happen and the system doesn't get updated.
03:02
And for one reason or another they stick around in production and it takes a while for them to go away. I don't care about those that much, because eventually they'll go away, and they're broken, so who cares. The thing that's more annoying is that sometimes when we do a release we have to put exceptions in place, because we will find either a change upstream or a bug or something that affects a specific customer in a
03:23
way that they can't quite update right now, and at the same time we don't want to stop the whole rollout just for them. So we'll pin them to the previous version and then we'll go on. Or sometimes we'll find that something changed and something our customer was doing was either wrong or doesn't fit quite well with the model.
03:41
So you end up tracking these around; we have four or five of these in place right now. I think the oldest goes back to 239. We are fairly diligent at cleaning these up, but sometimes you have to deal with that. Now, as I said, the release process works fairly well, but it does take a while.
04:01
Oftentimes, from the moment when upstream cuts a release to when we deploy it in production, the actual manual work of prepping the RPMs for testing is maybe a couple of days, but it can take quite a while to get to the point where we feel safe rolling it out on the fleet. Part of the reason here is that when we go from one release to another, we don't generally do much testing with what's happening in between.
04:25
We will follow what's going on upstream, but we don't do deployments on the fleet between major releases. So there can be quite a lot of changes that accumulate, and that can lead to last-minute surprises for people. The other thing is that there are basically two people doing this, which is me and Anita. So if one of us ends up under a bus, that's not ideal.
04:43
What we'd really like to be able to do is do development and testing concurrently. Most of the time when people do feature development for systemd, they'll do it on master. Then they'll end up exporting the patch internally and testing it on whatever release we have deployed. This isn't too bad, but it is friction. We would also like to be able to do more and faster integration testing, having better ways to
05:05
find issues early on and have a fast feedback loop, both for our developers and for upstream developers. So I started looking at what we could do there, and we ended up building a little CI/CD pipeline for this. This is not open source, mostly because it ties into internal stuff.
05:22
It's also not particularly rocket science. We take the Fedora packaging, and I have a horrifying shell script that replaces the tarball in there with a tarball made from git master. It runs every day at 10 a.m. It builds the RPM and runs the test suite as part of the build. If that passes, it deploys the RPM on a small number of machines, so we have a daily build running on a small number of machines.
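For illustration only, the shape of such a daily job could look roughly like the sketch below. This is not the actual internal pipeline, which is tied to Facebook infrastructure; the helper scripts and deployment step here are hypothetical stand-ins.

```python
#!/usr/bin/env python3
"""Sketch of a nightly "build systemd from git master" job.

update-spec-tarball.sh and deploy-to-canary.sh are hypothetical stand-ins
for internal tooling; only the general shape follows the talk.
"""
import datetime
import subprocess
import urllib.request

TARBALL_URL = "https://github.com/systemd/systemd/archive/master.tar.gz"
SNAPSHOT = f"systemd-master-{datetime.date.today():%Y%m%d}.tar.gz"


def main():
    # 1. Grab a snapshot tarball of current git master.
    urllib.request.urlretrieve(TARBALL_URL, SNAPSHOT)

    # 2. Swap it into the Fedora-derived packaging (the "horrifying shell
    #    script" mentioned in the talk; hypothetical name here).
    subprocess.run(["./update-spec-tarball.sh", SNAPSHOT], check=True)

    # 3. Build the RPMs; the spec's %check section runs the test suite,
    #    so a test failure fails the build.
    subprocess.run(["rpmbuild", "-ba", "systemd.spec"], check=True)

    # 4. Push the result to a small set of canary machines (internal
    #    tooling, hypothetical name here).
    subprocess.run(["./deploy-to-canary.sh"], check=True)


if __name__ == "__main__":
    main()
```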
05:43
We are working right now on getting this hooked up with the container testing infrastructure as well, because one of the main customers we have for systemd is the container infra. So this way we can find issues early on. Now, this is something pretty simple, and yet it has already let us find a significant number of issues well before a release. And when we cut 242, and later 243, this was a lot faster.
06:04
We have had this running since March. It led to filing maybe about 10 items, between GitHub issues and PRs, for various things we found through it. One thing I want to add soon is integration testing on bare metal as well. I also started looking at the test suite on GitHub that's used for the CentOS-based CI hooked up to the upstream repo.
06:27
I want to start looking to see if we can run those tests internally as well to have better coverage there. So that's on the deployment side. On the development side, as I said, we would like to be able to do faster iteration and faster development.
06:40
And we would also like to be able to leverage the internal tooling we have for doing code review and CI. Right now, the way people do code review for systemd changes tends to be that they make a pastebin equivalent of their patch and one of us looks at it, which is not ideal. We also already know how to do this, because if you look at it, this is kind of like the kernel development process.
07:03
So the current plan is to basically do what the kernel team does. We are putting together an internal systemd repo that will be just a read-only mirror of what's on GitHub, with the same branches and same tags. People will branch off master for feature branches. So if they're working on a thing, they'll branch off master, work on there, and make their PR; hopefully they'll build something from it and test it before they make the PR.
07:25
But you know, this way at least they can get signal on what's going on. When we make releases, we'll branch the release off a pre-release tag, cherry-pick from the feature branches, and cut the release. This also has the benefit that we can get rid of the hairy pile of patches we use right now with the RPM packaging; we'll just have a simple script to grab patches from the git tree.
07:46
This is exactly what the kernel team does. We think it might work and make life easier for us and hopefully lead to having better and faster feedback outside as well, but we'll see. And I'd actually be interested to hear if other folks here do internal development for systemd, what development process you use or if you build tooling around this.
08:05
Now let's quickly go over a few new features that landed recently. I'm not going to spend too much time on this because there have been a lot of other talks from Facebook people on these things, and I'd rather have them talk about what they work on. One thing that hasn't come up yet that is pretty cool and ended up in 243 is ExecCondition.
08:23
ExecCondition is something that Anita developed. It's kind of a hybrid between a Condition and ExecStartPre, where it runs commands before the unit is started, actually before the pre scripts run. And depending on the exit code of the command, it can pass, so it will keep starting the unit; it can fail, marking the unit as failed; or it can skip execution, kind of like when a condition fails.
08:45
Now why would you want to do this? Well, I can tell you why we want to do this. The reason we want to do this is that we want to gently nudge people to do continuous deployment of their tools. So we want to have a tool that checks the binary and if the binary is too old, just refuse to start the service.
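As an illustration of that use case, a small helper wired up via ExecCondition= could look like the sketch below. This is not the actual Facebook check; the path and age threshold are made up. Per the systemd.service documentation, exit code 0 lets the unit start, 1 through 254 skips it like a failed condition, and 255 marks the unit as failed.

```python
#!/usr/bin/env python3
"""Hypothetical ExecCondition= helper: refuse to start a service whose
binary has not been redeployed recently.

Referenced from a unit as, for example:
    ExecCondition=/usr/local/bin/check-binary-age
"""
import os
import sys
import time

BINARY = "/usr/local/bin/myservice"  # hypothetical binary to check
MAX_AGE_DAYS = 30                    # hypothetical staleness policy


def main() -> int:
    try:
        age_days = (time.time() - os.stat(BINARY).st_mtime) / 86400
    except FileNotFoundError:
        print(f"{BINARY} is missing", file=sys.stderr)
        return 255  # abnormal: mark the unit as failed
    if age_days > MAX_AGE_DAYS:
        print(f"{BINARY} is {age_days:.0f} days old, not starting",
              file=sys.stderr)
        return 1    # 1-254: skip the unit, like a failed Condition*=
    return 0        # 0: go ahead and start the unit


if __name__ == "__main__":
    sys.exit(main())
```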
09:02
And then do a bunch of other things, but that's one of the main reasons. And this is a fairly simple and straightforward way to do it. There's a possible improvement where we could maybe have a percent specifier so we don't have to copy the name of the binary there, but that's just sugar. This is actually a pretty good example of, I think, feature development that goes well, because
09:24
we came up with the idea of maybe we should do something like this before Christmas internally. We discussed it, we played with a few ideas, we were in Brno in February for DevConf, we met with the systemd developers, we discussed this with them, we brainstormed on possible designs, and we ended up with: yeah, doing it like this seems to be simple enough and it could work,
09:45
and then it was coded in the months afterwards and it landed in 243. I think this is pretty much the ideal way you want development to go. On the resource management front, there have already been several talks. Tejun and Dan's talk covered resource control in general. Daniel and Anita talked about oomd.
10:01
Johannes is going to talk later today about Senpai. Two things I want to raise: there's DisableControllers, which landed for transient units as well. DisableControllers is quite handy because it allows you to turn off specific cgroup controllers without having to rely on kernel command line flags, so without having to reboot the box. The other thing that landed is a number of OOM-specific controls for the kernel OOM killer, not
10:23
the userspace OOM killer, around cgroup2, notably OOMPolicy, so you can apply OOM settings to specific cgroups. Something else we've been working on for a while is pystemd. Alvaro did a lightning talk on this yesterday. So pystemd is available there. It's a thin Cython wrapper on top of sd-bus.
10:41
It wraps the sd-bus API with the idea of making it easier to interact with systemd, but it also allows you to poke at the bus in general. Right now, this supports pretty much all the D-Bus properties exposed by systemd. It's been working quite well. We've been very happy with it and we've started building quite a lot of tooling around it internally. I would like to see this used more in general because, at least in my experience, it's one of the most stable ways to interact with D-Bus from Python.
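For a flavor of what that looks like, here is a small sketch following the pystemd README; the exact API may differ between versions, and the unit name is just an example.

```python
# Query a unit's state and properties over D-Bus via pystemd.
from pystemd.systemd1 import Unit

unit = Unit(b"ntpd.service")
unit.load()

# Properties map straight to the org.freedesktop.systemd1 D-Bus interfaces.
print(unit.Unit.ActiveState)   # e.g. b'active'
print(unit.Service.MainPID)

# Methods are exposed too; this one needs privileges:
# unit.Unit.Restart(b"replace")
```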
11:10
One thing that landed recently is socket support, so you can do fairly neat or terrifying things depending on your point of view. That's just an example you can import and try, but what that does under the hood is
11:20
that it makes a transient socket, makes a transient service, then forks off a little Python web server. And it ends up being managed as a proper service with a proper socket, which is nice. I mentioned before that we use Chef for config management. We have a cookbook called fb_systemd for managing systemd, on GitHub.
11:41
It's been there for a while. There was quite a bit of work on this in the last half year or so, mostly internals. One thing that's interesting is that when you write on-disk systemd units, you generally end up doing that using templates. Using templates for managing overrides is what we were doing before.
12:01
It's really annoying because you end up writing the same boilerplate code, which is: make the directory, make the template, then delete the directory, clean up the template, reload systemd. It's obnoxious. So I wrote a little custom resource in Chef that lets you drop in an override and internally figures out where it should go. It cleans it up when it needs to be cleaned up. It reloads systemd when it needs to be reloaded.
12:23
This is pretty useful, it's straightforward enough, and the syntax is about the same as the upstream systemd_unit resource in Chef. Then one more thing that Chris Down has been working on lately is a linter for systemd units.
12:40
When people use systemd, they find out about a lot of features that systemd has and they start using them. Some of these features are great. Some of these features we'd really rather they not use. One example is people using KillMode=process and not really understanding what it does. Or people using interesting settings for namespacing without really understanding them.
13:02
So all of these are things that are well suited for linting. There's already a bare-bones one built into systemd-analyze; it's not really a linter, it's more of a consistency checker. This is meant to be more of a general-purpose linting tool where you can define a policy for the things you care about and then it can surface them. We have this running internally. It exists and it works.
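The tool itself is not public yet, but as a toy sketch of the idea only (this is not the actual tool discussed here, the policy below is made up, and real unit files are not strictly INI), a policy-driven check could be as simple as:

```python
#!/usr/bin/env python3
"""Toy unit-file linter: flag directives a site policy discourages.

Only an illustration of the concept; the policy is invented.
"""
import configparser
import sys

# (section, key) -> discouraged value
POLICY = {
    ("Service", "KillMode"): "process",  # tends to leave stray child processes
}


def lint(path):
    # Unit files are close enough to INI syntax for a toy check like this.
    cp = configparser.ConfigParser(strict=False, interpolation=None)
    cp.read(path)
    problems = []
    for (section, key), bad in POLICY.items():
        if cp.has_option(section, key) and cp.get(section, key) == bad:
            problems.append(f"{path}: [{section}] {key}={bad} is discouraged")
    return problems


if __name__ == "__main__":
    issues = [p for path in sys.argv[1:] for p in lint(path)]
    if issues:
        print("\n".join(issues))
    sys.exit(1 if issues else 0)
```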
13:23
We would like to open source it by the end of the year. It's a standalone tool so there's nothing Facebook specific in it. Hopefully people will find it useful and maybe it will help other companies and folks prevent issues there. Okay, I have a couple of horror stories on the same theme of implicit dependencies.
13:41
So on this one we had a bunch of machines where, after we rolled out a new systemd version, I don't remember if it was 241 or 242, we started seeing that NTP was not starting on boot. After considerable digging we discovered that the NTP service in CentOS uses PrivateTmp, which is fine.
14:02
Except PrivateTmp internally takes an implicit dependency on tmp.mount, which I did not know and only found out after digging through this. This would be fine, except these machines were set up in a way that people didn't really notice: they would boot,
14:20
tmp.mount would start and do its thing, but then we would mask tmp.mount in Chef. So you would end up with this unit that would be both active and masked, which is probably not something that's supposed to work. And in fact, while this works in 239, in the sense that by "works" I mean it doesn't complain, on later versions systemd
14:41
will hard fail and refuse to start a unit that happens to have a dependency on a thing that is both active and masked. So yeah, this was not great; this was actually one of the things I mentioned before where we had to pin this fleet to 239 for a while while we figured it out. It took us probably a week working on it on and off, because it wasn't really
15:02
a showstopper; we could keep it running just fine while it was pinned back on 239. So yeah, implicit dependencies are kind of annoying. We had another case like that where we had some hosts where some directories weren't there on boot. After digging we found that the tmpfiles were not being created because systemd-tmpfiles-setup never ran.
15:23
And systemd-tmpfiles-setup depends on local-fs.target, which depends on swap.target. Except one of my colleagues was working on this cookbook called fb_swap to do encrypted swap and a bunch of other things. And he added a whole bunch of masked units as dependencies of swap.target, which makes the whole thing fail, so the entire thing gets pruned, it doesn't exist anymore, and you end up with no tmpfiles.
15:45
The way we debugged this, by the way, both of these problems but this one specifically, was with systemd-analyze. systemd-analyze has two really handy commands, plot and dot. One will give you a bootchart-style plot of the boot, the other will give you the octopus-style dependency graph.
16:00
The dependency graph is completely useless in our case because it ends up being this gigantic thing. But you can tell it to only show you a subset of it, which is very helpful. You can also enable debug logging in PID1 either by killing it with a special signal or by passing a command line flag. This is actually how we found this out, because we ended up seeing the debug message that said this tree is getting pruned because this thing doesn't exist anymore.
16:24
And that's all I have. Questions?
16:49
Hi, you mentioned that you are starting to think of debugging systemd like debugging the kernel. Or developing the kernel, but I guess developing also involves debugging. Can you elaborate a little bit more on that
17:05
analogy with kernel-level stuff? Are you bisecting things, the whole range of what that would mean? So I meant that primarily in terms of development workflow, more than specifically on the debugging side. In terms of development we have a very well honed process where we have
17:21
an internal kernel tree, we use exactly the feature branches and the release branches model. We have automated testing for the kernel, we have CI, we have automated deployments, we have a very good way to understanding either via CI or via AB testing whether a specific change is going to impact things and how it's going to impact them. And we would like to bring that to SystemD as well and eventually to other system software we work on.
17:45
For debugging specific issues, it ends up being very different, actually, I think, than developing the kernel in that sense, and it tends to be very hit or miss. What I personally do is play with the debug logging, play with the tools available on the box;
18:03
surprisingly often, tracing PID 1 ends up being very useful for understanding what's going on. We had a couple of cases where we ended up with PID 1 deadlocked or in bad states. I talked about them last year actually, and in those cases tracing was how we found out what the hell was going on there. We sometimes had interesting and tricky interactions between kernel and userspace, especially when there are API mismatches or things that change.
18:27
But I'd say that doesn't really happen that often nowadays; we got pretty good at that.
18:44
I would assume that at Facebook you are logging a lot of stuff. Do you use journald or do you have a different logging system? Are you logging remotely or locally? We run journald on every machine in the fleet. By default we run journald with a 10 megabyte volatile journal.
19:07
And then we run rsyslog on all the machines, and rsyslog ships the logs off somewhere; I don't know what happens afterwards actually. There is a lot of collection infrastructure managed by the security team for that. The reason we do this is because people really like being able to grep /var/log/messages and stuff like that.
19:25
And there's some amount of tooling that also does automation sometimes based on that. We've been trying to move people towards the journal, because usually what happens is that we give a training internally on this and people realize the journal is nice. And they would really like to use it, and they find out all the journalctl commands they could use, and then they ask me if they can deploy it.
19:45
It's kind of difficult to have both the journal and rsyslog running concurrently, because you end up double writing and causing a lot of extra I/O. And we have applications that can be extremely chatty in terms of logging, so if you end up writing 3 gigabytes a second of logs, that's not great.
20:01
One thing that would help: right now the journal is a bit of an all-or-nothing endeavor, because either it's entirely volatile or it's entirely on disk. We've been looking at ways that we could have a per-unit setting here, and that would help transition things over, because in a lot of cases it would be one application team that would really like to use the journal, but maybe everybody else isn't quite there yet.
20:23
I think this is one of the things that I might play with when we start rolling out the next CentOS release, and see if we can couple it with that rollout and maybe nudge people to use this more.
20:41
Have you looked into using rsyslog or syslog-ng with their journal modules, which can slurp up the journald entries as well as the syslog entries and other things together, such that those who want plain syslog can have it, those who use the journal can have it, and you're writing once and storing once?
21:03
Well, you can do that, but you still end up writing twice though, don't you? Sure, but then when you run journalctl you only get the small buffer you have, because that's kind of the problem there.
21:23
We actually have our rsyslog set up that way right now, I think it uses the imjournal module or something. Yeah, that's kind of the problem, because if you only have a small volatile buffer, then if you actually want to use journalctl, or if you do systemctl status on something, you see no more logs, and that's a bit annoying.
21:47
Regarding the linter tool, I was wondering, since you mentioned it's more than just linting, maybe analyzing too: why not integrate it into systemd-analyze? And if not, you know, why not?
22:00
And then also, with the versions of systemd changing and unit files being different, how do you manage that? So, some of the things that we wanted to do in the linter actually ended up in analyze. One thing specifically was parsing time specs and validating that the time specs, like the OnCalendar ones, were correct; that ended up being implemented as a feature in analyze itself.
22:23
I think implementing the whole linter as part of analyze is something that would probably be possible. I think it would maybe be kind of out of scope for analyze itself especially because we wanted specifically this to be pluggable in terms of policy and have maybe the ability to have more complex policies.
22:42
I don't know how well that would fit there. Chris is actually there and he can answer that more usefully than I can most likely. Thank you. So another reason is because if you look inside the systemd source like one of the
23:04
things we have is a lot of little bits of unit state. We have a lot of little bits of "go look for this thing and store it somewhere". We don't really have a way to consolidate that back to here is how the unit file looks, or here is how, you know, you can get something in systemctl show.
23:22
But it's hard to map that back to here is how we got to that outcome. And that's often the thing we need to know: how did we get here. So currently it's hard to just make it part of systemd-analyze, because, I mean, you can do it, but you would again have to kind of reinvent the unit parser and do all kinds of other stuff.
23:43
And I think it doesn't make a whole lot of sense. Thank you, David, for your talk. We've got a quick break. Sorry.