
Systemd @ Facebook - a year later


Formal Metadata

Title
Systemd @ Facebook - a year later
Series Title
Number of Parts
47
Author
Contributors
License
CC Attribution 3.0 Unported:
You may use, modify, copy, distribute, and make the work or its contents publicly available in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers
Publisher
Publication Year
Language
Producer

Content Metadata

Subject Area
Genre
Abstract
We'll be talking about what we learned throughout the past year running systemd in production at Facebook: new challenges that have come up, how the integration process went and the areas of improvement we discovered. We'll also discuss our efforts building a monitoring solution for system services based on systemd.
Transcript: English (automatically generated)
All right, let's get started. Hello everybody, my name is Davide, I'm a production engineer at Facebook, and I'll be talking about what we've been doing with systemd for the past year. Before we begin, here's the agenda for today. I'll start with a quick recap of the story so far.
We'll talk about how we are keeping systemd updated in our fleet and how we are tracking upstream changes. We'll focus on a couple of things we've been working on lately around resource management and service monitoring. Then we'll discuss a few case studies that hopefully showcase a bunch of interesting problems we've seen, and close with a couple of words about advocacy.
So, I was at systemd.conf about a year ago, and at the time we were moving the fleet from CentOS 6 to 7. I'm on the operating system team; my team is responsible for the bare-metal experience of the fleet. We maintain what keeps the physical machines at Facebook running: we maintain the operating system, which is CentOS, we take care of packaging, and we take care of configuration management using Chef, among other things.
Our fleet is made of hundreds of thousands of physical machines spread across various data centers around the world. All these machines run CentOS, and that's what runs the website and everything else. About a year ago we were moving from 6 to 7, and on 6 we had like five or six different crazy ways of supervising services.
On 7 we have systemd, and I'm happy to say that we're now running CentOS 7 and systemd everywhere, which makes me personally very happy, because we managed to get rid of all of these crazy ways of doing service supervision. As part of this we got to migrate a lot of services to systemd.
We got to see a lot of people start building their services with systemd in mind and start leveraging more and more features. One thing we did to help with this is that we started integrating libsystemd in our internal build system, so people are now able to use features like socket activation directly in our daemons, which makes people's lives a bit easier.
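To give a feel for the kind of feature this enables, here is a minimal Python sketch of a socket-activated daemon; it only relies on the standard LISTEN_FDS handover protocol (file descriptors passed starting at fd 3), and the fallback port is a made-up example, not something from the talk.

    import os
    import socket

    SD_LISTEN_FDS_START = 3  # first fd systemd passes to a socket-activated service

    def get_activated_socket():
        # systemd sets LISTEN_FDS to the number of sockets it handed over
        # (a full implementation would also verify LISTEN_PID)
        if int(os.environ.get("LISTEN_FDS", "0")) < 1:
            return None
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM,
                             fileno=SD_LISTEN_FDS_START)

    sock = get_activated_socket()
    if sock is None:
        # not socket-activated: bind and listen ourselves
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind(("127.0.0.1", 8080))
        sock.listen(16)

    while True:
        conn, _ = sock.accept()
        conn.sendall(b"hello from a (possibly) socket-activated daemon\n")
        conn.close()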
This talk will focus primarily on bare metal. Zill and Zoltan are going to give a talk later today about containers, so if you're interested in systemd in containers, I recommend you attend that talk. Let's talk a bit about how we are managing systemd on the fleet.
In general at Facebook we manage machines using Chef, and we have a system to do package updates on machines in a controlled fashion. We can say this package should be at this version, then 1% of the fleet gets this other version, then 2%, 5%, and so on and so forth, and that's the system we use for managing updates of system packages. About a year ago we were on systemd 231, and since then we went through 232, 233 and 234.
Right now we have about half the fleet on 233 and the other half on 234, and I'll explain why in a few slides. We also started testing 235. In general we run CentOS 7 because we want a stable base.
We want to be able to pull security updates in an automated fashion, but we also want to have a modern user space. So we backport a lot of core system components from Fedora, from Fedora Rawhide. These components are systemd, of course, but also a lot of the ancillary ecosystem: we backport things like D-Bus and util-linux and procps and a lot of these basic system tools.
The experience you get using the system is somewhat similar to the experience you would have using a Fedora system, at least from the point of view of a developer. We publish these backports on GitHub and you can get them on that GitHub org; these are just the spec files and whatever patches we have. systemd there is on 235.
This is actually something that came out of last year's conference, because people were interested in these, so we made an effort to get them published, and then Jan was kind enough to make a Copr from these packages. So if you happen to run CentOS 7 and want a modern systemd, you can get binary packages directly from there.
Those are also mostly up to date, and if you need earlier versions, you can go back in the GitHub history. Now, of course, when you have a lot of machines the updates aren't always smooth and things can happen, so I'm going to go over a few interesting things that happen during package updates. The first thing is not actually a systemd-specific thing;
it's something that happens in general when you are dealing with a large fleet of Red Hat-based systems and have to update a package on all of them, and it's generally issues around RPM. Machines can get into bad states for various reasons: you can have issues around power loss, you can have people or processes running kill -9 on things, which can leave the system in
weird and interesting states, and this can result in fun situations. Some things we found: every time we update a major package on the fleet we always get a sizable number of these, and we get issues like duplicated packages, so you end up with a machine that has both systemd 233 and systemd 234 installed at the same time,
which is not ideal. You fix this using something called package-cleanup, which is part of yum-utils, but of course you don't want to do this by hand. So what we do is we have a small shell script that runs at the beginning of every Chef run on the machine; Chef runs every 15 minutes
to converge the machine to the desired state. Before Chef runs, we want to make sure the machine is not terribly messed up, so we run this package-cleanup wrapper that just runs package-cleanup in various ways and tries to resolve the transactions back and forth. That's one thing. The other issue you can have is general RPM database corruption, and this happens especially if you happen to kill RPM in the middle of a transaction:
there's a very good chance you'll end up with a database in a bad state, and there's not really a single recipe to fix this; a lot of the time you have to try various solutions. So we wrote a tool called dcrpm that takes care of this and tries a few remediations back and forth and wiggles the database until the machine is in a non-terrible state.
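As a rough illustration, here's a minimal Python sketch of what such a pre-Chef remediation pass could look like; the exact flags and ordering of our internal wrapper aren't shown in the talk, so treat the command lines here (package-cleanup --cleandupes from yum-utils, and invoking dcrpm with no arguments) as illustrative assumptions.

    import subprocess

    def run(cmd):
        # run a remediation command, but never let it abort the Chef run
        try:
            return subprocess.run(cmd, timeout=300).returncode
        except (OSError, subprocess.TimeoutExpired):
            return 1

    # 1. clean up duplicate packages left behind by interrupted updates
    run(["package-cleanup", "--cleandupes", "-y"])

    # 2. check/repair the RPM database (dcrpm tries several remediations)
    run(["dcrpm"])

    # 3. sanity check: querying the RPM database should now work
    if run(["rpm", "-qa", "--quiet"]) != 0:
        print("rpmdb still unhealthy, manual intervention needed")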
The package-cleanup stuff is not terribly interesting because it's just a shell script. dcrpm is actually somewhat interesting, and we're looking at trying to get it open-sourced; we just finished rewriting it in a way that should be more maintainable. With these remediations together
you tend to get a system in a reasonable state, unless it's really broken because of, say, hardware issues. But then you can end up with other interesting problems. For example, when we did one of the upgrades, I think it was 231 to 232, we saw that we had a lot of machines that would come up, they would do the upgrade, and then suddenly
basically nothing would really work: you run systemctl and it fails, and the machine is really sad. It turns out, for reasons unclear to this day, we ended up with PID 1 running the new version but the systemd libraries from the previous one, and since they're dynamically linked, nothing really works at that point. So my fix for this, which I'm not proud of but works, is
adding to the sequence of remediations: running ldd on the binary for PID 1, grepping for missing libraries, and forcing a reinstall of all the packages to the right version. Surprisingly, this works. It's a fairly crude solution and it's not great.
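A minimal Python sketch of that remediation, assuming the CentOS 7 PID 1 path /usr/lib/systemd/systemd and a plain yum reinstall of the systemd packages as the repair step (the exact package set and reinstall logic we use aren't shown in the talk):

    import subprocess

    PID1_BINARY = "/usr/lib/systemd/systemd"  # PID 1 on CentOS 7

    # ldd prints "not found" for libraries the dynamic linker cannot resolve
    ldd = subprocess.run(["ldd", PID1_BINARY],
                         capture_output=True, text=True)
    missing = [line.strip() for line in ldd.stdout.splitlines()
               if "not found" in line]

    if missing:
        print("PID 1 has unresolved libraries:", missing)
        # force the systemd-related packages back to a consistent version
        subprocess.run(["yum", "reinstall", "-y", "systemd", "systemd-libs"])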
But these remediations together put us in a state where we can update systemd and other major packages on the fleet in a reasonable fashion. The first two are generic, the last one is specific to systemd, and we do similar things for other packages.
Then there's the other side of the coin, where we have to track what's going on upstream and upstream changes. One change that happened a few releases ago was the change of the build system from autotools to Meson. If you're not familiar with it, Meson is this new Python-based build system; upstream systemd transitioned from one to the other, they had both supported in one version, and then they dropped autotools, and that's fine.
We expected to have to do a bunch of work; luckily, the Fedora package already had all the work done, because the Fedora packaging was moved to Meson almost immediately, so we had time to basically rebuild our backport on top of the new packaging, and it was mostly okay. The annoying bit was that Meson on CentOS didn't actually work at all,
so we had to backport Meson and Ninja and a few other ancillary Python things to keep it happy and fed, but that worked well. One good thing we got out of this, though, was some improvements on the compat libraries. The compat libraries are
something really old and cranky that unfortunately we have to deal with. CentOS 7 ships with systemd 219, I believe, a fairly old version of systemd, and around that era there were still the split libraries, so older versions of systemd shipped libsystemd-daemon, libsystemd-login, libsystemd-journal and so on,
whereas newer versions ship just libsystemd with all the symbols in there. It happens that packages like, say, Apache or Samba, and a lot of system packages you really don't want to rebuild, link against the old libraries. So if you want to use the new systemd, you need the old libraries in one way or another, and upstream dropped support for them because they're ancient. We used to have this pretty nasty patch to reinstate them and plumb them into the build system, and that worked.
It was quite a pain to keep it forward-ported, but it was fine. With the move to Meson that patch had to be tossed because it wasn't really workable anymore, so we started looking at alternative solutions, and the thing we came up with was stealing some of the code from that patch, making it into a standalone project, and leveraging the subprojects feature in Meson.
So now we have a standalone thing that, when it builds, picks up a copy of the latest systemd, builds it, builds the compat libraries and links them against the right symbols, and gives you these .so files, and they work. We publish this and it's available on GitHub on that org in case you need it.
We also publish the RPMs that we use in the other repo, the RPM backports, so you can also directly get spec files. We're going to keep this mostly up to date and in sync with systemd, even though you don't really need to, because these symbols are so old that they're not really expected to change.
Now, then there are interesting things that happen because we have a special environment. As I said before, we are half on 233 and half on 234. The reason is that we were rolling out 234 and it was fine; we would lose a few machines, but you always lose a few machines, that's okay.
Then we bumped the shard to take the update to like 50%, and we lost a lot of machines, and that's not great. We started looking at what was happening: you go on these machines and SSH takes five minutes to let you in, you get on the machine and systemd just hangs, you run systemctl and it stops. These things are not good.
You run strace against PID 1 and you see it's hanging trying to open a TTY, trying to open tty0, which is really not good. We pinned this down to the daemon re-exec: it went from 233 to 234, and 234 got stuck trying to open this TTY. After some digging it turns out our container manager, Tupperware, also
happens to fiddle with the TTY, and the code in the kernel that deals with this is pretty old and there's probably a race there. We don't know yet exactly what this is, we're still trying to figure it out, but the current theory is that there's a use-after-free bug in the line discipline code in the TTY subsystem that is triggered sometimes if you call daemon-reexec on a machine that's also running our container manager, which happens to do things to tty0.
In some cases we have an artificial repro of this, but the actual repro is pretty hard to get. So the fix is that Tupperware probably should not fiddle with TTYs at all; that's old code we should not have, so we can fix that.
We'll just make it use a PTY, but there's also a bug in the kernel that our kernel folks are trying to figure out. That's an example of something you're unlikely to hit if you have a few machines, or even a lot of machines, but when you have hundreds of thousands of machines this kind of crazy thing unfortunately comes up. All right, now let's talk about interesting things we're doing with systemd, and the first one is resource management.
We're interested in resource management because we run a lot of things on our machines, and we want to be able to control and make sure the machine is doing what it is supposed to do, and that the process that is supposed to run and does the actual work, say the web server, has the resources it needs and doesn't get contention from random auxiliary services, say the thing that controls power on the machine.
We do this using cgroup2, and I'm not going to talk about cgroup2 in depth because my co-worker Chris is going to do a talk on it tomorrow, so if you're interested in cgroup2 you should attend that. For the purposes of this talk, cgroup2 is a kernel API that lets you set resource limits on processes, and by resource limits I mean things like memory, CPU and IO.
systemd leverages this: it lets you apply limits to your processes and your services, and it also lets you bucket your services, partition them into slices, and apply limits to the slices as a whole. So we use this to bucket the services and apply these limits. That's part of the picture,
that's the enforcement side. You also need a way to tell what's going on, though, and we do that with a small daemon that runs on the boxes, picks up metrics from cgroups, and ships them into our monitoring, so we can get data like: the limit is set here and the current memory usage is there; this is actually working, or oh, this is thrashing, there's a problem here.
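A minimal sketch of what such a metrics collector could look like in Python, assuming a unified cgroup2 hierarchy mounted at /sys/fs/cgroup and using a print call to stand in for the real monitoring pipeline; the slice names are illustrative:

    import os

    CGROUP_ROOT = "/sys/fs/cgroup"  # unified cgroup2 mount point

    def read_value(cgroup, name):
        # cgroup2 exposes one value per file, e.g. memory.current, memory.max
        path = os.path.join(CGROUP_ROOT, cgroup, name)
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return None

    for cgroup in ("system.slice", "workload.slice"):
        current = read_value(cgroup, "memory.current")
        limit = read_value(cgroup, "memory.max")  # "max" means unlimited
        # in the real daemon this would be shipped to the monitoring system
        print(f"{cgroup}: usage={current} limit={limit}")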
The other thing we do is that we have an API in Chef that lets people define and set these changes without having to go to individual hosts and run systemctl edit; the API translates to systemd drop-in override files.
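As a rough illustration of that override-file mechanism (not our actual Chef code), here's a Python sketch that writes a drop-in for a hypothetical myservice.service, setting a memory limit and moving it into a slice, then reloads systemd:

    import os
    import subprocess

    service = "myservice.service"  # hypothetical service name
    dropin_dir = f"/etc/systemd/system/{service}.d"
    os.makedirs(dropin_dir, exist_ok=True)

    # drop-in files override or extend the unit's settings
    with open(os.path.join(dropin_dir, "resource-override.conf"), "w") as f:
        f.write("[Service]\n"
                "MemoryMax=4G\n"           # hard memory limit (cgroup2 memory.max)
                "Slice=workload.slice\n")  # move the unit into another bucket

    # make systemd pick up the new configuration
    subprocess.run(["systemctl", "daemon-reload"], check=True)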
So it applies the changes, and you can say: change the memory limit for this thing, or move this service from this bucket to that other bucket. This is something that's very much in progress, and we're still learning what the right things to do are. This is the general hierarchy we're using right now: there's a system bucket, which is stuff that runs on the box and needs to run on the box
but is not critical to what the box is supposed to do. There's workload, which is what the box is supposed to do, say MySQL or HHVM, the web server, or whatever. And then we have another bucket that's inventively called TBD, which is meant for stuff that we need to run unrestricted; say the hardware folks have to do a stress test on the box, and that doesn't need to be limited.
So the initial idea was: okay, we cap system to four gigabytes of RAM, we leave workload unlimited, it will be great. It's really not. First, because our working set is way larger than four gigs, so if you do that the machine immediately dies. Also, because it turns out
it's not so easy to tell what the workload is in a lot of cases. Say your workload is the web server, but the web server gets configuration data from another daemon; if that daemon is in system and gets capped and becomes really slow, the web server is very angry. So you have to figure something out there. The way we're addressing that is by making a sub-bucket under workload and moving some of these daemons there
so we can control them better. The other thing we're doing is shifting the focus from memory limits, and hard limiting in general, to protection: doing things like systemd's MemoryLow, or memory.low in cgroup2 terms, which give you a protection, a guarantee that
this service will have at least this amount of memory available, rather than a hard limit on this other service. This isn't always the solution, and all of this stuff is hard: it requires you to understand in fair detail how your service works, what its dependencies are, how memory management in the kernel works, and how these apply together with cgroups.
So there's a lot of work to do to make this simpler and easier, and we hope to get to the point where we can give people a tool they can run that gives them an idea of: okay, these are sane defaults for your service based on what we see; start with that and iterate.
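A minimal sketch of the kind of helper that last point is about, assuming the service's cgroup2 path and a deliberately naive heuristic (suggest MemoryLow a bit above the currently observed usage); both the path and the heuristic are purely illustrative:

    # suggest a MemoryLow protection value from current cgroup2 usage
    service = "myservice.service"  # hypothetical service name
    cgroup_path = f"/sys/fs/cgroup/system.slice/{service}/memory.current"

    with open(cgroup_path) as f:
        current = int(f.read())

    # leave ~20% headroom over what the service uses right now
    suggested_low = int(current * 1.2)

    print(f"Suggested starting point for {service}:")
    print("  [Service]")
    print(f"  MemoryLow={suggested_low}")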
Another thing we are interested in is service monitoring. That stems from the fact that, as I said before, we had five or six ways to do service management for bare metal, but they were all fairly blind: we didn't have a good way to pull metrics from services. But systemd does, because systemd knows a lot of things about you, since it's supervising you directly.
It stores them and, crucially, it exposes them over D-Bus. So you get properties like all the timestamp properties: when did I start, how long have I been running, when was I restarted the last time. 235 added something that's really awesome, the NRestarts property. That's by far the main thing people ask me about: I would like to know, is my service flapping?
People are really interested in that, because if your service is flapping it's very likely that something is wrong. So these kinds of things are useful and they're easy to get out of systemd through D-Bus. systemd also gives you status events:
you can hook into systemd on the bus and it will send you a notification whenever a service changes state, say it goes from inactive to active, and so on and so forth. So it basically gives you a view of the state machine and what's going on. Now, the downside is that these are exposed over D-Bus, so you need to talk to D-Bus to get them out, and we started looking at ways to do this well. By the way, you don't strictly need to talk to D-Bus:
you can use systemctl show and it will dump them all. The problem with doing that is that you don't want to run systemctl show in a tight loop on every machine, because that's going to take up a lot of resources and it's fairly brittle, so you'd really want a programmatic way to do this.
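For illustration, a minimal Python sketch of the systemctl show approach, pulling a few of the properties mentioned above for a hypothetical unit (NRestarts needs systemd 235 or newer):

    import subprocess

    unit = "myservice.service"  # hypothetical unit name
    props = "ActiveState,SubState,NRestarts,ExecMainStartTimestamp"

    out = subprocess.run(
        ["systemctl", "show", unit, "--property", props],
        capture_output=True, text=True, check=True,
    ).stdout

    # output is KEY=VALUE, one property per line
    values = dict(line.split("=", 1) for line in out.splitlines() if line)
    print(f"{unit}: state={values.get('ActiveState')}/{values.get('SubState')} "
          f"restarts={values.get('NRestarts')}")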
If you look online, there are a few tiny projects, I'd say, on GitHub that already do this to some extent, and all of them use either python-dbus or libdbus. Which would be great, except after two weeks trying to get that to build in our internal build system we essentially gave up. I'm still not quite sure what was going on there, and it's definitely not an issue with D-Bus itself, it's an issue with how we do things at Facebook, but
long story short, it was just not resulting in binaries that would run. So we started looking at alternatives, and the obvious alternative would be using sd-bus, which has been included in systemd for a while. The downside there is that sd-bus is a plain C API and there aren't really any wrappers for it.
I really didn't want to write this in C, because then I would have to maintain it, and most people in infra at Facebook are Python people, not C people; there are also C people, but they're harder to come by. And also, for a prototype, because we didn't quite know if this would work, writing it in C is not
a great idea, at least from my standpoint. So I started looking at things and I found that the CoreOS folks also had bindings in Go; these bindings don't rely on libdbus, they're standalone, and they work. I don't know Go, but I figured I could pick it up, so I did that and I wrote a POC of this in Go.
It's basically a small daemon that runs on the box, hooks into systemd using the D-Bus API, polls for unit properties and uses subscriptions to get events. It collects this data internally, massages it a bit, and then shoots it out to a few of our monitoring systems, so we can get pretty pictures and data and things like that.
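A rough sketch of the polling half of such a daemon in Python (the real POC is written in Go and also uses D-Bus subscriptions rather than only polling); it watches the restart counter and reports when a service starts flapping. The unit list, interval, and threshold are illustrative:

    import subprocess
    import time

    UNITS = ["myservice.service"]  # hypothetical units to watch
    FLAP_THRESHOLD = 3             # restarts per polling interval

    def restarts(unit):
        out = subprocess.run(
            ["systemctl", "show", unit, "--property", "NRestarts"],
            capture_output=True, text=True,
        ).stdout.strip()
        # output looks like "NRestarts=2"
        return int(out.split("=", 1)[1]) if "=" in out else 0

    last = {u: restarts(u) for u in UNITS}
    while True:
        time.sleep(60)
        for unit in UNITS:
            now = restarts(unit)
            delta = now - last[unit]
            last[unit] = now
            # in the real daemon this would go to the monitoring system
            if delta >= FLAP_THRESHOLD:
                print(f"{unit} restarted {delta} times in the last minute")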
That was fine for a POC, but I was a bit uneasy about doing this in a language I didn't know very well, basically as a hack. But then Alvaro came along. Alvaro works on Instagram and they have the same problem, because they use systemd to manage their services. So he started looking at this, and Alvaro is a better coder than I am and he knows how to use Cython.
So he wrote a wrapper using Cython on top of sd-bus. If you don't know about it, Cython is this magical Python thing that lets you call a C API from Python and expose what you get back as Python objects. Using this wrapper we can talk to systemd through D-Bus and
we get to interact with real Python objects that internally translate to D-Bus calls, which is kind of neat, especially for prototyping, because you get a nice REPL where you can directly poke at things, get properties, and see how it behaves and whether it makes sense. So we're likely going to use this and rewrite my prototype on top of it, and we're going to open source it.
I actually have the repo mostly ready, so it should be out in a couple of weeks. For the Cython wrapper itself we have to figure out how to release it, but we expect to release it at some point, ideally this year. Hopefully people will find these useful, and I would love to get feedback on whether something like this could be useful to you, or if you
have ideas about how you deal with this problem, or ways we could do this better. All right, now let's move on to interesting stories and case studies. By far the main thing that still causes random issues we don't quite understand is D-Bus in production.
The problem with D-Bus is that when the daemon gets sad or angry, systemd also gets sad or angry, and problems with D-Bus cannot really be resolved without rebooting the machine, because D-Bus doesn't really support takeover in place. So most of the time when bad things happen your only resort is rebooting the machine, which is fine for a laptop but not fine for a server.
The other problem is that you can get the system into a state where systemd is still mostly working but the connection is severed, so systemctl is not working: it's either hanging or outright failing with "connection to bus failed". For us that manifests as Chef failing on the box, because Chef cannot manage services anymore, so it will yell at you.
Which is a problem, because then people get all the Chef failures, but they're not really actionable for them: they see it and go, what is this D-Bus stuff? The other thing we found is that it's actually surprisingly easy to bust D-Bus and put it in a bad state. We had a co-worker who was doing tests with user services, and he wrote a thing where, when you would log in on the box,
your bashrc would start a user service that would do some stuff and then start other things. That semi-reliably managed to crash it, not on every machine, but on a good chunk of machines. Unfortunately, a lot of these problems are hard to pin down, and I would love to be able to say, okay,
here's a repro, and file a bug upstream, but I don't have that. What I have is: on X percent of machines, sometimes this happens, and it's really hard to track down. Right now we're mitigating this by basically rebooting machines when necessary and trying to keep D-Bus up to date to get bug fixes. We're also looking at alternatives: there's this dbus-broker project, which is
a replacement for the D-Bus daemon coming out of the bus1 project. We started looking at it and testing it on a small number of machines, and it looks promising, but I don't have any hard data yet on whether it works well at our scale. I'm definitely interested to see how it fares and whether it's going to behave
in a more reliable way, or at least in a way that's easier to remediate in case of issues. Let's move on: RPM macros. This is one that's more of a people problem than a technological problem. A lot of the stuff we build at Facebook is not
large complicated tools; it's like one binary. You have to ship one binary and one systemd unit, and that's it. So we have a tool where you feed it your build config and it spits out an RPM: it makes your spec file, it takes your systemd service, it adds the right macros, and here's your RPM, which is fine.
The macros you get by default from Fedora restart your service on package update, which is also fine, except a lot of people are used to the old design pattern from CentOS 5 and 6, where in Chef you do the package upgrade and then notify your service and ask Chef to restart it.
If you're doing that and you're also restarting on upgrade, you're restarting twice, which doesn't seem like a bad thing, except maybe your service takes quite a while to start, takes up a lot of resources when starting, or starts talking to the world. The fix for this was really easy: the tool lets you disable the
default restart behavior and you're done. The trick was socializing this, understanding that this was actually the problem, and figuring out how to propagate the fix. There are a few things like this, which I'll get to later, that are not necessarily technical issues but more about getting things to be better understood.
Another area of interest for us is the journal and logging. The journal for us is set up to only log to memory, to a small in-memory buffer, and then forward everything to syslog, and we do this because we have a ton of infrastructure based on syslog, including our own security tooling. So we need syslog to keep going, and people are used to syslog; people want to be able to
tail /var/log/messages and see their stuff. On the other hand, we found pretty quickly that when people actually start understanding what the journal is and start playing with it, they really like it. People start using journalctl and go, oh, I can filter by things, this actually works.
So people started asking us: okay, can I use the journal, because I'd like to get more data than just that shitty 10-megabyte buffer? You can, of course: you can enlarge the buffer, you can make it store on disk. When you do that, though, you end up with a double-writing problem, because you're writing both to syslog and to the journal, they both write to disk, and some of our tools are really chatty, so you can end up writing
megabytes or more per second, which is not ideal if you're IO-constrained. It's also an all-or-nothing proposition, because you either do this system-wide or you don't do it at all. What we'd really like is some way to do this on a per-unit basis, so people could say: for this application I want the journal data to have these buffers, go on disk and be persistent, and for that one to be transient.
But we haven't found a good way to do that, so for now what we're telling our customers to do is either take the hit of the double writing, or set things up in such a way that they silence some of the stuff that ends up in syslog, but that causes other problems, so it's not really a tenable solution in the long run.
Now, another fun (for some value of fun) problem we had was loops. I didn't know this, I discovered it when I hit this problem: if there's a dependency loop between units, systemd breaks the loop by removing an arbitrary unit from it,
and it tells you: it puts a log line saying so. We found this out because we would boot a machine, it would come up, and it would be missing tmpfiles and directories that were supposed to be created by systemd-tmpfiles, and then Chef would fail because this directory that's supposed to be here is not here, and it's like, what the hell?
So you look at the logs and you see that line, which is, okay, that's interesting: why? The way we debugged this was using systemd-analyze. systemd-analyze gives you both chart-style plots (time zero the system boots, time one systemd starts, time two the service starts) and dependency graphs,
nice octopus-style graphs with dependencies. The full graph is completely useless for us, because if you plot it you end up needing something the size of this room in graph paper to print it out, but you can make it plot a small subset, and that part is useful. In this case that log line gave us enough of a clue to know that SMC proxy, which is a thing we run, was responsible
somehow. So we kept using systemd-analyze from there, discovered that it had something to do with mounts, and found that somebody had added an fstab entry to auto-mount this network filesystem and made it require SMC proxy, which is fine, except they didn't say it was actually a network filesystem.
So these ended up with conflicting ordering, because you need the filesystem, which needs the service, which needs the network, but the network needs this, and that doesn't work. Fixing it was just adding _netdev to the fstab entry, and then we're done.
But the process of finding it was a bit more interesting than that. I wasn't personally aware that this loop-breaking behavior was a thing, so it took a while to dig through the logs and see that this line was the problem and that this is what was going on. In a somewhat similar vein, another fun problem we had was around transient units and systemd-run.
We have these machines that do builds and testing for mobile apps, and the way they do that is that there are a lot of phones plugged in over USB, essentially, and then we run stuff that does things on the phones. The way the team that manages this does it, they use systemd-run to start
these processes, and they run them with a DeviceAllow policy, so they only talk to the device they're supposed to talk to. And this runs a lot, because we do a lot of tests and we have a lot of phones, and every time it runs it makes a unit, runs the unit, the unit does its thing, it disposes of the unit, all good. But USB cables aren't great; sometimes stuff breaks,
so this can also fail. When a unit fails, systemd lets it stick around; it sticks around in the failed state and it's there. Not a problem if it's one, two, three, five of them. But if you end up with like 10,000 units, there's something in PID 1 that
reads through them and ends up taking a shit ton of CPU: we saw that with 10k failed units you get around 50% CPU usage on the box, and with 30k it goes to around a hundred percent, and that's when people started noticing, because it was kind of interesting. So the fix is putting in a cron job that calls reset-failed. [Audience suggestion, inaudible]
Yes, that's also what I told people to try, but for now the way this was fixed, because people wanted to get a sense of how bad the problem was, is that they added a monitoring counter that tells them how many failed units there are, and then a remediation: if it's past the threshold, call reset-failed. But yes, ignoring the result is likely the better option.
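A minimal Python sketch of that counter-plus-remediation approach; the threshold value and the print standing in for the monitoring counter are illustrative, the rest matches what's described in the talk:

    import subprocess

    THRESHOLD = 1000  # illustrative; tune to where PID 1 starts hurting

    # list failed units, one per line, without headers
    out = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--no-legend"],
        capture_output=True, text=True,
    ).stdout
    failed = len([line for line in out.splitlines() if line.strip()])

    # in the real setup this count is exported as a monitoring counter
    print(f"failed units: {failed}")

    if failed > THRESHOLD:
        # clear the failed state so PID 1 stops tracking all these units
        subprocess.run(["systemctl", "reset-failed"])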
I need to check if there's a way to do that with systemd-run directly or if they need to write a unit. [Audience] I just thought about that: maybe we want to improve systemd-run so that systemd-run can still extract the failed state, but then systemd-run automatically makes sure that the thing goes away.
Oh yeah, that would be awesome. I think that would take care of the case here, because in this case these are failures that are not really actionable: if it failed, it's not because of something we can control.
On to more fun, this time cgroup-related. If you have a service, by default when you terminate it, systemd kills everything in the cgroup. You can change this: you can tell it, don't kill everything in the cgroup, only kill the main process and leave everything else to fend for itself, which we try to discourage people from doing, but people are stubborn and sometimes do their own thing.
When that happens, if you end up with leftover processes, the cgroup stays around, because of course there are still processes in the cgroup. So what happens if you then reload? You change the configuration of that service that before was in this slice, you make it go to this other slice, and then you reload systemd, and
ideally you'd expect it to apply the change. It turns out it doesn't, if there's already a cgroup running with these processes in it, which makes sense. Fixing this is easy: you kill everything, or you move things to the right control group, but this was kind of surprising, and it
caused a bit of consternation for people, because they really wanted to use KillMode=process for reasons that are not particularly interesting here, and the fact that the old cgroup was sticking around wasn't quite as evident. You basically find out because you run
systemctl status and it shows the service as stopped, but it also shows the cgroup is still there, and then you poke inside the cgroup and you see it's there and there are processes running, and then the only resort is to just kill everything. [Audience question] Yes, in this case, yes; the question was whether there is any reason for not using KillMode=mixed in this case.
The reason was that the thing we were starting was not actually the process itself; it was a thing that would spawn a bunch of children and do hand-wavy supervision of these children, and they wanted to be able to replace the supervisor
but not the children, it was something like that. So the real solution is to redesign it and not do that, because you shouldn't do supervision inside your application. But yeah, I don't know if we tried mixed specifically; I remember the folks that had this system were fairly adamant that they wanted to use KillMode=process,
the way it was designed. For this specific thing there's not much we can do, because the current behavior actually makes sense; it's not a problem, it's just something that's not evident at all. Finally, for the most stupid and yet irritating bug I found:
systemd-escape. systemd has this logic that translates device names and paths to unit names, and it uses characters that are also shell control characters. So /dev/something becomes dev-something, with the rest escaped using backslash sequences; now backslash is also a shell control character, so if you take that name and then write, I don't know, systemctl status <that unit> and shell out to it without escaping,
you're going to have a bad time. Chef did that, which made things interesting when we were trying to write a cookbook to manage swap devices in some special way, and it would fail because of weird shell-escaping stuff. We fixed it; the fix was really trivial, and it was basically making sure we shell-escape the unit name.
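To illustrate the failure mode and the trivial fix in Python terms (the actual Chef cookbook is Ruby): unit names produced by systemd-escape can contain backslash escapes, so anything that builds a shell command from them needs to quote. The device path here is hypothetical:

    import shlex
    import subprocess

    # ask systemd how it would name the unit for this (hypothetical) device path
    unit = subprocess.run(
        ["systemd-escape", "--path", "--suffix=swap", "/dev/mapper/vg-swap"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # result contains "\x2d" escapes, e.g. dev-mapper-vg\x2dswap.swap
    print("unit name:", unit)

    # buggy: the backslashes get eaten by the shell
    # subprocess.run(f"systemctl status {unit}", shell=True)

    # fixed: quote before handing the string to a shell
    subprocess.run(f"systemctl status {shlex.quote(unit)}", shell=True)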
This was somewhat embarrassing, because it took a while to figure out that this was what was actually going on, and we were like, we should have caught this before. As part of this, we have a cookbook to manage systemd that is open source; we open-source all of our cookbooks there on GitHub,
and we had a small wrapper there to wrap the systemd-escape function to get a unit name from a path, so you don't have to shell out for it. So this closes the gallery of horrors. I want to spend a few minutes talking about how you do these things while trying not to make people too angry, because of course,
if you look at things from the point of view of the service owner or the application owner, you'd kind of like your system to be stable and sort of static; on the other hand, you want to get new features, and you want to know that your system is stable but also getting updates. The approach we found that works reasonably well for us is trying to communicate things as much as possible to people and make sure people understand why you're doing this:
announcing core package updates widely, announcing changes that are happening and why they're happening and what's going on with them, and giving people a venue to comment, whether it's "oh, this feature is interesting, I'd like to know more, I'd like to use it" or "oh, this is going to cause me trouble",
and doing these updates in such a way that people can react to them and give you early feedback. When we did one of the systemd version upgrades, it happened to cause issues for the same folks that did the phone stuff, because they were doing other cgroup things, and it was fine: we could pin them to the old version for an hour, solve it, and then move on. The other thing we found is that documentation is critical, and while the upstream documentation is great and very detailed,
it is not great if you don't know what you're doing and you're starting from the place of wanting to learn. So we started writing, internally, snippets of: okay, this is the suggested way to do things, to write a basic unit, to do this or that, and we found that doing this
with the customer use case in mind works better. We also encourage people to follow what's going on upstream, because we don't have all the answers; people can read the source code, because this is an open source project, they can go on the mailing list, they can look at things. And finally, we found that talking and giving tech talks internally, both company-wide and by going to a team, sitting in at a team meeting, and talking to them for half an
hour about what they're doing, what systemd is, and what you can do with it, goes a long way towards making people happier and more amenable to using it, leveraging it, and then actually liking it and building cool things with it. With this, thank you very much, and I'm happy to take any questions you might have.
[Phil] Minor clarification: he said all of our cookbooks are open source. They are not, and our lawyers will get angry.
Yes, some of our cookbooks are open source, some of our core cookbooks. Thank you, Phil. [Audience] Okay, I was curious: you were mentioning that one of the things you're having issues with in terms of Chef
integrating with systemd is systemctl and things hanging; is that because of daemon reloads, or something else causing that kind of lockup, aside from the D-Bus issues? So, Chef shells out to systemctl for basically every interaction it does with systemd.
If I have a unit and say I want this service to be started, I want this service to be enabled, I'm adding a new unit or changing a unit, reload systemd: all these things end up calling systemctl. So in any of these interactions, if the system is unhealthy and systemctl either returns an error or hangs, that will lead to Chef failing.
[Audience] What I'm curious about is whether you know what the root cause is on the systemd, say PID 1, side of why those things are hanging. We know it sometimes; the kernel TTY race is one example where we found out what it was. In other cases we don't, and for some we pin it down to D-Bus.
Most of the ones that are unknown, we're fairly sure it's something around the D-Bus calls, but we haven't managed to reproduce it in a way that lets us actually poke at it more and get some real data out of it. [Audience] Regarding the issues with the RPM database and inconsistent packages: have you looked into deploying using Atomic/OSTree or similar stuff?
No, and I'll tell you why: the main thing is that for system packages specifically, we want to try to stay as close as possible to how upstream is built and how our upstream works,
because it makes it a lot easier to keep things not too alien, so people can understand it, also when you interact with people outside. We might well look at that in the future, but I think in the short to mid term we're going to stick with RPM and just deal with these things. The other thing we're doing is
actually trying to make RPM better: we're engaging with the RPM developers and the yum developers there, so hopefully we'll have something there. Well, my time is up. Thank you very much, folks.