Chef in Strange Places
Formal metadata
Number of parts | 50
License | CC Attribution 3.0 Unported: You are free to use, adapt, copy, distribute and make publicly available the work or its content, in unchanged or adapted form, for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers | 10.5446/43138 (DOI)
All Systems Go! 2018
Transcript: English (automatically generated)
00:07
So my name is Zeal, and I'm here to talk about how Facebook uses Chef Solo. I'm a production engineer at Facebook. My teammate gave a really good talk about a bunch of the work that we do on the operating systems team, but the focus of this talk is mostly Chef.
00:24
So most of this talk is gonna be about why and how we use Chef Solo. For that to make sense, we first need to talk a bit about how production Chef works, and then finally we're gonna talk about some of the results of this effort to make Chef Solo better.
00:42
So just as a quick background, Chef is a configuration management system. There have been a few talks about Chef already, so I'm not gonna go into too much detail. Chef code is organized into cookbooks and roles: a cookbook is something you would use to install a particular thing, like Apache, and a role is something you can use to group cookbooks
01:01
together. And then the run list is the entry point for a Chef run. You would, say, use this to configure a LAMP environment on a machine. The default way that people use Chef is in an HTTP client-server architecture, where a Chef server serves cookbooks and roles over HTTP and the clients request them in order to perform Chef runs.
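As a rough illustration of how a run list, roles, and cookbooks relate, the following sketch expands a run list into an ordered list of recipes. The `role[...]` and `recipe[...]` entry syntax is Chef's own; the role and recipe names and the `expand_run_list` helper are hypothetical and only stand in for what Chef does internally.

```python
# Hypothetical role definitions; the names are illustrative, not from the talk.
ROLES = {
    "lamp": ["recipe[apache]", "recipe[mysql]", "recipe[php]"],
}


def expand_run_list(run_list, roles):
    """Flatten role[...] entries into an ordered, de-duplicated list of recipes."""
    expanded = []
    for item in run_list:
        if item.startswith("role[") and item.endswith("]"):
            name = item[len("role["):-1]
            # A role may reference other roles, so expand recursively.
            for recipe in expand_run_list(roles[name], roles):
                if recipe not in expanded:
                    expanded.append(recipe)
        elif item not in expanded:
            expanded.append(item)
    return expanded


if __name__ == "__main__":
    print(expand_run_list(["role[lamp]", "recipe[cron]"], ROLES))
```

The order of the run list is preserved, which matters because recipes run in the order in which they appear.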
01:22
So Chef also ships with a tool called Chef Solo. Chef Solo is used to run Chef in a serverless mode. Rather than running Chef in this client-server architecture, you can have the cookbooks and roles present locally on disk and have the Chef client use those rather than talking to the HTTP server.
01:42
In the Chef documentation, this is called local mode. It's kind of either local mode or Chef Solo. If you're reading through the documentation, you'll mostly find local mode, but for the purposes of this talk, I'm just gonna refer to it as Chef Solo. So, the way ProdChef works at Facebook: we have a pretty mature ProdChef setup
02:01
that is based around this sort of dual run list. Across every machine at Facebook, we have a pretty homogeneous environment. They all run CentOS 7, as my teammate Davide mentioned earlier. And we allow teams to customize how their Chef run works. The way this works is we have a run list that's split into two halves. We have the base role that does all the base operating system stuff and sets up systemd,
02:25
Chef itself, things like cron and yum and other things that the operating system needs in order to run. And then the teams can provide their own Chef code via this tier role that runs whatever
02:40
they want. This could be something that sets up HHVM, which is our web server, or MySQL, or whatever. So we have Chef servers distributed throughout our data centers. There's a DNS service discovery mechanism that allows any machine in our fleet to look up what the nearby Chef server is and run Chef using that Chef server.
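The two-part run list described above (a base role owned by the operating systems team, followed by a team-provided tier role) could be composed roughly like this. This is a minimal sketch: the role names and the `build_run_list` helper are assumptions for illustration, not Facebook's actual naming or tooling.

```python
BASE_ROLE = "role[base]"  # hypothetical name for the OS team's base role


def build_run_list(tier_role=None):
    """Return the run list for a host: the base role first, then the optional tier role."""
    run_list = [BASE_ROLE]
    if tier_role is not None:
        run_list.append(tier_role)
    return run_list


if __name__ == "__main__":
    # A host owned by a (hypothetical) web tier.
    print(build_run_list("role[web_tier]"))
    # A host with no team-specific customization.
    print(build_run_list())
```

Putting the base role first means the operating system pieces are configured before any team-specific code runs.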
03:00
And we have a bunch of tooling around this. chefctl and Taste Tester are particularly important for this talk. chefctl is a script that will run Chef, babysit the Chef process to make sure it runs successfully, and then make sure that the output of that Chef run, both the exit code and the logs, goes someplace useful. And Taste Tester is a tool for testing Chef changes that will spin up a development Chef
03:22
server, push your changes to it, and then you can configure a production host to use that development server rather than one of the production servers. Both of these are open source, which I'll talk about later. So overall, we have a really mature production Chef setup. We've spent years fine-tuning this and making it work really well for this
03:42
homogeneous CentOS 7 fleet that we have. But none of this uses Chef Solo, right? So where does Chef Solo enter this story? The reason we started using Chef Solo was actually because of Instagram. When Instagram was acquired, it had a pretty large deployment on AWS.
04:01
They had their own version of Chef. They had their own version of their Chef code. It didn't overlap at all with our own internal tooling. So we basically had these two different versions of Chef in production that we needed to make work somehow. In addition, they were kind of given instructions to move into Facebook
04:23
containers, which would be Tupperware, and they were required to do this pretty quickly. So that, overall, led to them cutting some corners in a way that allowed them to make this deadline, but they still wanted to be able to use a lot of the tooling that Instagram was already comfortable with.
04:41
So they decided to use Chef Solo sort of as a stopgap between what we were doing in production and what they were doing in AWS. So that's kind of where Chef Solo enters this story. So the initial implementation of the Chef Solo toolchain for Instagram used the same run list everywhere. So that sort of dual stage run list that I mentioned earlier,
05:00
they just totally threw that out. At the time, Instagram had fewer than half a dozen different workflows. So for them, it made sense to ship all the Chef code everywhere and run all of it, and then the Chef code would sort of toggle itself on or off, depending on where it was running. So in order to get all the cookbooks onto disk on every environment
05:24
where it needed to run, they would use a package. We have a mechanism for rolling out tarballs over torrent, and they decided to use that as a quick way to get this up and running. They forked chefctl so that it would download this tarball before it ran Chef
05:41
and then use Chef Solo rather than Chef Client to do a Chef Solo run instead of a normal Chef run. And then, as I mentioned, the Chef code itself would inspect the environment, determine whether or not it needed to execute that piece of code, and then just return if it didn't need to execute whatever that code was. And to test Chef changes, you would build a package locally
06:02
using a tool that they had built to build these packages, and then sort of SCP that onto a production environment and just run Chef using that package rather than the one that it would download. So this is pretty different from how ProdChef works. So to make matters worse, we had another team that started off
06:23
doing pretty much the exact same transition right after Instagram finished. So they did pretty much the exact same stuff I had on the last slide. They forked chefctl, or rather, used the fork that Instagram had, and started using Chef Solo with a package. In addition to that, our build system team started using Chef Solo
06:41
to cut down on their startup times for their containers. At the time, their containers took a really long time to start up because the build system needed to install a bunch of build tools inside their containers, so they decided to switch to a long-lived container model where they would run Chef Solo to keep those tools up to date
07:01
rather than restarting the container every time. And another team started talking to us on the operating systems team about using Windows VMs. So this was kind of a wake-up call for us on the operating systems team because none of our Chef code had any sort of support for Windows. Sort of everything we were doing was really invested in CentOS 7.
07:21
So this was pretty far outside our comfort zone. So just a quick recap. About a year had passed since Instagram had started their migration. There were now three teams using Chef Solo. All of them were using sort of their own Chef code. At the time, there were about three different ways of managing Yum configs. So if we wanted to change some option on the Yum servers,
07:40
which we maintain on the operating systems team and they consume, we would need to go find where they had configured their Yum configs and change that. They were also using their own flavor of the tools. Two of the teams were using chefctl Solo. ProdChef was using regular chefctl. And then there was another team that was running it in a different way.
08:01
And one thing that was particularly painful for us was, whenever we wanted to make a Chef change, we would need to test it in three or four different ways because each team had their own testing workflow. This was a huge cost for us on the operating systems team because we need to be able to make and test Chef changes really quickly. And we just couldn't do that.
08:21
So we decided to invest in the Chef Solo toolchain. The first thing we asked was, what can we reuse from the production workflow and contribute to Chef Solo to make it better? chefctl is the most obvious choice. They had forked it to begin with, so there were already some common bits.
08:42
So the core bits of just running Chef and making sure that it runs successfully and logs someplace useful are really good. Chef Solo and ProdChef both need that functionality. But Chef Solo also needed to be able to do other things. In particular, it needed to be able to configure what options you ran Chef Client with,
09:01
most notably the local mode option to toggle Chef Solo. It also needed to be able to download a package before the Chef run. We couldn't do that within the Chef run because then you get this chicken-and-egg problem where Chef needs to install itself in order to run. So we rebuilt chefctl. At the time, when we started this, it was a Bash script.
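The two requirements above (configurable client options and a download step that happens before the Chef run starts) can be pictured as a small wrapper with a pre-run hook. This is only a sketch under stated assumptions: `ChefRunner`, `SoloPlugin`, and `fetch` are hypothetical names and do not reflect chefctl's real interface. The only real-world detail used is that `chef-client` accepts a `--local-mode` flag.

```python
import subprocess
import sys


class ChefRunner:
    """Runs a command, optionally letting a plugin act before the run (hypothetical)."""

    def __init__(self, command, plugin=None):
        self.command = list(command)
        self.plugin = plugin

    def run(self):
        if self.plugin is not None:
            # The hook runs outside of Chef, avoiding the chicken-and-egg problem.
            self.plugin.pre_run(self)
        result = subprocess.run(self.command, capture_output=True, text=True)
        return result.returncode, result.stdout


class SoloPlugin:
    """Fetches the cookbook package and switches the client to local mode (hypothetical)."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable that downloads and unpacks the package

    def pre_run(self, runner):
        self.fetch()
        runner.command.append("--local-mode")


if __name__ == "__main__":
    # A stand-in command is used instead of chef-client so the sketch runs anywhere.
    stand_in = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]
    runner = ChefRunner(stand_in, plugin=SoloPlugin(fetch=lambda: None))
    print(runner.run())
```

Keeping the Solo-specific behavior in a plugin lets the production workflow use the same runner with no plugin at all.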
09:21
We rewrote it in Ruby with a plugin model so that we could drop in a plugin file that Chef Solo would use to configure this extra functionality. We also really wanted to use Taste Tester. As I'd mentioned, testing costs were really significant for us. Taste Tester already has a mode where you run one Chef server,
09:43
and then use that one Chef server to test multiple different machines. We didn't really see a reason why we would need to change that to test Chef Solo. You still just use the one Chef server and then configure either prod or Chef Solo. It shouldn't matter. And we also wanted some way to configure a default run list
10:02
for all the environments that were running. This was something we saw was really beneficial for production because we could control sort of what the base OS was doing. And we wanted to be able to do that on all the places where we use Chef Solo. So we also knew that we couldn't use the Chef servers because at the time the Windows VMs did not have access to them.
10:24
They were running in a more isolated environment because this was sort of our first exploration of Windows and we didn't really trust Windows all that much. So in terms of distributing the code, we would use a package pretty much the same way that Instagram had been doing it with one modification. Rather than shipping all of the Chef code
10:41
for all environments in the package, we would ship just the Chef code that one environment needs and try not to include anything extra. This means that we don't have any dependencies on the Chef servers. It also means that it's really easy to ship this package into isolated environments like the Windows VMs. And we wrote a chefctl plugin,
11:02
after we had rewritten it with this plugin model, that would download the package before the Chef run and then unpack it and make it ready for the Chef run. This creates a problem, though. In order to build that package, you have to know what the run list is, and you need to know that in advance of a Chef run ever happening. So what we did was we provided a tool
11:22
that allows teams to request that a package be built. They provide us a run list and a target platform, and then we include the tier role, whatever the run list they give us is, plus a default role that depends on what their platform is. So the Windows VMs will get their own default run list
11:42
and CentOS 7 containers will get a different one. So, the outcome of all of this work: after about a year, we were able to onboard all three of the teams that were using Chef Solo onto this platform. We were also able to share Taste Tester and chefctl,
12:00
which were two of the most important tools from ProdChef, so ProdChef and Chef Solo were using the exact same code to test and run their Chef code, with some minor modifications. As we developed this toolchain, we came up with several new use cases that were able to use it without us really having to modify it.
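The package request described earlier (a team run list plus a target platform, combined with a platform-specific default role, shipping only the code that one environment needs) might be planned along these lines. Everything here is a hypothetical sketch: the role names, the role-to-cookbook mapping, and `plan_package` are illustrative, and putting the default role first is an assumption. A real tool would also need to resolve cookbook dependencies.

```python
# Hypothetical default roles per platform.
DEFAULT_ROLES = {
    "centos7": "role[default_centos7]",
    "windows": "role[default_windows]",
}

# Hypothetical mapping from roles to the cookbooks they need.
ROLE_COOKBOOKS = {
    "role[default_centos7]": ["systemd", "yum", "cron"],
    "role[default_windows]": ["windows_base"],
    "role[build_tools]": ["compilers", "yum"],
}


def plan_package(team_run_list, platform):
    """Return the full run list and the minimal set of cookbooks to ship."""
    run_list = [DEFAULT_ROLES[platform]] + list(team_run_list)
    cookbooks = set()
    for role in run_list:
        cookbooks.update(ROLE_COOKBOOKS[role])
    return {"run_list": run_list, "cookbooks": sorted(cookbooks)}


if __name__ == "__main__":
    print(plan_package(["role[build_tools]"], "centos7"))
```

Because the run list is fixed when the package is built, the host never needs to contact a Chef server to find out what to run.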
12:21
For example, employee laptops are now starting to use this model of Chef Solo to run Chef. They had been doing something closer to how ProdChef works in the past, but this fit really well for them because it's much easier to ship a package onto a laptop than it is to expose TCP through firewalls.
12:42
We also use this to manage our phones in our data centers by running Chef Solo on the Linux machines that are connected to the phones and then using the USB connection to twiddle the bits on the phone. So this really reduced the maintenance burden for us on the OS team.
13:02
When people came to us to ask for best practices or help with developing their own Chef code, it was much easier for us to help them now that we were using this common platform. Before, when people were using their own tools, it was really difficult for us to provide people advice because we didn't know what they were doing. And it was much easier to test Chef changes.
13:23
This made it way easier for us to maintain pieces of common infrastructure like systemd and yum and much easier for others to contribute to those pieces of infrastructure because one team wouldn't need to learn some alternative test workflow to test their changes for some other team.
13:40
So that's all. Any questions? I don't know if we have time for questions. Yeah, we are officially into the coffee break now, so we can have questions. Okay. Questions or coffee, I guess. I heard in a former talk that you also use Chef on rack switches.
14:02
How does that work? So Chef on rack switches is using the ProdChef workflow. A number of our cookbooks have to take into account that we can't restart them easily. So a bunch of our internal Chef code has special cases that say,
14:20
don't do things that would require a restart of the network or a restart of the host. But they run basically ProdChef. Is ProdChef going to live a long time or is the goal to eventually move to a homogeneous system? No. One of the reasons that we did this was that, now that we can share this toolchain for chefctl and Taste Tester between the two,
14:42
there's no reason to merge them further. ProdChef is really well-oiled as it is, and one of the goals of this project was to be able to support Chef Solo without impacting the ProdChef workflow. Thank you. Okay.
15:01
One more question. One more, yes. So you were mentioning this tool that you use to test cookbooks. How does that workflow actually work, when you start from your laptop and you want to modify a cookbook? For Taste Tester? Yes. So generally, most developers that work on Chef use a development server
15:20
that's in one of our Facebook racks. On that development server, they'll run the tool called Taste Tester, which is on GitHub, by the way. I'm sorry, I didn't have a link to it in the slides. That tool will spin up a Chef Zero server locally on their development server, and then it will SSH into a production server that they choose and
15:42
reconfigure the Chef config on that host to point at their dev server rather than a production server. Is there a dead man's switch there, after a month, or after...? Oh, yeah, yeah. So on every host in the fleet, we also run a five-minute cron job that will check whether a host is in test mode, and if it's gone past some expiration time,
16:02
it will reset it back to production. So we don't get hosts that are left alone in a test state. So thank you, Zeal.
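The expiry check described in the last answer can be sketched as a function that a periodic job would call. This is a simplified model under stated assumptions: `HostState`, `enforce_expiry`, and the server URL are hypothetical, and the real Taste Tester tool tracks test state in its own way.

```python
from dataclasses import dataclass
from typing import Optional

PROD_SERVER = "https://chef.prod.example"  # hypothetical production server URL


@dataclass
class HostState:
    chef_server: str
    # Epoch seconds after which test mode expires; None means production mode.
    test_expires_at: Optional[float] = None


def enforce_expiry(state, now):
    """Return the state the host should have after the periodic check."""
    if state.test_expires_at is not None and now >= state.test_expires_at:
        # Test window elapsed: point the host back at production.
        return HostState(chef_server=PROD_SERVER)
    return state


if __name__ == "__main__":
    testing = HostState("https://dev.example", test_expires_at=1000.0)
    print(enforce_expiry(testing, now=999.0))   # still in test mode
    print(enforce_expiry(testing, now=1000.0))  # reset to production
```

Running a check like this on every host, rather than relying on the developer to clean up, is what keeps hosts from lingering in a test state.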