Container Run-times and Fun-times
Formal Metadata
Title: Container Run-times and Fun-times
Title of Series: All Systems Go! 2018 (Part 46 of 50)
Number of Parts: 50
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/43092 (DOI)
Transcript: English (auto-generated)
00:06
OK, welcome, everybody. My name is Lindsay Salisbury. I work for Facebook. I'm going to talk today about container runtimes and fun times, this complicated part of the infrastructure. Specifically, I'm going to talk about systemd inside
00:22
containers and how we are transitioning to that model. But first, I want to give you a little bit of background on what Tupperware is. Tupperware is our custom-built container system.
00:41
We started this around 2011 or so, before the days of Docker, before containers were an industry-standard thing. There's a bunch of custom stuff that we had to do in order to manage our infrastructure. It's pretty big. It runs pretty much most of Facebook. There's a handful of places in the infrastructure
01:02
where it doesn't run, but for the most part, almost all of the important stuff runs on top of Tupperware. We run millions of containers, and billions of containers have been created and destroyed using the technologies I'm going to talk about, the components that are part of the system. We haven't talked a whole lot publicly about Tupperware
01:21
in recent years. We're going to start doing more of that. I've given a few talks about some of these areas, specifically about the container runtime, because the space is pretty interesting. We get a lot of discussions around various technologies that we use and various technologies that other people use. I think the cool part about this tech
01:41
is that we haven't figured out and solved all these problems yet. There was a talk earlier today where they showed the big container space in the industry, and it's just a big eye chart with a bunch of different stuff in there. So there are a lot of different options for ways to do stuff. I want to share and talk about how we do some of these things.
02:01
Specifically, I work on a team called Tupperware Agent. This is the host-level runtime. This is the thing that actually sits on each machine and does the work of setting up and tearing down all the container components. This would be like rkt or the Docker part of the world. It's basically an interface to the kernel and a process manager; that's the simplest sort of way
02:21
to think about this thing. We try to write this in such a way that it just does what it's told. It's not trying to make decisions. It's being told what to do, and then our job on the agent is essentially just to execute those operations as they are given to us. This allows us to create a relatively stable system and also sort of understand the complexity
02:40
of the behaviors that we have inside as we set up containers and as we tear them down. It also allows us to create some pretty clean abstractions and some pretty clean testable components that we can put together. So I want to talk a little bit about the components that are used, and these are sort of more of the conceptual components.
03:01
There are specifics, but there are a couple of high-level pillars that we're interested in discussing. And the first one obviously is Linux. This is the kernel. This is a hardware interface, essentially, right? The Linux kernel runs on the machine and gives you access to the actual physical hardware that you're running on. It gives us a process abstraction. This is the thing that's running work, right?
03:22
This isn't rocket science. And then it gives us namespaces. We mostly use PID and mount namespaces. We don't use network namespaces. There was a talk earlier about the Katmon D stuff and the fact that we don't use user namespaces and network namespaces. PID and mount are the big ones that we worry about. Those are what most container infrastructures
03:41
are sort of trying to do: isolate processes and mounts for file systems. cgroup v2 is another big pillar component. This is a hierarchical abstraction, so it allows us to have resource management in a nested tree. This gives us a pretty clean abstraction
04:00
for managing groups of processes. In the agent, we mostly think about things in cgroups now. We don't really think about stuff as individual processes or individual components; we think about them as large, black-box chunks of processes. And I'll get into a little bit more detail about why that is in a little bit.
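To make the cgroup v2 picture a bit more concrete, here is a minimal Python sketch of the kind of operations involved: create a group in the unified hierarchy, put a limit on it, and migrate a process into it. The path and names are illustrative assumptions, not Tupperware's actual code, and it needs root and a mounted cgroup2 hierarchy to run.

    # Minimal sketch: cgroup v2 group creation, one resource limit, and
    # process migration. Assumes the unified hierarchy is at /sys/fs/cgroup.
    import os
    import pathlib

    CGROUP_ROOT = pathlib.Path("/sys/fs/cgroup")

    def create_container_cgroup(name: str, memory_max_bytes: int) -> pathlib.Path:
        cg = CGROUP_ROOT / name
        cg.mkdir(exist_ok=True)                                 # mkdir creates the cgroup
        (cg / "memory.max").write_text(str(memory_max_bytes))   # limit applies to the whole group
        return cg

    def move_pid_into_cgroup(cg: pathlib.Path, pid: int) -> None:
        # Writing a PID into cgroup.procs migrates that process; its future
        # children are created inside the same group.
        (cg / "cgroup.procs").write_text(str(pid))

    if __name__ == "__main__":
        cg = create_container_cgroup("demo-container", 512 * 1024 * 1024)
        move_pid_into_cgroup(cg, os.getpid())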
04:22
And then we also use Btrfs as the file system, pretty much across the majority of the fleet. We support both loopback images and on-host subvolumes. There are some subtleties to the benefits of one or the other that I could get into, but pretty much the rule of thumb
04:42
is that loopbacks are useful for shipping things as atomic units. We don't have to care about what's inside that thing. We can just ship the whole file, mount it as a loopback, and then operate on it. The host subvolume is interesting because it allows us to receive Btrfs send streams that come from the build process
05:00
that gives us basically not a full file system; it's just a stream that allows the file system to be reconstructed as a subvolume on the host. This has some interesting implications with IO as we're applying those subvolumes and also with cleanup and deletion mechanisms, which I won't get into in this talk, but if you want to ask questions, you can. So we can layer images as subvolumes.
05:22
So this is a subvolume of a base file system, and then we take a snapshot and we make a bunch of changes, and then we can ship the base and the new image subvolume below that independently of each other, or we can ship it together as one thing. It gives us a lot of flexibility in how we manage the deployment of file systems and how we think about
05:42
sort of sticking those file system pieces together. So when a container is started, it gets a read-only subvolume that is a snapshot of the original image that was deployed. This gives us the ability to have one subvolume that all containers can inherit from. We create snapshots and we do modifications to those,
06:01
and we aren't actually modifying the original volume of the original image. These are the read-write ephemeral snapshots (there's a rough sketch of this snapshot flow below). And then the other big pillar component we use across the fleet is systemd. On the host, this is the process manager: the agent executes commands through systemd for setup and teardown. We use the D-Bus API for that now.
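Here is the promised rough sketch of that subvolume flow: a base image arrives as a Btrfs send stream or subvolume, each container gets a read-write ephemeral snapshot of it, and the base is never modified. It shells out to btrfs-progs; the paths are hypothetical and this is not the actual agent code.

    # Sketch of the Btrfs image/snapshot flow (hypothetical paths).
    import subprocess

    IMAGES = "/var/lib/containers/images"
    BASE = f"{IMAGES}/base-image"            # read-only base subvolume

    def receive_image(stream_path: str) -> None:
        # Reconstruct a subvolume from a `btrfs send` stream produced by the build.
        with open(stream_path, "rb") as stream:
            subprocess.run(["btrfs", "receive", IMAGES], stdin=stream, check=True)

    def create_container_fs(container_id: str) -> str:
        # Read-write ephemeral snapshot for one container.
        target = f"/var/lib/containers/run/{container_id}"
        subprocess.run(["btrfs", "subvolume", "snapshot", BASE, target], check=True)
        return target

    def destroy_container_fs(container_id: str) -> None:
        subprocess.run(
            ["btrfs", "subvolume", "delete", f"/var/lib/containers/run/{container_id}"],
            check=True)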
06:22
We're starting to expand that infrastructure more. And we're basically using this to replace bash and spawn invocations. We don't really want to have to go out and shell out to stuff. We'd rather just have systemd do the work for us. We send a D-Bus request across and it launches a process, and we can look at and interrogate systemd
06:41
to determine what the status of that thing is. We also use systemd and spawn for butterfs image building. So this allows us to execute commands in a controlled environment in a sort of super trut, or a name-spaced trut, and do a bunch of operations. We get sort of like a hermetic environment
07:01
that we can use to make sure that the image build works as we expect. So that's on the host. Inside the container, we basically care about one thing, and that is a process manager. When we start containers, we start PID 1, essentially. We start /sbin/init. And we use a process manager for a number of reasons,
07:22
because we can run multiple container workloads and multiple container processes. We have services that want to run 30 different things, or they want to run one thing. It gives us the ability to group workload services together, but it also gives us the ability to run infrastructure services. So we have a whole bunch of stuff that runs inside the containers that we provide as a platform for the other teams
07:44
that are running services: our SSH, our syslog, debug tools, the ability to do security stuff, getting security tokens and things like that. This is stuff that has to run next to every workload, because we need it in order to interface
08:00
with the system appropriately. And the benefit of this is that, from the agent side, from the runtime side, we just have to worry about one thing. We just start a process, PID 1, inside of a cgroup, and we can monitor the cgroup as a whole. We can monitor PID 1, and we assume that the process manager inside that container is doing all the right stuff to manage the processes that are running inside there.
08:22
So it sort of reinforces the abstraction of a workload, right? We have this big thing that is sort of a black box to us. We don't really care what's inside of it so much. We get a request to start it, we spin the thing up, and we monitor and make sure that that PID 1 process is still running, and we have some additional things to introspect and look inside that.
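As a small illustration of treating the workload as a black box, here is a sketch of what monitoring the container's cgroup as a whole can look like with the cgroup v2 interface: check whether the group still contains any processes and whether the PID 1 we launched is still among them. The path is a made-up example, not the agent's real health checking.

    # Sketch: monitor a container by looking at its cgroup, not its processes.
    import pathlib

    def cgroup_is_populated(cg: pathlib.Path) -> bool:
        # cgroup v2 reports "populated 0|1" for the subtree in cgroup.events.
        for line in (cg / "cgroup.events").read_text().splitlines():
            key, _, value = line.partition(" ")
            if key == "populated":
                return value.strip() == "1"
        return False

    def pid1_alive(cg: pathlib.Path, pid1: int) -> bool:
        return str(pid1) in (cg / "cgroup.procs").read_text().split()

    # Example with a hypothetical path:
    # cg = pathlib.Path("/sys/fs/cgroup/workloads/job-1234")
    # healthy = cgroup_is_populated(cg) and pid1_alive(cg, container_pid1)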
08:43
So for a long time, we haven't been using systemd. We've been using BusyBox as our init system inside containers. The way that we put services together is we have a lot of bash that gets munged together in a startup process. There's no real coordinated bring-up or shutdown
09:00
of these containers. We hope that people write good bash, and we hope that people write things in the right order, that they start the things that are needed before the things that depend on them, but we actually don't have any way to enforce that those things happen correctly. There's a lot of poor signal handling. The way that we have been shutting down containers for a long time is we blast a signal out
09:22
to basically the entire process group, and we hope that they shut down in the right order. A lot of times they don't. We had to introduce this concept called a kill command, which lets people give us arbitrary bash snippets that will, hopefully, go kill the right thing. This is somewhat problematic because it's super difficult to debug.
09:41
It's super difficult to see what's going on inside there. There's really no way to test these things in isolation. And service composition gets super difficult because what we have to do is ensure that people are starting things in the right order, and we can't guarantee that with this current system. And maintenance is really hard, too.
10:01
If we want to add infrastructure components, we have to modify core parts of the system that are critical-path code, and that basically makes it super dangerous to start adding things into the container like that. So what do we really want? Well, what we really want out of a process manager
10:20
is we want an interface for interacting with kernel abstractions. We want to be able to talk to something that will set up processes and set up the services, set up the things that are running inside there cleanly, and basically do it in a predictable way. We need a contract for defining services. We need a way that we can allow our users to define how they want to run stuff.
10:42
We want orderly startup and shutdown so that we get things started in the right order. If we have to reorder something, we just change a config and we restart the thing and everything starts up properly. We want to be able to shut down cleanly so we don't have to just blast signals out and so that things turn off correctly. We want predictable behavior, obviously.
11:01
We want to be able to see that stuff works consistently each and every time. We don't want to have a bunch of weird random bugs because somebody introduced a tick mark in the bash script that didn't get escaped properly and ends up breaking a whole bunch of stuff. That's happened a number of times. And the other thing we really want to be able to do
11:21
is have flexible service composition. And what I really mean by that is we have a bunch of these infrastructure services that we need to put together. Certain jobs require certain services. Some jobs don't require them. Some jobs require a lot of them. Some jobs require a few of them. And we don't have a great way right now or we haven't had a great way to put these pieces together so that things behave predictably.
11:43
So enter systemd inside the container. So again, this is an interface for interacting with kernel abstractions. We really like this because it allows us to start processes with all the properties that we want via the D-Bus API. We get service definition with units.
12:01
So units give us basically a clean interface and a clean API that users can write to. They can define their services in a way that we can parse, we can understand, we can validate, we can test. We can ensure that everybody is doing it in a consistent and similar way. We can start processes that are configured properly.
12:20
So if we want to make sure that it's in the right slice, if we want to make sure that it has the right cgroup limits, we can do that nearly atomically, because we just make a D-Bus API call and when that thing returns, we have a process that we can start looking at and managing, and we can do this inside a container. We want to be able to manage dependencies properly: service B needs to start before service A,
12:40
which needs to start after we initialize the container. We do some security setup, we start SSH, and things like that. We want orderly and controlled container startup and shutdown. So again, this gives us the ability to make sure that when we start a container, we know exactly what the end state's gonna be. We can predictably debug it.
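To illustrate "start a process with the right slice and the right cgroup limits in one call", here is a hedged sketch using systemd-run; the unit name, slice, and limits are made-up examples. The talk describes doing the equivalent over the D-Bus API (the manager's StartTransientUnit method) rather than shelling out, but the resulting transient unit is the same idea.

    # Sketch: launch a workload as a transient systemd service with its
    # slice and resource limits configured up front. Names are examples.
    import subprocess

    def start_workload(unit_name: str, slice_name: str, argv: list[str]) -> None:
        subprocess.run(
            ["systemd-run",
             f"--unit={unit_name}",
             f"--slice={slice_name}",     # place it in the right spot in the cgroup tree
             "-p", "MemoryMax=1G",        # cgroup limits applied at start time
             "-p", "TasksMax=512",
             *argv],
            check=True)

    # start_workload("workload-web", "workloads.slice", ["/usr/bin/my-server", "--port", "8080"])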
13:01
We can actually understand what the ordering should be before we actually execute, and then compare it against the actual results. We want proper signal handling. We don't want to have to blast the signal out to everything. We want to be able to just tell PID 1 to shut down and have it do all the right work.
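On the shutdown side, here is a minimal sketch of "just tell PID 1 to shut down": systemd running as a container's init treats SIGRTMIN+3 as an orderly halt request (it starts halt.target), so the host side only has to send one signal and then wait for the cgroup to empty out, instead of blasting a signal at the whole process group.

    # Sketch: request an orderly shutdown from the container's systemd.
    import os
    import signal

    def request_container_shutdown(container_pid1: int) -> None:
        # systemd as PID 1 interprets SIGRTMIN+3 as "halt cleanly".
        os.kill(container_pid1, signal.SIGRTMIN + 3)
        # The caller can then watch the container cgroup's "populated" flag
        # in cgroup.events to know when everything has actually exited.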
13:20
Then there's consistent service definition. What this really means, separate from the unit file, is that we have the ability to have things defined in ways that are similar, so that if we have a bunch of services that have to run as infrastructure services in a container, we have a common pattern that we can use
13:41
so that we don't have a whole bunch of cognitive load from having to reparse and re-understand everything for each service. This sort of links into transferable knowledge. We use systemd on the host. So if we use systemd on the host and we use systemd in the container, when you get into a container, you don't have to relearn how all of this stuff works.
14:02
In the BusyBox world, it's a completely different infrastructure. So you're on a host, you learn how to use systemd on the host, and then you get into a container and it's just completely different and you have to write all new tools in order to interact with and monitor that thing. And another big part is testable services.
14:22
So we have a hard time testing services that are composed together with BusyBox because you don't actually know how the thing's gonna behave until you put it inside that BusyBox container and test it or try running it and see what breaks. And then you have to do that across maybe a large number of services or a large number of instances
14:40
in order to actually start seeing some issues. The nice thing about systemd and containers and with the unit files is that we get the ability to test those things separate from the container. We can actually test them outside the container, unit test it, make sure the thing works as we expect and when we put it into the container, it does what we want. And if it doesn't do what we want, we actually know exactly where we should go look
15:00
for this thing. So that leads into what we started doing with systemd services: a thing we call composable services. It's a cute little name for what's basically an application of the portable services concept. Each of these services is defined as an individual, self-contained component. They have their binaries, they have potentially all their dependencies,
15:21
and they have a service file that goes along with them. These things are bind-mounted into the container file system at a particular path. They're composed at runtime with systemd, so that we don't have to have a custom image build with that service inside of it and then deploy that thing separately. We can actually have one common runtime file system
15:43
or runtime container that we can deploy out to a bunch of different services or different jobs and then they take the services that they want enabled and they just sort of plug them into the side with the config in the spec and then the thing just gets turned on. So the cool part about that is we can update them independently.
16:02
We can update our base container runtime infrastructure or file system without having to incur the cost of users having to rebuild all of their services as well, and it works with a chroot or not. Portable services currently will create a chroot at the root of the portable service path, so all the dependencies are contained in there. That works really well for us for doing things
16:22
like modeling the way Docker images work or the way Docker containers work where all the services are lumped together. We also have cases where we don't have those dependencies because we statically compile most of our binaries so the only dependency we have is on our C++ runtime, our ELF runtime and so we just need to make sure that thing's in place
16:42
and we can run the thing without having to actually run it as a chroot. So there are some use cases here that are pretty interesting for us. We call these sidecars. This is for things like log handling, right? So when stuff gets written out to disk, you need to have something that goes through and compresses it or ships it off or puts it somewhere. There's lots of different cases here. We actually have a number of different log handling cases
17:02
that we have had requests for, and this infrastructure gives us the ability to satisfy those requests without having to build a bunch of new infrastructure. Compliance verification: we're a publicly traded company, so we have various regulatory requirements that we have to meet, and this gives us a way to actually do that in a way that is scalable.
17:21
Security, periodic cleanup, cleaning up temp files, cleaning up stuff that jobs download, and doing remote file system mounts: Gluster, NFS, any kind of thing like that, some kinds of FUSE file systems. But it also gives us the ability to do workload collocation, so users can actually compose their services together,
17:41
schedule a container to run on a particular job or on a particular machine, and they end up with three or four services running together inside the same process namespace, inside the same mount namespace. The difference from how that worked with BusyBox is that people had to basically write a whole bunch of nasty bash scripts
18:00
in order to make their stuff run together, so they'd spawn processes, put them in the background, and then when the container died, maybe the process didn't die, or they don't even know whether the process is running or not, because they have to roll their own process management. So how does a composable service work? We build a package with a binary and a unit file.
18:22
The agent provides metadata into the container about the service that should be invoked or included. We bind mount the package into the container at runtime, at a particular path, and we have a generator that enables the configured service to start at runtime. We do this early on in the start process,
18:41
and if we change the metadata or need fixes, we can daemon-reload systemd and it will re-enable the service in the right place. So here's a little snapshot of an experimental portable service I put together. This is what the file system looks like.
19:00
There's just a path that's mounted in. We have a meow binary, a systemd directory that has the service file and the timer, and an os-release file that has some specs and information in there. We have a unit file that can have everything you can do in units; it's wanted by the multi-user target. We're gonna support the ability
19:21
to have different kinds of targets that people can specify, so you have a very fixed startup, a very fixed ordering of when these services can be enabled. Yeah, and then the thing runs when you spin it up and it just does its thing. You can see that the unit file's loaded into a temporary file system location under /run.
19:42
One thing we really like about the way systemd sets this /run stuff up is that it gives us a very specific and explicit place where we know runtime data exists. So we try not to modify anything in /etc, because /etc is about what comes with the build, what comes with the image when it's built, and then /run is where we do all the stuff that's runtime-specific.
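Putting those pieces together, here is a hedged sketch of the composable-service mechanics described above: the host bind-mounts the self-contained package into the container at a fixed path, and a small systemd generator inside the container reads the metadata the agent dropped in and enables the configured units; a daemon-reload re-runs generators when the metadata changes. All paths, file names, and the metadata format are hypothetical, not the actual Tupperware layout.

    # Hypothetical layout and flow, for illustration only.
    import json
    import pathlib
    import subprocess
    import sys

    # Host side: bind-mount the service package into the container filesystem.
    def mount_service_package(container_root: str, package_dir: str, name: str) -> None:
        target = pathlib.Path(container_root) / "run" / "composable" / name
        target.mkdir(parents=True, exist_ok=True)
        subprocess.run(["mount", "--bind", package_dir, str(target)], check=True)

    # Container side: a systemd generator. systemd runs generators very early
    # and passes output directories as arguments; a symlink placed in
    # <normal_dir>/multi-user.target.wants/ effectively enables the unit.
    def generate(normal_dir: str) -> None:
        metadata = json.loads(pathlib.Path("/run/composable/metadata.json").read_text())
        wants = pathlib.Path(normal_dir) / "multi-user.target.wants"
        wants.mkdir(parents=True, exist_ok=True)
        for name in metadata.get("enabled_services", []):
            unit = pathlib.Path(f"/run/composable/{name}/systemd/{name}.service")
            link = wants / unit.name
            if not link.exists():
                link.symlink_to(unit)
        # `systemctl daemon-reload` re-runs generators, so changed metadata
        # gets picked up and the right units get re-enabled.

    if __name__ == "__main__":
        generate(sys.argv[1])   # systemd passes the normal generator dir first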
20:01
So there are a bunch of benefits here. It gives us this concept of service plugins, so that we can actually have a bunch of different teams writing different services that get pulled into running containers, where we as the Tupperware team don't necessarily have to manage and maintain all these sidecar services. We can outsource that to other teams that have more domain knowledge,
20:22
and we give them essentially like a clean abstraction like an SDK to build against. Gives us a predictable behavior. We can understand how the thing's working. We can understand how the startup works. We can understand what bind mounts are needed, what components are needed inside that composable service.
20:41
It gives us a defined contract that we can hold people to and also unit test and validate. We can lint these things and make sure that they're correct. And most importantly, it gives us the ability for these services to be testable outside of containers, in controlled hermetic environments. So why is all this important?
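As one concrete example of treating unit files as something you can lint and validate, here is a small sketch that runs systemd's own verifier over a set of units, the kind of check that could sit in CI; this is a generic check, not Facebook's actual linting pipeline.

    # Sketch: validate unit files with systemd's built-in verifier.
    # `systemd-analyze verify` parses the units and reports problems such as
    # unknown directives or references to missing units.
    import subprocess
    import sys

    def verify_units(unit_paths: list[str]) -> bool:
        return subprocess.run(["systemd-analyze", "verify", *unit_paths]).returncode == 0

    if __name__ == "__main__":
        sys.exit(0 if verify_units(sys.argv[1:]) else 1)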
21:01
We have a pretty big system with a lot of moving parts, so there are a lot of things changing constantly, and without clearly defined APIs and clearly defined contracts it's very, very hard to manage and maintain this thing as it grows and continues to grow. But we also have lots of people working on this thing. So we have a lot of code changes
21:20
and we have a lot of people that are sort of poking at these systems. And by having a clear definition of how you build a service that is self-contained and then composed into the container, again it gives us the ability to put the weight on other teams. They can control their own destiny and they can ensure that when they write something it's actually gonna work in production.
21:40
It makes it predictable so that we have a system that we can reason about, we can write tests on, we can make predictions about whether the system will fail in certain cases. And more importantly, it makes it fully supportable so that we can actually scale this thing out and add more and more services and sidecars and have different kind of implementations
22:01
that we as the one team that's running this particular infrastructure maybe haven't thought of. We have a lot of different people working on lots of different stuff. This gives us the ability to actually scale it out, not just technically but also on the human-organization side. Okay, that's it, thank you. Do you have any questions?
22:29
You've got a, yeah. So you mentioned that you have SSH running in the container, I think. Yes. How does that work given that SSH has a boatload of dependencies?
22:41
Are users required to provide the container configured in a certain way? SSH is configured based on how we as the infrastructure team and the security team decide to set it up, so users actually don't have to configure that particular component. But what about all the dependencies, like PAM and so on? That comes along with it. It's part of the base image build,
23:01
it's part of what we deploy out to all the containers. So it's just included by default. It's not a question, more of a comment.
23:23
I would just like to congratulate Facebook on making something reasonable. Because actually the whole rest of the world is stuck in this mantra of single process, single container.
23:40
And what it created is a pathology of, let's say, Kafka inside of Docker, when Kafka is not designed to work as PID 1. Actually, it's not designed to work at all, but that's a whole other discussion. And Docker is not designed to control that process.
24:01
So right now we are stuck with the situation where I have to ship Kafka in Docker to Kubernetes because C-level management demands it, because somehow Kubernetes right now bumps the stock price. And Facebook has done something reasonable.
24:23
Thank you. So along those lines, the interesting thing about these concepts, and I don't have time to dig into all the possibilities, but regarding the capabilities conversation that we had earlier, the interesting thing about the running-everything-as-root problem is that, using this approach
24:40
where we have a process manager inside that container, in its own process namespace, that has the ability to launch services and set capabilities on the various services that are executing inside those containers, we can actually make decisions about how we want to limit capabilities for the workloads that are in there. We don't have to make the blanket statement that nothing can run as root. We can actually run a whole bunch of stuff as root,
25:01
and we can run one or two things as a different user with a different capability set inside of that workload. And we give the users and our service owners the ability to choose that for themselves rather than sort of forcing, what is the phrase? Forcing a circle through a square,
25:21
I don't know what it is, yeah, something. Question, we have like one minute, so. How do you manage secrets inside applications? How do we manage secrets? So we have a process by which they get sent through a secure channel into the agent, and the agent writes them into a secure place in the container, and then they're basically
25:43
like only in RAM, they're not written to disk anywhere, that kind of thing. I think we're done, but can we take one more, maybe? Or two more. Could you talk a bit about any policy or static analysis that you do on the unit files themselves?
26:02
Or is that a free-for-all? Right now, at the current state of things, the possibility of doing static analysis and linting exists. There is some analysis and linting done on, I believe, unit files that exist on the host in some cases, but it just gives us the ability to do it.
26:21
We treat it like source code, right? So it's something that we can actually parse and look at and validate and start flagging things for. We haven't built a whole lot of infrastructure to do that yet, but again, this is about the possibilities that are available to us using this. Nope, okay, we're done. Sorry, ask me after. Cool, thank you guys.