State of the rkt container runtime - TIB AV-Portal

Chaos Computer Club e.V.

López Galeiras, Iago

Formal Metadata

Title

State of the rkt container runtime

Title of Series

All Systems Go! - The Userspace Linux Conference 2017

Number of Parts

47

Author

López Galeiras, Iago

Contributors

All Systems Go!

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/37948 (DOI)

Publisher

Chaos Computer Club e.V.

Release Date

Language

Producer

All Systems Go!

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

rkt is a modern container runtime, built for security, efficiency, and composability. It is one of the container runtimes supported by Kubernetes but the current implementation (“rktnetes”) doesn’t support the Container Runtime Interface (CRI). The work-in-progress CRI implementation is called rktlet. This presentation will give an update on the rkt project, what new features were implemented recently and what’s coming up. It will also give an update on the state of the rktlet: what features are missing and what workarounds should be removed before it becomes a complete implementation of the CRI.

Transcript: English(auto-generated)

00:06

Yeah, so in this talk, I'm going to talk about rkt. What's the state of rkt? Where are we? What's been new in rkt? And what's the future of rkt? And we'll also talk about rktlet, which is the part that integrates

00:22

rkt with Kubernetes. So first, some things about me. So hi, I'm Iago. I live in Berlin. And I'm one of the Kinvolk co-founders. You probably know Kinvolk by now because we are sort of organizing this conference. But for those of you who don't know us,

00:41

we are basically a for-hire Linux engineering team that focuses on core cloud infrastructure technologies. So yeah, one of those cloud infrastructure technologies is rkt. And we've been working a lot on that in the past. And yeah, we've also resumed this work recently.

01:01

And yeah, I'm currently a maintainer of rkt and rktlet. And yeah, there's my GitHub and my Twitter account if you want to find out more about me. So the plan for today is I'm going to give a brief explanation of rkt. So what is rkt? And then I will explain how it works, so some of the rkt internals.

01:22

I'll do the same for rktlet, what it is, how it works. I'll explain what's new and what are the new features we've been working on recently. And then I will explain what's missing. So yeah, we have a vision for the future. So rkt, very briefly, is a modern, secure,

01:41

composable container runtime. Modern, as in we try to take advantage of the latest technologies in Linux, specifically like user namespaces or overlayfs or whatnot. And secure, as in we try to be secure by default. So all the security options are enabled by default.

02:00

If you want to disable them, you have to do it explicitly. So you're conscious of what you're doing. And composable, because we try to play nice with the rest of the system. We have a good integration with systemd. It's not required, but we have a very good integration with systemd. And yeah, composable.

02:21

And ultimately, it's just an implementation of the appc spec, specifically the parts of image format and execution environment. Yeah. So appc is what we started when we started rkt to specify the image format and the execution environment. And now it's kind of not maintained,

02:41

but rkt still uses it heavily. So I will give a brief look at what appc is. So yeah, it's a standard application container specification. It's an open specification, and it defines some associated tooling. Yeah, there's some repos under the appc organization,

03:00

like acserver or docker2aci. CNI was part of the appc spec, but now it's moved to containernetworking since it's being used by a broader audience. But for our purposes, we're going to talk a bit about the spec. So it defines four things. Image format, which is what an application consists of.

03:21

So it's basically a tarball with some manifests associated with it that says the resource limits you want, the executable you want to execute. Then there's the image discovery part, which is how we translate an image name to an actual URL where the artifact of this image is. And for that, we take advantage of the DNS system.

03:41

So we have a translation from the name to a DNS namespace. Then it defines pods, which is how applications are grouped and run. And then it defines the executor, which is what the application is going to see when it's executed. So yeah, rkt is a simple CLI tool.
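As an illustration of the image discovery step, here is a minimal sketch of turning an image name into a candidate download URL. It is loosely modelled on the template style the appc spec describes; the exact template, the default `os`, `arch` and `ext` values, and the function name are assumptions made for this example, not something shown in the talk.

```python
def simple_discovery_url(name, version, os="linux", arch="amd64", ext="aci"):
    """Expand an image name into a candidate HTTPS URL.

    The template and the defaults are assumptions for illustration only.
    The host part of the image name doubles as the DNS name to contact.
    """
    return f"https://{name}-{version}-{os}-{arch}.{ext}"


# Hypothetical image name; example.com is a placeholder domain.
url = simple_discovery_url("example.com/reduce-worker", "1.0.0")
print(url)
```

The point is only that the image name itself carries the DNS name, so no central registry is needed to locate the artifact.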

04:00

So we don't have any daemons that are mandatory. It's written in Golang, and it's basically Linux only. And it is self-contained, as in we don't require any weird dependencies, just your typical dynamic libraries that you have installed in all the systems.

04:21

And it's init-system and distro agnostic, although, as I said, we optimize heavily for systemd. So yeah, being a simple CLI tool, it means that the applications run directly under the spawning process. So there's some diagram there. You can see that there's the spawning process, which

04:41

can be bash, or systemd, or runit, you name it. Then under that, there's rkt. And then under that, there's the pod and the two apps in this example. So you can see these stage 0, stage 1, stage 2. So the execution of rkt is divided in three stages. So basically, stage 0 is the CLI tool.

05:02

So it takes care of discovering, fetching images, and rendering them on the disk. Stage 1 is basically the actual container. So that sets up the exec environment for pods. And it manages the process's lifecycle. And it applies resource constraints, too. And stage 2 is the app itself, so whatever you're

05:21

running, nginx or any app. So one cool thing about stage 1 is that it's swappable. So that means there's different implementations of stage 1, which is pretty cool. So the default one uses Linux namespaces and cgroups for isolation, so what you call a Linux container. And it's based on systemd-nspawn and systemd.

05:42

So we basically run a systemd-nspawn container. And inside, we run a systemd instance that manages all the processes in the pod. There is also a KVM implementation, which uses hardware virtualization for isolation instead of containers. So that's useful when you want a bit more security. And it's based on QEMU nowadays, and also

06:02

systemd inside the pod. So the part that manages processes is shared between those stages. And then there's also fly, which is basically a no-isolation stage 1. It's just a chroot, and its purpose is to take advantage of rkt's image handling

06:22

and signing and all that stuff. There's some more, but these are the main ones. So I'm going to stop talking about rkt, and I'm going to introduce Kubernetes for those who don't know it. Yeah. So basically, Kubernetes is a container orchestration system.

06:41

So it gives a developer an API endpoint where it can define which containers the developer wants to run. And the developer doesn't have to care about where they run or how this is done. It just says, I want to run five instances of this container. And then Kubernetes takes care of putting those containers

07:01

in actual hosts. So you can see there that there's this funny icon on the top. That's etcd. And that's basically a distributed key value system that offers the cluster a full view of a cluster that's consistent among different hosts.

07:23

So that's a distributed primitive that's needed to implement Kubernetes. And those worker kubelet boxes are basically hosts. So on each host, a kubelet is running. And the kubelet is the one that is responsible for creating the actual containers inside the host.

07:40

So yeah. Then the kubelet basically nowadays implements an interface called the CRI. Well, not implements, but defines an interface called CRI. And CRI defines an interface where there are different methods that express the things that you

08:02

want to do with containers. So for example, the kubelet can say, RunPodSandbox. So a new sandbox will be started in a host. And the kubelet can say, OK, now create this container in that sandbox. And then start it, or stop it, or get the status, or yeah, all that stuff.
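The call sequence described above (sandbox first, then containers inside it) can be sketched as a toy in-memory runtime. The method names mirror CRI calls such as RunPodSandbox, CreateContainer, StartContainer, StopContainer and ContainerStatus, but the class itself, its ID scheme and its state strings are assumptions for illustration; a real shim receives these calls over gRPC.

```python
import itertools


class InMemoryRuntime:
    """Toy stand-in for a CRI shim; it only tracks state in dictionaries."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.sandboxes = {}   # sandbox id -> {"name": ..., "containers": [...]}
        self.containers = {}  # container id -> state string

    def run_pod_sandbox(self, name):
        sandbox_id = f"sandbox-{next(self._ids)}"
        self.sandboxes[sandbox_id] = {"name": name, "containers": []}
        return sandbox_id

    def create_container(self, sandbox_id, image):
        # A container can only be created inside an existing sandbox.
        container_id = f"container-{next(self._ids)}"
        self.sandboxes[sandbox_id]["containers"].append(container_id)
        self.containers[container_id] = "created"
        return container_id

    def start_container(self, container_id):
        self.containers[container_id] = "running"

    def stop_container(self, container_id):
        self.containers[container_id] = "exited"

    def container_status(self, container_id):
        return self.containers[container_id]


runtime = InMemoryRuntime()
sandbox = runtime.run_pod_sandbox("sock-shop-front-end")
container = runtime.create_container(sandbox, "nginx")
runtime.start_container(container)
print(runtime.container_status(container))
```

Because the sandbox and its containers are separate objects with separate lifecycles, a runtime behind this interface has to allow adding and removing containers after the sandbox already exists.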

08:21

So this is implemented via gRPC, which is a remote procedure call framework by Google. So the kubelet will call one of those methods. And there will be something on the other side listening and actually doing the actual operation. So these things are called shims. And yeah, there are several of them.

08:41

There's cri-containerd, which has recently been developed a lot by Google. There's CRI-O, which I think reached version 1.0 in the last week or two weeks. And then there's rktlet, which is an rkt implementation of the CRI. And yeah, it's bold because that's what I'm going to talk about.

09:00

So how does it work? Yeah, so you have the kubelet that connects through gRPC, as I said before, to rktlet. And basically, what rktlet does when the kubelet requests a new sandbox is call systemd-run. And it just starts a systemd unit that runs rkt, the rkt sandbox. And then rktlet communicates with rkt

09:24

by just running rkt commands. And it can ask rkt to add containers or remove containers or whatnot. So in order to implement this, we had to change the design of rkt a bit because it was basically an immutable design.

09:41

So if you start a pod, it will be this way. And that's it until you stop it and you start a new one. So we had to do some changes. And these changes are what we call the app experiment. So yeah, as I said before, the CRI operates in the sandbox and container terms, not just in pods.

10:01

So rkt had to be redesigned. And we added a new subcommand for that, which is the subcommand app. So you will have an rkt subcommand that matches each one of the methods of this CRI interface which I was talking about before. So with this, systemd was pretty helpful for us because this maps pretty neatly with what systemd does.

10:24

So for example, rkt app add will just add a new unit file to the pod and load it. And rkt app start will just call systemctl start, basically. And yeah, this is experimental because it's a pretty big change to rkt.

10:41

So for now, it's under an environment variable called RKT_EXPERIMENT_APP. So you have to specify that before calling rkt to use these features because if not, they will be hidden. Another thing that we had to deal with is the logs. So being that we run systemd inside the pod,
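To make the mapping between CRI calls and the app subcommand concrete, here is a small sketch that only builds the command line and environment without running anything. The `add` and `start` subcommands come from the talk; the `stop` and `rm` entries, the argument layout, the `--app=` flag, the `"true"` value for the environment variable, and the helper name are all assumptions for illustration.

```python
import os

# CRI call -> `rkt app` subcommand. Only "add" and "start" are named in the
# talk; the other two entries are assumptions.
CRI_TO_RKT_APP = {
    "CreateContainer": "add",
    "StartContainer": "start",
    "StopContainer": "stop",
    "RemoveContainer": "rm",
}


def rkt_app_command(action, pod_uuid, *args):
    """Return (argv, env) for one `rkt app` invocation; nothing is executed.

    The experimental subcommand is hidden unless the environment variable is
    set, so the returned environment always carries it.
    """
    env = dict(os.environ, RKT_EXPERIMENT_APP="true")
    argv = ["rkt", "app", action, pod_uuid, *args]
    return argv, env


# Hypothetical pod UUID and app name, for illustration only.
argv, env = rkt_app_command(CRI_TO_RKT_APP["StartContainer"], "0123abcd", "--app=nginx")
print(argv)
```

A shim built this way stays a thin translation layer: each incoming CRI request becomes one short-lived command, and systemd inside the pod does the actual process management.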

11:03

the natural way to handle logs is to just use the journal. And that's what we were doing. rkt was logging through journald. The problem is that the CRI interface wants logs in a custom text format. It doesn't want the journal format.

11:20

So the initial solution, a kind of hackish way to solve this, was to have basically another container on each pod that translates the journal logs to the CRI format. So that had some problems because it was, yeah, first of all, very resource consuming. So you don't want to have a sidecar container.

11:40

You just want to have the containers you want to run. And it was kind of unreliable. So sometimes the logs were not there and you would get weird error messages, which nobody likes. Moreover, we wanted to implement things like attach to the standard input, the standard output of a container, and journald doesn't have anything that lets us do that.

12:05

So the proper solution is what we call the attach experiment. So it's another experiment. And that uses a component called iottymux, and that allows us to write directly to the journal or to the CRI file format.

12:23

So this is a simplified version of how it works. So this is the rkt process tree. So you can see systemd-nspawn and then the actual container, which runs systemd inside. So basically every application will connect its standard input, standard output, and error

12:42

to this iottymux daemon. And then that will pass through the standard input and output to the journal, to an external process, or into a log file. So in this case only standard output and error.
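The fan-out idea behind such a multiplexer can be sketched in a few lines: whatever an app writes is copied to every sink that is currently registered, for example a log file and an attached client. This is only a toy model of the concept, not how iottymux is implemented; the class and method names are made up for the example.

```python
import io


class OutputMux:
    """Copies each chunk an app writes to every currently attached sink."""

    def __init__(self):
        self.sinks = []

    def attach(self, sink):
        self.sinks.append(sink)

    def detach(self, sink):
        self.sinks.remove(sink)

    def write(self, data):
        # Iterate over a copy so a sink may detach itself while being written.
        for sink in list(self.sinks):
            sink.write(data)


log_file = io.StringIO()   # stands in for the on-disk log file
client = io.StringIO()     # stands in for someone attaching later

mux = OutputMux()
mux.attach(log_file)
mux.write("starting\n")
mux.attach(client)         # the client only sees output from now on
mux.write("ready\n")

print(repr(log_file.getvalue()))
print(repr(client.getvalue()))
```

Having one component own the app's streams is what makes both logging and attach possible without an extra sidecar container in each pod.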

13:01

So this is the simplified version of the CRI format. It's just very simple, just a date, what stream it is, and then the message. So yeah, this allows us to attach and to write logs in the format directly without having a sidecar like we had before.
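A line in that simplified layout (date, stream, message) could be produced and parsed roughly as follows. The timestamp precision and the helper names are assumptions for illustration; the real format may carry additional fields.

```python
from datetime import datetime, timezone


def format_log_line(stream, message, when=None):
    """Build '<timestamp> <stream> <message>' with a UTC timestamp."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = when.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    return f"{stamp} {stream} {message}"


def parse_log_line(line):
    """Split a line back into (timestamp, stream, message).

    Only the first two spaces are separators, so the message itself may
    contain spaces.
    """
    stamp, stream, message = line.split(" ", 2)
    return stamp, stream, message


line = format_log_line(
    "stdout", "hello from the app",
    datetime(2017, 10, 21, 12, 0, 0, tzinfo=timezone.utc),
)
print(line)
print(parse_log_line(line))
```

Because the format is plain text with one record per line, a runtime can append to the file directly as output arrives.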

13:23

So basically those were the two most important things that we had to change for rkt to work well with Kubernetes. Yeah, I'm gonna talk about the recent developments, like in the last two, three months, in rkt. So first we added a feature that a client requested

13:43

to share the host IPC namespace because they wanted to do some operations on the IPC namespace, and you need to share it with the host so other apps can actually do that.

14:00

So Michael Lee Alban implemented that. And then we needed to do a lot of fixes to make rktlet possible. Yeah, a lot of bug fixes, especially with the experiments. Yeah, they're experimental, so they weren't really working well. And we needed to do some integration work between the attach experiment and Kubernetes because it needs to put the logs

14:23

in a specific path, and we needed to add some configuration options to do that. And another thing that we did is we switched the default KVM flavor from LKVM, which was not very reliable for us, to QEMU. So yeah, now it works much better. So yeah, what comes next in rkt?

14:42

One thing that I think Casey started but didn't finish is updating the CNI to version 0.6, because we're using 0.3 and there's a lot of new features in newer CNI versions that we need to update to. And we need to stabilize the rkt experiments

15:01

and remove all these environment variables. So we need to do a lot more tests and remove those so they can be used out of the box. Yeah, we experimented a bit with using runc to set up the stage 2 runtime. So what we were doing until now was using systemd unit files, and those provide

15:22

a lot of primitives to do a lot of resource limiting or yeah, mainly resource limiting. So runc gives a bit more flexibility though. And we started looking at using runc instead of, yeah, I think Casey over there was starting to work.

15:43

Yeah, so that would be pretty cool. And we did some experiments with casync, which for those of you who don't know that, it's a new project by Lennart Poettering. And yeah, I'll talk a bit more about that on the next slide. And also general bug fixing,

16:01

because there's always bugs, right? So yeah, the casync tests are explained here in this graph. So basically casync is a mixture between the rsync algorithm and a content-addressable store. And that allows you to distribute images,

16:22

for example, in a much more efficient way, because it will only distribute the parts that are missing. And so we wanted to test that with container images and see what the improvements were. So you can see that the blue line is basically just downloading everything.

16:41

And the red line is what casync transfers on the network. So you can see that there are some spikes and that's when a lot of things change in the image. So in that case, you have to resend everything basically. But overall, you can see an improvement. So that's very cool.
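The reason only the missing parts travel over the network can be shown with a small content-addressed chunk store. This sketch uses fixed-size chunks purely for simplicity, which is a deliberate swap: casync itself cuts chunks at content-defined boundaries, so it copes with insertions that would shift every fixed-size chunk. The chunk size, data and function name are made up for the example.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose so the example is easy to follow


def bytes_to_transfer(image, store, size=CHUNK_SIZE):
    """Add the chunks of `image` missing from `store`; return bytes sent.

    `store` maps a SHA-256 digest to chunk bytes, so identical chunks are
    kept and transferred only once.
    """
    sent = 0
    for offset in range(0, len(image), size):
        chunk = image[offset:offset + size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk
            sent += len(chunk)
    return sent


store = {}
version_1 = b"aaaabbbbccccdddd"
version_2 = b"aaaabbbbXXXXdddd"  # one chunk changed between versions

first = bytes_to_transfer(version_1, store)    # nothing cached yet
second = bytes_to_transfer(version_2, store)   # only the changed chunk
print(first, second)
```

When most chunks change between two image versions, the second transfer approaches the full image size, which matches the spikes mentioned above.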

17:01

There's the repo there where we did the experiments, where you can find scripts to reproduce them. So yeah, we'll probably look a bit more into that in the future. Yeah, so now about rktlet itself, what comes next? So we need to implement missing CRI features. For example, the attach thing

17:21

that iottymux makes possible is not really working as of now. We think it was not that important because it's usually used as a way to debug your container. But yeah, that needs to be implemented. We need to also implement this CRI container stats API.

17:41

So Kubernetes uses a thing called cAdvisor to get stats from the containers. That's not really nice code and it doesn't work as well as we hoped. So there's a new definition in the CRI for container stats. So individual runtimes can figure out a way

18:03

to get the stats and then send them back to Kubernetes. Performance is something we haven't really been looking at. But yeah, we think it's not such great performance. We see a lot of CPU usage, so we need to improve that. And as of now, rktlet only supports Kubernetes version 1.7

18:22

and we want to support 1.8 because that's the latest stable release. We need to improve error propagation. So when an rkt command fails, we don't give very nice error messages. Sometimes it's just an exit status. So that's something that, yeah, it's very high priority in our list. And we've been starting to run the conformance tests

18:43

for Kubernetes and CRI and there's some tests failing. So we need to investigate those. Okay, so yeah, I want to do a demo. So at Kinvolk, we developed this tool called kube-spawn. And what that does is it lets you

19:00

spawn multi-node clusters on your machine by using containers. So each container would be a node, basically. So instead of virtual machines, like it's usually done, like with Minikube, we use containers and yeah, that allows us to do tests more quickly and with less resource consumption.

19:21

So I have a rktlet-enabled Kubernetes cluster running on my machine. Had some troubles, yeah, some minutes ago. I was trying to make it run and it didn't work, but I think now it's working. So yeah, I'll share my screen.

19:41

There you go. Yeah. So is this size okay? Yep, okay. So yeah, let's see. So I have an application called microservices demo,

20:02

which is something that Weaveworks developed. It's just a very nice microservice example application. So there's a lot of containers that work together to bring you a sock shop. So it's yeah, a sock shop. So we can see that there's a lot of pods running

20:23

for carts, the database for the carts, a catalog, yeah, payment, a lot of stuff. And yeah, I can show cluster-info dump, grep container runtime, to convince you

20:40

that this is not running Docker. I can go to the machines because they're just containers. I can use machinectl shell and, for example, this one. And yeah, there's no Docker whatsoever. There is rkt, yeah, this, oops, sorry, list.

21:07

These are the images running. And yeah, so this is basically working and I can get a list of the services of the namespace of the sock shop.

21:21

And yeah, I can see that the front end has, it's a NodePort service. So that means that if I access any IP of the cluster on this port, I will get to the front end service. So let's just try that. So yeah, I can use directly the name of the container, which is pretty cool.

21:42

And yeah, this is the shop. Doesn't look so nice here on this small screen, but yeah. So you can see a catalog and yeah. So this is fully running in rkt with rktlet. And it's a pretty complicated application

22:01

so I think it's pretty cool. Yep, that is basically it. So yeah, thanks. And if you have any questions, please ask. I'll try to reply.

22:30

Are there any plans to directly support the OCI format, the open container image format, in rkt? So right now we support the OCI format

22:41

by being able to run Docker images and we do a conversion there to the ACI format that's rkt internal. And we would like to have OCI support, but I think nobody has had time yet to look at it. And people that are using rktlet at the moment,

23:00

they use ACI images, so it's not really a priority for us now. But yeah, that will be definitely very nice to have. I think there was a question over there, but I don't know. Ah, same, okay.

23:31

So without the attach, are there also, so about the things that are not implemented that integrate with Kube? Is there currently support for HPAs and things like that?

23:42

And when the custom metrics come out in 1.8 for HPAs, will it support that as well? Yeah, I don't know what HPA is. Can you elaborate a bit? It's the auto scaling. So like the auto scaling for memory and number, not just replicas, but like the memory limits and all of that stuff native to Kube.

24:02

In 1.8, there's gonna be custom metrics involved. So we see requests per minute increase and we can increase the number of processors or the amount of the upper limits and the initial start containers. Is rkt gonna support those as well? Yeah, I guess so. I mean, when we try to implement 1.8,

24:22

we will look at those issues and we'll try to support them which should be technically possible. Okay, thanks very much.