System containers at scale
Formal Metadata

Title: System containers at scale
Title of Series: FOSDEM 2020 (talk 233 of 490)
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/47371 (DOI)
Language: English
Transcript: English (auto-generated)
00:06
Our next speaker is Stéphane Graber. He is the LXD project leader and one of the maintainers and core developers, and he's going to talk about system containers at scale. Hello.
00:21
All right, so, let's first talk about system containers, LXD, and clustering. So, what are system containers? Well, system containers are effectively the oldest type of containers. They've been around for quite a while, originating with BSD jails, then Linux-VServer on Linux,
00:44
which was a patch set from over a decade ago, then Solaris Zones obviously, then OpenVZ on Linux, which was also an insanely large patch set on top of the Linux kernel. And then, as we were upstreaming things into the Linux kernel, LXC and LXD kind of came
01:05
as the runtime to drive that. System containers behave very much like standard systems: you run a full Linux distribution, it's not the single-process type of thing that you get with Docker and other application containers. You don't need specialized software images or anything, you treat them just like virtual
01:24
machines, effectively. They're very low overhead, easy to manage, you can run thousands of them on a system; there's no virtualization overhead eating up physical resources, and no need for hardware acceleration or anything. And, as far as the host is concerned, it's just a bunch of processes, so you can go
01:41
on the host and see all your processes, it's nice and easy. Now, what's LXD? So, LXD is a modern system container manager. It's written in Go, and it uses LXC to drive containers. It's got a REST API and a bunch of REST API clients, and as you can see here, you can have multiple hosts running Linux, then the LXC layer to drive the kernel, and then
02:05
LXD on top exposes the REST API, and then you've got a number of clients that can talk to that. More of what's LXD? So, LXD is designed to be simple: it's a very clean command-line interface, a pretty simple REST API, and we've got bindings in a lot of languages to make it easy for people to drive system
02:22
containers through LXD. It is very fast; it uses images, so you're no longer creating a root filesystem with debootstrap or whatever. It's got optimized storage and migration over the network, and it's got direct hardware access because they're containers, so we've got nice semantics to pass GPUs, USB devices, et cetera.
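As a concrete sketch of the CLI workflow being described, launching a container and passing host devices through looks roughly like this (the image alias, container name, device names, and vendor ID are all illustrative):

```shell
# Launch a container from one of the daily-built images
lxc launch images:ubuntu/20.04 c1

# Pass the host's GPU straight through to the container
lxc config device add c1 gpu0 gpu

# Pass a specific USB device through, matched by vendor ID
lxc config device add c1 cam usb vendorid=046d
```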
02:43
It's secure, so we use all of the kernel namespaces by default. We also use LSMs like AppArmor, we use seccomp, we use capabilities, we use pretty much everything that's at our disposal to make it safe. It's scalable, and that's what we'll see most in this talk. It can go from just a single container on a laptop to tens of thousands of containers
03:04
running in a cluster. As far as what we can run on top of LXD, we've got a lot of images that are generated daily for all of those distros, plus a few more that literally couldn't fit on the slide anymore. So we build for about 18 different distros, about 77 different releases all combined,
03:24
which ends up being over 300 images we build every day that people can use to run on LXD. You can also build your own, but we've got a lot of them. LXD is effectively on Chromebooks, so if you've seen that Linux feature on Chromebooks, it gets you a Debian shell that's using LXD.
03:42
So we've got a decent user base through that, and the feature includes integration for snapshots, backups, file transfer, GUI access, GPU access, sound card access, webcam access. They really went with it on the Chromebooks. Another place where LXD is used right now is on Travis CI, so if you run any job
04:02
on Travis that is not Intel 64-bit, so if you use ARM64, if you use PowerPC, if you use IBM Z, all of those platforms are using LXD containers, with an extremely quick start-up time of usually less than two seconds, running on shared systems with all of the security in place; some of the syscall interception stuff that Christian demoed
04:23
has been done partly as part of that. Now for the LXD components; I'll go through this quickly. These are kind of the main things we've got in our API. Clustering is what we'll demo today. I said we're image-based, so we've got images, and then image aliases to have nice names
04:42
on images. We've got instances, so those are containers, but these days they can also be virtual machines; that's a new thing we added a few weeks back. We've got snapshots and backups for instances. We've got network management to create new network bridges that you can use for your instances. We've got projects that get you your own individual view on a shared LXD.
05:05
There's no more conflict with container names or any of that, so long as they're in different projects. We've got storage with a variety of storage drivers we support, and you can create custom volumes and snapshot those and all that. Some internal bits are mostly to get notified
05:23
when something happens on LXD, or for access control. We support doing file transfers and spawning applications directly in containers and virtual machines, accessing the console, and publishing containers to images. Now for the main topic of this talk.
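The projects feature mentioned above can be sketched like this; the project name is made up for illustration:

```shell
# Create a project and switch the client to it
lxc project create demo
lxc project switch demo

# Instance names only need to be unique within a project, so a
# second "c1" can coexist with one in the default project
lxc launch images:alpine/3.12 c1
```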
05:41
LXD has had clustering support for about two years now. It works, and it's really built into LXD; there are no external dependencies. It works on LXD 3.0 or higher. An installation can just be turned into a cluster member, and you can easily join an installation into the cluster.
06:01
There's really no external component you need for any of that. It works using the same API as you have for a single node. It's got a few more bits you can use through the API to say, I actually want something to be specific to this machine. But if your client is not aware of clustering and it just throws things at LXD
06:21
exactly like if it was a standalone node, things will just work. The cluster will just balance things for you, and you'll never know that you're even talking to a cluster. And it can scale quite nicely. So we can run containers on dozens of nodes; we've actually run clusters of 50 to 100 nodes, and they still mostly work. And each of those can run hundreds to thousands
06:42
of containers. So, very high density, depending on what you're running. We've also added, and that's very recent, it was a few weeks ago, support for mixed architectures. So you can have cluster nodes of different architectures, and when you ask for a particular image to be used to create a container or virtual machine,
07:03
it will just pick whatever node is capable of running that given architecture. All right, now for an interesting part of this. Let's see how that works. Okay, so for this, I've got three systems.
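The transparent scheduling just described shows up in the CLI as an optional `--target` flag; everything else behaves as on a standalone node. Node names and image aliases here are illustrative:

```shell
# List cluster members and their status
lxc cluster list

# Let the cluster pick the least busy capable node
lxc launch images:debian/10 web1

# Or pin an instance to a specific cluster member
lxc launch images:debian/10 web2 --target node2
```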
07:21
Actually, I need to connect to a third one. Okay, now it's connecting. So, LXD is installed; that's LXD 3.20 that we released two days ago. Let's just configure the first node. So, do you want to set up a cluster?
07:41
Let's go with yes. Let's enter its IP, because the link-local one is not gonna be fun enough. This one. We're not joining an existing cluster because we want to build a new one. Yeah, let's set the password. Let's configure some storage. So let's go Btrfs.
08:02
Create Btrfs. That's fine. Was there anything special to do on this one? Yeah, okay, that one is a bit different. So we just need to tell it what's the shared subnet for all of those. So that's my subnet at home.
08:22
Okay. All right, so right now you've got a single LXD that's part of a cluster, but it's the only one in there. So you can see. Now let's go on to the next one and repeat this thing. It's gonna ask fewer questions because it's just joining.
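The interactive `lxd init` walkthrough being performed here can also be scripted with a preseed file, which is handy when bootstrapping many nodes; the address, password, and server name below are all illustrative values:

```shell
# Bootstrap the first cluster member non-interactively
cat <<EOF | lxd init --preseed
config:
  core.https_address: 10.0.0.10:8443
  core.trust_password: secret
cluster:
  server_name: node1
  enabled: true
EOF
```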
08:40
So it wants clustering. Its IP address is right, I believe. Joining an existing cluster is yes. And the other node was on 1646. Okay, so it's asking for the password entered earlier. Yes, everything's gonna go away when we're joining. Size, we don't care. Yes, so we don't care.
09:06
Come on... okay, so now we're joined. We should see that we've got two nodes, and things still work. Now, to make things slightly more interesting: those systems were Intel x86, nothing super special, Xeon CPUs.
09:21
Now we've got one that is not Intel x86. So this one is running ARM64. Same thing, lxd init, cluster, it's fine. This is wrong, I forgot to connect that. So it's actually a nested container
09:40
because I didn't have a spare ARM64 system, so I'm just using LXD nesting for that one. But I connected it to the wrong network, so I'm just fixing that. Okay, so there it is again. The IP should be right now. Okay, so clustering, yes, the name is correct, the IP is correct this time.
10:00
Joining an existing cluster is yes. And we said it's 1646. Cluster password. Yes, we're cool with that; size, we don't care about. Because it's a nested container, it can't create a loop device, so I need to actually tell it where the storage is. And that should be the end of that.
10:26
Okay, let's go back to one of the x86 nodes. So now if I list the cluster, we see we've got three, and one of them is ARM64 instead of Intel. Now let's throw some stuff at it. So, just create a container called C1.
10:41
For this one, I didn't specify what architecture I actually want, so it's gonna kind of surprise me. LXD will pick whichever it considers to be the least busy server and just schedule the container there. So it's probably gonna be on one of the x86 ones. Yeah, it's on Edfa, which is one of the x86 servers. Okay, now let's do another one.
11:02
Let's do CentOS. I think, my guess would be, it's gonna go on Notero, which is the other x86 system. And then the third one would most likely be scheduled on ARM. Let's see. Yep. It doesn't have an IP yet,
11:21
but that's gonna fix itself, there we go. And let's do Alpine. That name is already taken. Oh no, they're just out of order, never mind. Okay, so C2, and that's gonna be on ARM. Okay, so now if I go on there. So, from... I keep forgetting that Alpine doesn't have a bash.
11:42
There we go. So yeah, I'm just executing a command in there, and we can see it's running on ARM64. So LXD is doing all the API forwarding for us. I'm on one machine, talking to the cluster, and it just goes to the right node and kind of queries it there. The other thing that's somewhat interesting
12:01
is we've got a tool to convert a system into a container. So that's what we've got here. That VM01 is a CentOS 7 VM that's just doing nothing, but it's there. We've got a tool called lxd-p2c that can take the address of the cluster,
12:22
will ask for the same password we set, and will then transfer the entire thing over the network into LXD, creating a new container for you. That's the entire content of the system. That's gonna take a little while, so I'm just gonna let it run. While that's going on, I wanna show the new, cool thing we've added.
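The lxd-p2c invocation being run here looks roughly like this; the target URL and container name match the demo setup, so treat them as illustrative:

```shell
# Run on the machine being converted: copies its root filesystem
# over the network into a new container on the target LXD
lxd-p2c https://10.0.0.10:8443 vm01 /
```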
12:44
So, all of the LXD networking, storage, and configuration bits: because our containers act so much like virtual machines, the same concepts really apply to actual virtual machines. So we figured, well, why not just allow running virtual machines as well, using the exact same storage and configuration?
13:00
So that's what we've got. I can do launch, and notice I've got just one extra thing at the end, just dash-dash VM. That's pretty much the only difference. In this case, I don't want it to go on ARM, because since that ARM host is a container inside a VM on ARM, there's no way I can run a VM inside there.
13:20
But the x86 machines are running on physical hardware, so those will be just fine. We do support running VMs on ARM, but you need to run on the actual hardware, which is not the case here. The images are a bit larger because they're VM images, but it's still downloading, unpacking that, creating the storage volume, on Btrfs I think it was this time, yeah.
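The virtual-machine variant being demoed is the same launch command plus the `--vm` flag; the image alias and instance name are illustrative:

```shell
# Same image/launch workflow as containers, but as a virtual machine
lxc launch images:ubuntu/20.04 v1 --vm

# Attach to its text console (Ctrl+a q to detach)
lxc console v1
```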
13:42
And now if you do console... so, console works fine on containers too, just to show you. If I do C1, there we go. So on the container, you get attached to the console, and on a VM... what did I call it?
14:03
Yeah, not fun. Well, VM01 is the one I'm transferring in as a container. I would have expected console on V1 to actually function. Where did it go? Is it just because it's confused? Oh, okay, all right. So this is a bit picky.
14:21
So we see the same thing, the VM was booting; I attached a bit late. Let's just go back to this guy here. We can just launch a second one of them. Let's see if this one behaves properly. Creating V2, come on, you can do it.
14:42
Oh yeah, that one takes a tiny bit, because since it can be placed on any node within the cluster, and the node it picked doesn't have the image yet, it's doing an internal cluster transfer of the image. So it's not pulling it from the internet again, but it still needs to move it around. It's optimized: it uses Btrfs send/receive in this case,
15:00
it doesn't use rsync or anything. It's pretty optimized, but... So we can see we're in the bootloader, and then booting the VM. And lastly, just to show you that — I'm hoping that VM is done transferring, the CentOS thing — it is. So if we go here, we can start VM01,
15:20
which is a container that was created from the CentOS system, and there we go. All right, how am I doing on time?
15:40
Okay, we're two minutes behind, yeah. So LXD is available on, obviously, Linux, that's where we run, but we also have a Windows and macOS client, so that you can talk to a remote LXD, if you've got a Raspberry Pi or an Intel NUC or something you want to run LXD on. If you want to contribute to LXD, it's written in Go, it's fully translatable. We've got client libraries for a bunch of languages.
16:02
It's Apache 2.0 licensed, there's no copyright assignment or anything in there. We've got a good community you can work with, and we've usually got a bunch of smaller issues that are good starting points to contribute. That's it. If we've got some questions, we can take them now.
16:20
And we've got stickers towards the exit when you leave, if you want any of those. Questions? Sorry for speaking so fast, but as it turns out, 20 minutes is pretty short. Thank you very much. Two quick questions. What do you think about running Kubernetes
16:40
inside LXD containers? Yeah, so, you can do it either way around. You can run Kubernetes, especially things like API servers and stuff, inside LXD containers, no problem. You can even run the kubelet inside LXD containers, because we support nesting and we support running Docker inside LXD containers,
17:01
so that's possible, and people have done it before. And you can actually do it the other way around as well: LXE is a community project that implements a CRI for Kubernetes that then drives LXD containers. So you can kind of do it either way, but yeah, it's possible. And then the question, like,
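Running Docker or a kubelet inside an LXD container relies on the nesting support mentioned here, which is a per-container configuration key; the container name is illustrative:

```shell
# Allow the container to run nested containers (Docker, LXD, ...)
lxc config set k8s-worker security.nesting true
lxc restart k8s-worker
```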
17:21
why do Kata Containers exist if LXD is secure? Why what? Why do Kata Containers exist, if LXD is a secure system container? It always... so, that depends on people, depends on what they trust. We've seen that hardware is not particularly safe either. Some people think that relying on the Linux kernel
17:41
for the entirety of the security story is quite fine. Some people think that VMs are the only option. Some people think that you need both. In fact, on Chromebooks, Google is on purpose using both. So they're using a virtual machine layer and then running only unprivileged containers inside there, so that if the kernel is busted, you're still in a VM. If the VM is busted,
18:00
you're still an unprivileged user in a user namespace, because we've seen exploits against both in the past. Recently, we've actually seen more CVEs and security issues around both the hardware side of virtualization and some of the hypervisor stuff than we have against the Linux kernel, as far as escapes from containers go. But, I mean, there's always a risk,
18:20
and that's kinda up to you, what's fine with you and what's not. Combining both is also the slowest option, but it's there, and some people have done it. Okay, two short questions. First, do you support foreign containers on the host,
18:41
I mean, running ARM containers on x64 or something like this? Sorry, can you repeat that? Foreign containers. Oh, yeah, okay. So, architecture emulation on the system, effectively. Like ARM on x86, yeah. So we did that in the past. I did implement support for that in LXC
19:02
almost, I don't know, five, six, seven years ago, using QEMU user static. It is possible. It is not pleasant, and it's not something we want to ever have to support again. The main issue being that the QEMU user static layer cannot properly handle threads or netlink
19:20
and some other things like that. So we had to do a very, very weird container where most of the binaries were indeed ARM, but the init system and the network tools were x86, which works, but is really, really weird. And as soon as you start doing updates and stuff against those containers, they quickly get
19:41
into a really weird state. So, not something we're particularly keen on revisiting at this point. It is possible, you can make it work. You could create a custom image that bundles QEMU user static at the right location, and with the right binfmt configuration, LXC will let you do it, and it will just work. But not something we want to support. Actually, just a bind mount from the host system works.
20:03
I just wanted to see a native implementation. But anyway, second question. You talked about clustering: what about roaming cluster nodes? If you move your notebook, for example, with a node, somewhere else. Yeah, so that part is kind of tricky.
20:22
So LXD has move support to move containers around. That works fine, but usually you want to stop them first. If they're running, then you need the CRIU stuff that Adrian talked about earlier, which can work in some cases but also tends to fail with a lot of modern software. As far as the storage bits, one thing that's interesting
20:41
is that LXD does support Ceph as a storage driver. And so if your container is backed by Ceph, at least if a node goes down, you can always start the containers back up anywhere else you want, because the data is on the network. So that's kind of what we have there. And I think we're out of time. Yep, we're out of time. Thanks very much. Thank you.
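The Ceph-backed setup described here can be sketched as follows; it assumes an existing Ceph cluster reachable from every LXD node, and the pool and instance names are illustrative:

```shell
# Create a Ceph-backed storage pool (on an existing Ceph cluster)
lxc storage create remote ceph ceph.osd.pool_name=lxd

# Instances on this pool keep their data on the network, so if a
# node goes down they can be started back up on another member
lxc launch images:centos/7 db1 --storage remote
```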