Container Live Migration
Formal Metadata
Number of Parts | 490
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/47510 (DOI)
FOSDEM 2020
293 / 490
Transcript: English (auto-generated)
00:05
Welcome to my talk about container live migration. My name is Adrian Reber. I work at Red Hat. I have been involved in process migration, which is the basis for container migration, for at least the last 10 years.
00:23
This is all based on CRIU, which I will give an introduction to here. I have been working on CRIU since at least 2012, and I have been focusing on container migration since 2015. Everything I'm talking about here has already been written down.
00:41
An article can be found here. I want to start with the definition of what I mean when I say container live migration, because people often ask me about the details.
01:04
So basically it's the idea of transferring a running container from one system to another. You could also say stateful migration: the process just continues to run from the same point in time at which you stopped it before the migration. The basic concept is that I serialize the process, or the whole container, on my source system somehow.
01:30
Then I transfer it to the destination system, and then I just restore it, restart it, and the container keeps on running from the same point in time at which I started the migration of the whole thing.
01:43
As already mentioned, this is all based on CRIU, Checkpoint/Restore In Userspace. There are multiple integrations of CRIU in different container runtimes. I will give an overview later of which container runtimes have CRIU support right now.
02:05
The main things I will demo will all be Podman based. So this is about the integration of CRIU into Podman and how to use it to live migrate containers.
02:20
So I want to give you some details about how CRIU works. The first step is that you have to checkpoint your processes. You have a container with multiple processes running inside it, and you tell CRIU: I want to checkpoint this container. You give it the PID of the first process in the container,
02:41
and it will stop and collect the information of all processes in the process tree. So all child processes are always checkpointed together with the first process. One possible way CRIU does this is by using ptrace to stop the processes.
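The idea of starting from the first PID and picking up every child process can be sketched in a few lines. This is only an illustration, not CRIU's code: the function name `collect_process_tree` and the `parent_of` mapping are hypothetical, and on a real Linux system the parent PIDs would have to be gathered from the proc file system.

```python
from collections import defaultdict


def collect_process_tree(root_pid, parent_of):
    """Return root_pid and all of its descendants, parents before children.

    parent_of maps each PID to its parent PID (hypothetical input for this
    sketch; a real tool would read this from the proc file system).
    """
    children = defaultdict(list)
    for pid, ppid in parent_of.items():
        children[ppid].append(pid)

    ordered = []
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        ordered.append(pid)
        # Push in reverse order so that lower PIDs are visited first.
        stack.extend(sorted(children[pid], reverse=True))
    return ordered


# Hypothetical container: PID 100 is the first process, PID 200 is unrelated.
parent_of = {100: 1, 101: 100, 102: 100, 103: 101, 200: 1}
tree = collect_process_tree(100, parent_of)
print(tree)
```

Every PID in the returned list would be stopped and dumped together; the unrelated PID 200 is not part of the tree and stays untouched.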
03:03
There is also the option of using the cgroup freezer to stop the processes. So CRIU stops the processes, collects all the information and writes it to disk. The tool is named CRIU, Checkpoint/Restore In Userspace, and there's a reason for the name, because before Checkpoint/Restore In Userspace was developed,
03:28
there were multiple other checkpoint/restore implementations for Linux, and they were not in user space in this way. They were either completely in kernel space or they were even more in user space, with syscall something, something.
03:51
So whatever. CRIU works in a different way. CRIU tries to use existing kernel interfaces as much as possible. So there's basically not one kernel interface added by CRIU which is only useful for checkpoint/restore.
04:11
The interfaces CRIU added to the kernel are, most of the time, there to get more information about the running process from the kernel. So for a lot of the things CRIU added, there are also other use cases which are using this new information,
04:28
which CRIU has added since 2012. Once CRIU has collected all the information from the proc file system, there's the next step, which is called the parasite code.
04:42
This is probably my favorite part of CRIU, because it's also the craziest if you go into the details. You wouldn't expect something like this when you start looking at a project like this; I wouldn't expect anybody to do something like this at all.
05:02
So the parasite code is injected into the running process. The process has been stopped, paused using ptrace, and now CRIU extracts some code out of the process using ptrace and replaces this code with the parasite code. Now that the parasite code is in there, CRIU restarts the process at the point of the parasite code.
05:23
The parasite code is running inside the address space of the process it wants to dump, and it's running as a kind of daemon: the parasite code connects to the main CRIU process, and the CRIU process can send commands to it to do things from within the address space of the running process.
05:45
One of the biggest things happening from inside the parasite code is dumping the memory of the process to disk, so that it can later be restored and the same memory contents are there as before checkpointing.
06:05
Although ptrace offers a way to extract memory out of the process, this used to be slow at the time the parasite code was written,
06:20
and if you're looking at migration times, the dumping of the memory is really fast, because most of the time for your migration will always be spent on the network transfer of the checkpoint image from one system to another. So the parasite code is used to write all this information
06:42
to disk, and once the parasite code is done, it's removed from the process. CRIU is now curing the process: the parasite code is removed and the original code which was there is copied back, so that if you want to continue to run your process,
07:04
it will just run without ever knowing that it was under the control of the parasite code. At this point, the checkpointing is basically finished: all the information has been written to disk, and in the case of migration, you would probably kill the process you checkpointed
07:21
so that it stops, but it can also continue to run, whatever you feel is best for your use case. What's also interesting about container live migration is that if you're running with Podman, you're probably running on a system with SELinux, and the combination of SELinux and CRIU is especially interesting.
07:42
I gave a talk at the Linux Security Summit about this, because CRIU does things which the SELinux policy is not really happy about, so you have to invest some additional time to let CRIU do the right things if it's running under SELinux control, but this is just too much for my time here today.
08:05
Once the checkpointing is finished, you come to the second step, the restoring of the process. First, CRIU reads all the checkpoint images to see what is there, and then CRIU basically creates a process for each process which used to be in the process tree,
08:25
and for each thread which used to be there. There was a talk I gave at the Linux Plumbers Conference about CRIU and the PID dance, because creating a process with a specific PID used to be complicated on Linux: there was an interface,
08:44
and you had to write the PID you want to the interface, and then be really fast with your fork, and hope that no other process is created during the same time. But with the help of Christian, we introduced clone3, and now we can create a process with a certain PID,
09:02
this is available since, I guess, Monday, with Linux 5.5, and CRIU also has all the code to use clone3 if your kernel has it. So now CRIU can create new processes with fewer syscalls, and without the race that some other process might have been created in between,
09:20
and once all these processes have been created, they are morphed into the processes which should be restored. I like the example about file descriptors: during checkpointing, CRIU tries to figure out all the file descriptors,
09:41
and to which file they point, and at which position they are, and CRIU writes that to the checkpoint images. When the process is restored, the file is opened with the same file descriptor and seeked to the same position, and once the process keeps on running, the file descriptor is in exactly the same situation it was in before checkpointing.
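The file descriptor example can be sketched as follows. This is a simplified illustration under stated assumptions, not how CRIU is implemented: the helpers `checkpoint_fd` and `restore_fd` are hypothetical, the path is passed in explicitly instead of being discovered, and only read-only regular files are handled.

```python
import os
import tempfile


def checkpoint_fd(fd, path):
    """Record what is needed to recreate fd later: its number, file and offset."""
    return {"fd": fd, "path": path, "offset": os.lseek(fd, 0, os.SEEK_CUR)}


def restore_fd(record):
    """Reopen the file, place it on the original fd number, seek to the offset."""
    new_fd = os.open(record["path"], os.O_RDONLY)
    if new_fd != record["fd"]:
        # Move the descriptor to the number the process used before.
        os.dup2(new_fd, record["fd"])
        os.close(new_fd)
    os.lseek(record["fd"], record["offset"], os.SEEK_SET)
    return record["fd"]


# Create a small file to work with.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

fd = os.open(path, os.O_RDONLY)
os.read(fd, 4)                      # the "process" has read four bytes so far
record = checkpoint_fd(fd, path)    # checkpoint: remember fd number and offset
os.close(fd)                        # the original process is gone

restored = restore_fd(record)       # restore: same fd number, same position
data = os.read(restored, 3)         # reading continues where it left off
os.close(restored)
os.unlink(path)
print(record["offset"], data)
```

From the point of view of the code that reads from the descriptor, nothing has changed: the same descriptor number continues at the same position in the file.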
10:01
And that's basically what CRIU does with all the other resources the process is using: all the memory pages are mapped back to the place where they used to be before checkpointing, and we are loading all the security settings, AppArmor, SELinux, and seccomp,
10:21
we're doing this as late as possible, as mentioned, so that those policies do not interfere with CRIU's changing of the process, or restoring of the process, and once the process has been set up in all the ways that it has to be, we jump back into the original code,
10:41
and the processes can continue to run from the same point in time where we checkpointed them before, so that's basically where the process restore is finished. And so now to container live migration, to the actual inclusion of CRIU into different projects,
11:05
I think the first one I have to mention here is OpenVZ, because they invented CRIU for their container use case, to be able to live migrate their containers from one system to another. I never used it myself, but that's the group who invented CRIU,
11:26
then one interesting user of CRIU is Google, which we were informed about around one and a half years ago: Google actually uses CRIU in their container runtime Borg to live migrate processes in production a lot,
11:43
and as far as we in upstream CRIU know, they are very happy with how it works, and it works reliably for them, so this is something which we're pretty happy about as upstream. LXC/LXD has had an integration of CRIU for a very long time already,
12:01
then there's an integration of CRIU in Docker. You have to enable the experimental mode to use it, and at this point in time I would say it's basically unmaintained, so I'm not sure how well it works right now. And then the thing I've been working on for the last two years is the integration of CRIU into Podman,
12:24
and we have seen a talk about Podman this morning already: it's a container engine which is daemonless and rootless, and I started to work on this at the beginning of 2018,
12:40
and the first code was written in May and merged in October 2018. This was only the checkpoint/restore implementation, so you could checkpoint your container, reboot your system, restore your container, and it would continue to run from the point where you had checkpointed it. Oh, and this required changes on all the levels of Podman,
13:05
runc, conmon and also CRIU, for how Podman handles network namespaces. After that I continued to work on container live migration for Podman, which was merged last year, in 2019. This also required changes on all the levels which are involved,
13:29
also the SELinux changes were part of this. And with this I'm already at my demo. I copied the commands from my demo onto the slides here, but let's run them here,
13:48
so what I'm doing here is running a WildFly container. I have a stateful application there, so that container migration is at least somehow useful,
14:02
so let's start the container here. The WildFly container is a nice use case, because it actually takes some time to start, since all the Java things need to be loaded, and restoring it from the checkpoint is actually much faster, like 50% faster, than starting the container fresh,
14:21
so now I can access my Java container. I have the simplest application, which just returns an integer, and every time I read it, it's increased by one. I'm using curl to access the IP address of the container, and my application is called hello world,
14:41
and the first result is zero, and the second result is one, so it's simple but it's stateful. Now I'm telling Podman to checkpoint the container. I'm using the flag -R, which tells Podman to keep the container running, so I'm making a checkpoint of my container while it keeps on running. So now Podman is telling CRIU to make the checkpoint,
15:04
the checkpoint has been written to disk, and now I'm accessing my container again, and now it should say two and three, so the container kept on running while I made the checkpoint. Now I'm transferring the checkpoint archive,
15:21
the archive includes all the files about the running processes, all the memory pages which have been dumped, and all the changes which have been made to the file system of the container. So this includes all file system changes and all process state, which I'm now transferring to another VM on my system,
15:41
and now I'm telling Podman on the other system to restore the container, and this usually takes about four seconds, something like this. Now the container is restored, and I can access the container using curl again, and now I'm getting back the two which I got back up there,
16:02
which is the value from the state at the time of checkpointing the container. So I checkpointed the container, it then probably changed its state, but I can continue the container from the same state it was in when it was checkpointed. That's my demo, and with that I'm already done, thanks.
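The sequence of values seen in the demo (zero and one, then two and three after the checkpoint, then two again after the restore) can be reproduced with a small in-process stand-in. This sketch does not use Podman or CRIU at all; the `Counter` class and its `checkpoint` and `restore` methods are hypothetical, and a JSON string plays the role of the checkpoint archive.

```python
import json


class Counter:
    """Stand-in for the demo application: returns an integer, then increments it."""

    def __init__(self, value=0):
        self.value = value

    def hello(self):
        current = self.value
        self.value += 1
        return current

    def checkpoint(self):
        # Serialize the state; the running instance is left untouched.
        return json.dumps({"value": self.value})

    @classmethod
    def restore(cls, image):
        # Recreate an instance from the serialized state.
        return cls(json.loads(image)["value"])


app = Counter()
before = [app.hello(), app.hello()]       # requests before the checkpoint

image = app.checkpoint()                  # checkpoint while the app keeps running
after = [app.hello(), app.hello()]        # the source instance continues counting

migrated = Counter.restore(image)         # restore on the "destination"
first_after_restore = migrated.hello()

print(before, after, first_after_restore)
```

The restored instance answers with the value from the moment of the checkpoint, not with the value the still-running source instance has reached in the meantime.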
16:33
Hi, thank you for your talk, it's cool technology,
16:42
you mentioned that a year and a half ago Google integrated this into Borg. My understanding about Kubernetes is that stuff is supposed to flow from Borg into Kubernetes, at least theoretically. Have you heard any noise about people being interested in checkpoint/restore in Kubernetes?
No, personally I haven't heard anything of that,
17:03
basically, integrating it into Podman is my first step towards getting it somehow into Kubernetes, so now I have to somehow get it into, I don't know, CRI-O or something like this, and then maybe Kubernetes, but the fact that Google uses it internally,
17:22
might make the discussion about the usefulness of container live migration in Kubernetes maybe a bit easier, because that's probably one of the problems: containers are supposed to be stateless, so why would you have to live migrate them? But besides that, it might make it easier to get it into Kubernetes this way.
17:44
Hi, you talked about file descriptors being copied over. Can you talk more about sockets being copied over, like how it works behind the scenes?
So this is probably the question about TCP sockets, something like this,
18:01
so CRIU can checkpoint and restore network sockets, so if you have a working TCP connection it will still work on the destination host. The only requirement is that the restored process has to have access to the same IP address, because without the same IP address you cannot restore a TCP connection,
18:22
and for UDP it doesn't matter, it just works, and for TCP you have to have the same IP address.
Other questions?
18:43
Databases: could you please tell us more about how well it works with databases? Of course we had this experience before, and databases, I think, usually need to be stateful, and it was a problem for us to handle migration of active databases, actually. So how is the progress right now with this? Thank you.
19:03
So, databases. I guess this basically depends on how your database is laid out in your container. If all your database files are mounted into the container, then you probably migrate your container, and you also have to migrate your data directory and then restore it; this should work,
19:24
and many years ago we tried to migrate Oracle databases, and this worked, but the database shut itself down after the migration, and we think that this is because the time is different on the different hosts,
19:47
so with the time namespace, which was just accepted this week, once it makes its way into CRIU, this could offer a way for us to tell the process in the container
20:02
that its time actually hasn't changed, that it's still running on the same CLOCK_MONOTONIC as before, or something like this. So the work on the time namespace is probably the most important for the database case, I would guess. Thank you.
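Why the monotonic clock matters after a migration can be shown with a little arithmetic. The sketch below is only an illustration of the offset idea behind a time namespace, not an interface of CRIU or the kernel: the helper `monotonic_offset` and both clock values are hypothetical.

```python
import time


def monotonic_offset(clock_at_checkpoint, destination_clock_at_restore):
    """Offset that makes the destination clock continue where the source stopped."""
    return clock_at_checkpoint - destination_clock_at_restore


# Hypothetical readings of CLOCK_MONOTONIC in seconds: the source host has been
# up for ten days, the destination host for only one hour.
source_at_checkpoint = 10 * 24 * 3600.0
destination_at_restore = 3600.0

offset = monotonic_offset(source_at_checkpoint, destination_at_restore)


def clock_seen_by_container(destination_clock):
    """Clock value a restored process would see if the offset is applied."""
    return destination_clock + offset


# Without the offset the clock would appear to jump backwards by this much.
jump_without_offset = destination_at_restore - source_at_checkpoint

# With the offset, the clock continues from the checkpointed value.
at_restore = clock_seen_by_container(destination_at_restore)
five_seconds_later = clock_seen_by_container(destination_at_restore + 5.0)

# For comparison, the local monotonic clock of this machine.
local_now = time.monotonic()
print(offset, jump_without_offset, at_restore, five_seconds_later)
```

An application that compares monotonic timestamps taken before and after the migration would then see time moving forward as expected, instead of a jump backwards.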