Exploring CRIU
Formal Metadata
Number of Parts | 63
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/54613 (DOI)
openSUSE Conference 2016, 5 / 63
Transcript: English (auto-generated)
00:09
Okay, it's time. It's good, it's just after lunch, and I hope you are well refreshed, or ready for sleep now.
00:22
So anyway, let's start. My name is Takashi. In my regular work I'm usually working on the kernel side, but in today's talk I would like to show you something different. It sits between kernel and user space, and that's CRIU.
00:45
So this is the outline of this presentation. At first I will give you a brief introduction, so what CRIU is, followed by the basic design and how to use CRIU, and how
01:04
CRIU is adopted by containers. Then we'll dive into the details of the implementation of CRIU, so how it is implemented, and then we'll discuss what the problems with CRIU are right now.
01:23
So let's start from the very, very basic question. What is CRIU? Does anyone in this room have an idea what CRIU is? Yep, great answer.
01:41
I love this audience. It's almost correct. Well, it's Crazy Russians Embed Unix, that's, well, I forgot that this is video recorded. Well, okay, that's even listed as an official acronym on the CRIU web page.
02:06
The upstream developers have a nice sense of humor, of Russian humor, and that joke came from the git commit message when the patches were merged into the Linux kernel. Yeah, it's Linus Torvalds and Andrew Morton, after all.
02:21
So you see that message. So the answer was correct. CRIU actually means checkpoint/restore in user space, and this is a tool that allows you to, well, kind of suspend and resume each user space process or process tree,
02:44
or, more accurately, it's a hibernate and resume of your process. With that, you can just dump your running process or process tree to disk storage, and at any time later you can restore, restart the dumped process again.
03:05
And one interesting thing about CRIU is that it's implemented almost entirely in user space. And the primary use cases of CRIU are like this. The first one is very obvious: snapshotting and checkpointing, and restarting the task.
03:25
It's especially useful for long-running tasks like high-performance computing. You want to make sure that your task can be interrupted at any time, so if something wrong happens, you can restart the task from the middle.
03:40
And the second use case is also obvious, and it's more typical. That is demanded by containers and so on. That is migration of tasks. So you can migrate a process, any process, a process tree, between containers on a single host, or even between different hosts.
04:04
And the third use case is a little bit different. It's a fast start of a task or a container. You can think of the similar technique used by Emacs or TeX. Some applications take a very, very long time to initialize, to start up.
04:22
They read many files and evaluate many things. With CRIU, you just dump right after the initialization is done. Then the next time, you can resume from the dumped image without any long initialization.
04:42
I mentioned that one of the important points of CRIU is that it's implemented mostly in user space. So why, you may ask? Because such a thing is usually a task of the kernel. Well, upstream developers thought the same way.
05:02
And at first they implemented everything in the kernel, and that resulted in over 100 patches. And, well, kernel people didn't like that: NACK. So they changed their minds, they re-implemented things and moved their stuff into user space as much as possible.
05:22
Only the very small things are left in the kernel, and that was successful, accepted by the mainline kernel, as you saw in the first slide. Actually, what the kernel does for checkpoint/restore is very minimalistic: only a few changes for
05:43
the system calls and the procfs file accesses, nothing else. And one point to be noted is that CRIU has no performance impact on the running task. CRIU does something at checkpoint time, at dumping and restoring, but
06:02
it does nothing to the running process itself. But there is always a downside. In this case, the drawback is that CRIU has to take care of everything. So every single piece of dirty work is implemented in CRIU itself.
06:21
This slide shows the development status. Upstream developers are mostly working at Parallels, and they also work on OpenVZ and Virtuozzo, which are long-standing container technologies. The CRIU project itself started back in 2011, and
06:40
soon after that, the patches were merged into the kernel. Currently there are multiple supported architectures: x86-64, ARM64, and ppc64 little-endian. 32-bit Intel support is ongoing, but not there yet. Note that these architectures are only about the user space CRIU stuff.
07:05
The kernel supports all architectures; that part is easy, and it's even generic. The project is still in a very active stage, and there are multiple commits each day. They even have a regular monthly release, and
07:22
as of today, the latest version is version 2.3. On SUSE, openSUSE, we have supported CRIU from the beginning. It's since openSUSE 13.2, and that was kernel 3.16.
07:43
And of course, Leap and Tumbleweed are supporting CRIU. The latest CRIU package is always found in the OBS devel project, the devel:tools project. The package has few dependencies, so you can install it on older distributions from the repository.
08:04
On the other hand, on SUSE Linux Enterprise, we have not supported CRIU at all so far. But the good news is that we are going to support it, at least on the kernel side, in SLE 11, sorry, 12 SP2, but it's still experimental and not decided yet.
08:23
And the package is not found in SLE 12. So if any enterprise customers want to have that, just ask the PM, so the product manager, and maybe somebody in this room, and put pressure on them.
08:41
So let's go to the basic design and usage. CRIU has three different interfaces. One is the command line, one is RPC, and another is a library interface. Most programs use the first one, the command line. Even the containers just invoke the command line.
09:01
Some programs use RPC, but as far as I know, there is no program using the library API. CRIU provides two extensions. One is a plugin, and that's a shared object, so dynamically linked into CRIU, and the other is an action script.
09:20
This is a script that is executed at each different stage of the dump and restore. For example, the action script is used for locking and unlocking the network. And the image files that CRIU creates: CRIU creates multiple files, not one whole archive, but multiple files in the directory, and even for
09:45
each process. Each image file is in the format of Google protocol buffers. It's very portable and efficient. Also, CRIU provides the CRIT utility program to convert the image files to and from JSON, and you can even manipulate the image files with that command.
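Once an image has been converted to JSON with CRIT (for example with `crit decode -i pstree.img`), it can be inspected with ordinary JSON tooling. The sketch below walks a decoded process-tree image. The sample text and its field names (`entries`, `pid`, `ppid`, `threads`) are an assumption based on how such output typically looks, not something shown in the talk, so compare them with real output before relying on them.

```python
import json

# Sample text shaped like the JSON printed for a process-tree image.
# The field names are an assumption; real output may carry more fields.
SAMPLE_PSTREE_JSON = """
{
  "magic": "PSTREE",
  "entries": [
    {"pid": 1789, "ppid": 0, "pgid": 1789, "sid": 1500, "threads": [1789]},
    {"pid": 1790, "ppid": 1789, "pgid": 1789, "sid": 1500, "threads": [1790, 1791]}
  ]
}
"""


def summarize_pstree(json_text):
    """Return a list of (pid, ppid, thread_count) tuples from decoded JSON."""
    image = json.loads(json_text)
    summary = []
    for entry in image.get("entries", []):
        thread_count = len(entry.get("threads", []))
        summary.append((entry["pid"], entry["ppid"], thread_count))
    return summary


if __name__ == "__main__":
    for pid, ppid, nthreads in summarize_pstree(SAMPLE_PSTREE_JSON):
        print(f"pid={pid} ppid={ppid} threads={nthreads}")
```

Because the images are plain protocol buffer messages, the same approach works in the other direction: edit the JSON and encode it back with CRIT.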
10:07
And the command line invocation is very simple, in the modern form: criu, then the subcommand and options. For checkpointing, so dumping, you run criu dump with -t and the process ID,
10:21
and -D for the directory to store the dump files, and more options. Depending on the situation, you need to pass many options, but I will mention that later. For restoring, you run criu restore with -D and the directory, and mostly the same options as at dump time,
10:42
but without the process ID, because the process ID itself is restored by the restore action. So, I will show you some demo. There's a Python script calculating the number pi. See, so just starting that, and it shows many digits of pi.
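The demo script itself is not shown in the recording. A stand-in with the same character, a long-running process that keeps printing digits of pi, might look like the sketch below. It uses a well-known unbounded spigot algorithm; the function names `pi_digits` and `run` are my own and are not taken from the talk.

```python
import itertools
import sys
import time


def pi_digits():
    """Yield the decimal digits of pi one at a time, without end."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough precision yet: consume one more term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k) + 2 + r * l) // (t * l),
                l + 2,
            )


def run(limit=None, delay=0.05, out=sys.stdout):
    """Print digits one by one; with limit=None this never returns."""
    digits = pi_digits()
    if limit is not None:
        digits = itertools.islice(digits, limit)
    for digit in digits:
        out.write(str(digit))
        out.flush()
        time.sleep(delay)
```

Calling `run()` with no limit keeps the process busy until it is killed, which is what makes a script like this a convenient checkpoint target: after a restore, the output simply continues where it stopped.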
11:12
And so, you see that that's a process, and so please remember,
11:22
remember the process number, the process ID, in this case 1789. And, okay, make, so that's first, oh, okay, there's one I forgot to remove.
11:47
So it's an empty directory now, criu, and then dump. I forgot the process ID, that was, I forgot.
12:22
And because the task is running from the shell, we need to pass --shell-job in this case. And it's killed, so that is 30,000. And you see the directory here, the files are created.
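The dump and restore invocations described above can be written down as small helpers that only build the argument lists. This is a sketch: the helper names and the directory name `demo1` are hypothetical, while `-t`, `-D` and `--shell-job` are the options named in the talk.

```python
def criu_dump_argv(pid, images_dir, shell_job=False):
    """Build the argv for dumping the process tree rooted at pid."""
    argv = ["criu", "dump", "-t", str(pid), "-D", images_dir]
    if shell_job:
        # Needed when the task was started from an interactive shell.
        argv.append("--shell-job")
    return argv


def criu_restore_argv(images_dir, shell_job=False):
    """Build the argv for restoring; no PID is given, it comes from the images."""
    argv = ["criu", "restore", "-D", images_dir]
    if shell_job:
        argv.append("--shell-job")
    return argv


if __name__ == "__main__":
    print(" ".join(criu_dump_argv(1789, "demo1", shell_job=True)))
    print(" ".join(criu_restore_argv("demo1", shell_job=True)))
```

The lists can be handed to `subprocess.run(argv, check=True)`; both commands normally need root privileges.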
12:42
And this file, so, dump, demo one, okay, crit show, pstree. So, you can see each file's content with the crit command, and for example,
13:02
regular files, this shows which files have been opened, and they are recorded here. Now, restore, shell job, and remember that we stopped at 13,000 something.
13:28
And it's starting again, at the same position, and again. And one interesting thing is that even the process ID is restored.
13:47
It shows the very same process ID, and the very same user. So then, how CRIU is adopted by containers. There are many container technologies on Linux, and
14:03
let's see which of them support CRIU. First of all, the containers by the upstream developers, and of course they support CRIU natively: Virtuozzo and OpenVZ. And the next ones, LXC and LXD, they also support CRIU,
14:24
since a couple of years ago, I forgot, or a year ago, version 1.1. And the current version supports it even better. And a minor one, systemd-nspawn, does not support it at all yet, because the upstream developers, the systemd developers, have some concerns about
14:44
a dependency on CRIU. So systemd does not want to have the dependency, for some funny reason. Then the biggest one, the hottest one: Docker. Docker does not support CRIU yet as of version 1.12.
15:03
And it was a little bit unfortunate that Docker itself was restructured just at the time the pull request for CRIU support was sent. Docker was switched to runc and containerd, and because of that, it didn't happen. But as a result, runc and containerd already support CRIU natively,
15:25
and only Docker does not. There have been different forks for supporting CRIU in Docker. The most promising one is Ross Boucher's branch, and those are also the patches that were merged into runc and containerd.
15:40
So for Docker, a container migration implementation, no, not yet. So there is some more demo, about Docker. Okay, let's start with the very same thing.
16:01
So we have the Docker, Dockerfile. That's just building pi.py into a Docker image. So it's done. And running the Docker. This one, okay, the name is pi, and pi.py.
16:28
So it's running there in the container. You see the container is running. And there is a new command, checkpoint.
16:43
And you have three further subcommands under checkpoint. So first, create, with PPP. The checkpoint named PPP is created. Then it's done. It's gone.
17:04
And we can restart again. Okay, checkpoint. And there you find the checkpoint named PPP that is stored there. Now we restart the task with a new option,
17:22
checkpoint, okay, and that's it. Then it starts from the last number. And you can even start again after stopping it.
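The sequence in this demo (create a checkpoint, list it, start the container from it) can be sketched as argument lists. The spelling below follows the experimental `docker checkpoint` command line as it later appeared in Docker; the fork used on stage may have differed, so treat the exact syntax as an assumption. The container name `pi` and checkpoint name `PPP` are the ones heard in the demo.

```python
def docker_checkpoint_cycle(container, checkpoint):
    """Return the three commands of the demo, in the order they were run."""
    return [
        # Freeze the container and store its state under the given name.
        ["docker", "checkpoint", "create", container, checkpoint],
        # Show the checkpoints stored for this container.
        ["docker", "checkpoint", "ls", container],
        # Start the stopped container again from the stored state.
        ["docker", "start", "--checkpoint", checkpoint, container],
    ]


if __name__ == "__main__":
    for argv in docker_checkpoint_cycle("pi", "PPP"):
        print(" ".join(argv))
```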
17:42
And it's there. Okay, that was a rather boring demo. So we can try something more interesting. This one is starting an Xvnc session
18:02
in the Docker container. So the whole desktop session is running there. So now the VNC container is running, and I can connect to this container with a VNC viewer. So starting here, and xterm.
18:23
Yeah. Oh, yeah, that is, it's root. And something more interesting. And I can start, I can play something like that. Yeah, yeah, that's not good.
18:49
And anyway, then again, docker checkpoint. And VNC, Doom, oh, checkpoint create.
19:00
Then it's gone. Then again, it's there.
19:22
So that works. So, okay, go back. So that was how CRIU is used. And now I'll show you how CRIU is implemented.
19:42
CRIU uses a few very low-level techniques. So remember that the kernel provides the ptrace and mmap system calls. These are very well-known system calls, and you can do everything with them. And especially ptrace is used by debuggers and tracers.
20:02
So you can manipulate everything. And combining these two system calls, CRIU implements so-called parasite code injection. This is a way of running system calls in a foreign process, by injecting a small piece of position-independent executable
20:23
code, mmapping it into the foreign process, and then running that code via ptrace. After getting the result, it returns to the host process and cleans up the injected memory area again, cleans up everything,
20:41
as if nothing had been there. Another interesting technique is the TCP repair mode. In this mode, the TCP socket behaves as if it were handling the stream and changes its state, but it does not actually transfer any data itself. It's used for resuming a network connection.
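For orientation, repair mode is a per-socket option on Linux. The sketch below shows only the step of switching it on and off. The constant value is taken from the Linux header `<linux/tcp.h>` as I remember it, and the helper name `set_repair_mode` is hypothetical. The call needs the CAP_NET_ADMIN capability, so for an ordinary user it is expected to fail, which the helper reports as `False`.

```python
import socket

# Option number from the Linux UAPI header <linux/tcp.h>. Python's socket
# module may not export it, so it is spelled out here (an assumption to verify).
TCP_REPAIR = 19


def set_repair_mode(sock, enabled):
    """Try to switch TCP repair mode on or off; return True on success."""
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1 if enabled else 0)
    except OSError:
        # Typically EPERM without CAP_NET_ADMIN, or an unsupported platform.
        return False
    return True


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        print("repair mode enabled:", set_repair_mode(sock, True))
```

While a socket is in this mode, a checkpointing tool can read out and later reinstate connection state such as sequence numbers and queue contents through further socket options, so that the peer does not notice the interruption.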
21:04
The Linux kernel provides many procfs files, and there is lots and lots of information there, and CRIU uses that intensively. For example, process tree, pages and mapped VMAs, file descriptors, POSIX timers, namespaces, and so on.
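As one concrete example of this kind of information, the memory mappings of a process are listed line by line in `/proc/<pid>/maps`. The following sketch parses that text format into simple records. The names `Vma`, `parse_maps_line` and `read_vmas` are hypothetical; this illustrates the data source and is not how CRIU itself is written.

```python
from collections import namedtuple

# One virtual memory area as listed in /proc/<pid>/maps.
Vma = namedtuple("Vma", "start end perms offset path")


def parse_maps_line(line):
    """Parse one line such as
    '00400000-0040b000 r-xp 00000000 08:01 131090 /bin/cat'."""
    fields = line.split(None, 5)
    start_text, end_text = fields[0].split("-")
    # The sixth field is absent for anonymous mappings.
    path = fields[5].strip() if len(fields) > 5 else ""
    return Vma(
        start=int(start_text, 16),
        end=int(end_text, 16),
        perms=fields[1],
        offset=int(fields[2], 16),
        path=path,
    )


def read_vmas(pid="self"):
    """Read all mappings of a process (Linux only)."""
    with open(f"/proc/{pid}/maps") as maps:
        return [parse_maps_line(line) for line in maps if line.strip()]
```

On a Linux machine, `read_vmas()` with no argument lists the mappings of the calling interpreter.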
21:22
These are all found in the procfs files. And for dumping the process, roughly speaking, there are three stages. At first CRIU stops the task, and for that, usually ptrace is used. Or optionally, you can use the cgroup freezer.
21:45
With that, the task, so the given task or task tree, is stopped. You also need to lock the network so that the state is kept consistent during dump and restore.
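The optional cgroup freezer mentioned above is driven through a control file: in the version 1 freezer controller, writing `FROZEN` or `THAWED` to `freezer.state` in a cgroup directory stops or resumes every task in that group. The sketch below shows only that file protocol. The helper names are hypothetical, and the dry run uses a scratch directory instead of a real cgroup, so it demonstrates the writes without freezing anything.

```python
import os
import tempfile

VALID_STATES = ("FROZEN", "THAWED")


def set_freezer_state(cgroup_dir, state):
    """Request a freezer state by writing it to freezer.state."""
    if state not in VALID_STATES:
        raise ValueError(f"unsupported freezer state: {state!r}")
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as control:
        control.write(state)


def get_freezer_state(cgroup_dir):
    """Read back the state currently reported by freezer.state."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as control:
        return control.read().strip()


def dry_run():
    """Exercise the helpers against a scratch directory (not a real cgroup)."""
    with tempfile.TemporaryDirectory() as scratch:
        set_freezer_state(scratch, "FROZEN")
        frozen = get_freezer_state(scratch)
        set_freezer_state(scratch, "THAWED")
        thawed = get_freezer_state(scratch)
    return frozen, thawed
```

On a real system the directory would be a freezer cgroup that contains the task tree to be dumped, and a real kernel may briefly report an intermediate state while tasks are still being stopped.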
22:01
That's usually done by netfilter, or by an action script for the network namespace. Then CRIU gathers the whole process information. As I showed there, it parses procfs files, or it runs parasite code injection
22:21
in the foreign processes to get credentials, memory contents, signals and so on. These are stored in dump files in protocol buffer format. And then it dumps the pages. That's done by the vmsplice and splice system calls. So the dumping is actually easy.
22:42
You just need to gather the whole information and save it to files. But the problem is restoring; it's tricky. The strategy that CRIU takes is to fake the process tree so that it looks like the original tree, and then to morph into the dumped processes again.
23:02
So it starts from the root process, that is the criu restore command itself. It forks the processes, the tree, in exactly the same form as the dumped processes. Then it restores the dumped process information
23:20
to each process, and it morphs back into the original process. So that makes the tree. As you saw in the demonstration, even the process ID was restored to the original process number. This is done by setting the sysctl,
23:40
so kernel.ns_last_pid. Also shared memory IDs and System V IPC IDs are restored, by system calls or sysctls. And open files: CRIU needs to reopen the very same files that had been opened, and also shared anonymous VMAs and so on.
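The PID trick can be illustrated in a few lines: the kernel hands out the value of `kernel.ns_last_pid` plus one to the next new task in that PID namespace, so writing the wanted PID minus one just before forking makes the child come out with the wanted number. This is a minimal sketch under that assumption. The helper name is hypothetical, the write needs privileges, and a real tool also has to guard against other processes forking in between, which is omitted here.

```python
import os
import tempfile

NS_LAST_PID = "/proc/sys/kernel/ns_last_pid"


def request_next_pid(wanted_pid, path=NS_LAST_PID):
    """Write wanted_pid - 1 so that the next fork is expected to get wanted_pid."""
    if wanted_pid < 2:
        raise ValueError("wanted_pid must be at least 2")
    with open(path, "w") as control:
        control.write(str(wanted_pid - 1))


def dry_run(wanted_pid=1789):
    """Show what would be written, using a scratch file instead of procfs."""
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "ns_last_pid")
        request_next_pid(wanted_pid, path=path)
        with open(path) as control:
            return control.read()
```

A restore would call `request_next_pid(pid)` and then fork once for each process in the tree, parents before children.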
24:03
And then it also restores the shared memory, namespaces, cgroups, session ID, group ID and so on. And one other interesting technique that CRIU uses for restore is the restorer. A restorer is a kind of trampoline,
24:22
a small piece of position-independent executable code. This is necessary for avoiding a segfault when unmapping. At cleanup, you need to unmap the memory that has been used. But doing that from the original process,
24:42
you will get a segmentation fault. So to avoid that, you jump once, somewhere outside, and do such things there, then jump again to the new code, that is, the original process's code, with the old code cleaned up.
25:03
So, so far, I presented CRIU as if it's working perfectly and beautifully. Everything is fine. Sorry, I lied. It has lots and lots and lots of problems. One of the biggest problems is that,
25:23
if the process or process tree is connected to the outside, you cannot guarantee that the process is restored in the same way, because it's outside. You have no idea. So CRIU basically behaves in a very conservative way,
25:41
so it says: I cannot dump. Then you have to convince CRIU, like a small child: honey, it's no problem, it's not a serious problem, just go on. For that, you give it options. So many different options, depending on the situation.
26:02
For example, for sockets, Unix sockets, if only one side of the pair is dumped, you have to pass --ext-unix-sk, and similarly for TCP connections. Also shell job: if the shell itself is outside, you need to pass the --shell-job option
26:20
to restore the TTY and the group and session IDs. Also file locks can be external. And bind mounts: if the bind source is external, you have to specify which bind mounts are to be handled. And another big problem is that it cannot handle device files at all, at all.
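The situations listed above map onto command-line flags. The sketch below collects them in one helper. `--ext-unix-sk` and `--shell-job` are named in the talk; `--tcp-established`, `--file-locks` and `--ext-mount-map KEY:VALUE` are the spellings I recall from the criu documentation of that period and are assumptions here, so check them against `criu --help` for the installed version. The helper name and the mount values in the example are hypothetical.

```python
def external_resource_flags(unix_sockets=False, tcp=False, shell_job=False,
                            file_locks=False, bind_mounts=None):
    """Translate 'what is connected to the outside' into criu options."""
    flags = []
    if unix_sockets:
        # Only one end of a Unix socket pair is inside the dumped tree.
        flags.append("--ext-unix-sk")
    if tcp:
        # Established TCP connections to peers outside the tree.
        flags.append("--tcp-established")
    if shell_job:
        # The controlling shell and terminal stay outside.
        flags.append("--shell-job")
    if file_locks:
        flags.append("--file-locks")
    # Bind mounts whose source lives outside: one KEY:VALUE pair each.
    for key, value in (bind_mounts or {}).items():
        flags.extend(["--ext-mount-map", f"{key}:{value}"])
    return flags


if __name__ == "__main__":
    example = external_resource_flags(shell_job=True, tcp=True,
                                      bind_mounts={"/data": "data-volume"})
    print(" ".join(example))
```

Container managers typically assemble this list on their own, which is why users of those tools rarely see these options.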
26:43
Because you have no idea what the device driver is, how the device driver behaves. And device driver behavior is different even on different machines, so you cannot migrate easily from one machine to another. Of course, the exceptions are very generic ones,
27:01
like /dev/null or /dev/zero and so on. But for any other device file, you need to write a plugin for each, so it's very hard. For a similar reason, X applications also cannot be dumped, because the X server has no idea that a client was dumped.
27:24
There are more restrictions; this is only a part of the restrictions. File systems: CRIU itself does not take care of anything about file systems. It relies on the file system being consistent between dump and restore.
27:40
There is one exception, the so-called ghost file. If a file was deleted but is still accessed, so it's invisible in the file system, then the file system cannot keep the consistency. In that case, CRIU just dumps the file content by itself.
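What a ghost file looks like from inside a process can be reproduced in a few lines: create a file, keep it open, and delete its name. The data stays readable through the open descriptor although no path leads to it any more. This is a sketch of the situation only, not of CRIU's handling, and the function name is hypothetical.

```python
import os
import tempfile


def ghost_file_demo(payload=b"still here"):
    """Create a file, unlink it while open, and read it back through the fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                   # the name disappears from the directory
        visible = os.path.exists(path)    # no path reaches the file any more
        links = os.fstat(fd).st_nlink     # link count as seen through the fd
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, len(payload))  # the content is still there
    finally:
        os.close(fd)
    return visible, links, data


if __name__ == "__main__":
    visible, links, data = ghost_file_demo()
    print(f"visible={visible} links={links} data={data!r}")
```

Since nothing on disk will still hold this content at restore time, the only safe choice for a checkpoint is to copy the bytes into the image, which is the exception described above.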
28:04
Also System V IPC: that has to be in a namespace, because it's more or less anonymous. Nested namespaces or cgroups might be problematic too. Also the root permission problem still exists, because namespace access needs root privilege.
28:25
So taking a look at all these limitations, well, we can say that CRIU actually requires the whole process tree to be self-contained. If it's self-contained, it can be restored,
28:41
as we have seen that a whole X session could be restored. Otherwise the user has to specify the options each time, and that's, well, nasty. However, usually the container management tool takes care of those options by itself. And another point is that, well,
29:02
CRIU cannot practically be used as a system-level checkpoint/restore because, well, on bare metal, without handling device drivers, it's useless. So it's different from whole-system suspend/resume. Also, there was a discussion that CRIU could be used
29:23
as a zero-downtime thing, like live patching. But, well, as far as I can say, it is not comparable with system-level kernel live patching, so kGraft and so on. However, CRIU is a very good tool for containers
29:44
or even for high-availability nodes, because it adds what containers do not have. So, well, you can still find very interesting ways to use CRIU.
30:01
Okay, that's all. Now, resources: you can find the CRIU home page. It's a very well organized and up-to-date wiki. So, I think that's all, time is almost up. Maybe just one question?
30:35
Hi. So you mentioned that you can use CRIU,
30:41
well, in HA environments. So you can basically suspend a process and resume it on another node. Yes. Okay, I just wanted to be clear on that. Yeah.
31:09
Okay, the time is up. Yep, thank you for listening.