
Container Live Migration

Formal Metadata

Title
Container Live Migration
Title of Series
Number of Parts
44
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Producer

Content Metadata

Subject Area
Genre
Abstract
Checkpointing and restoring a process is a difficult task, and many container runtimes use it to implement container live migration. This talk gives details on how CRIU is able to checkpoint and restore processes, how it is integrated into different container runtimes and which optimizations CRIU offers to decrease the downtime during container migration. In this talk I want to provide details on how CRIU checkpoints and restores a process: starting from ptrace() to pause the process, how parasite code is injected into the process to checkpoint the process from within its own address space, how CRIU transforms itself into the restored process during restore, and how SELinux and seccomp are restored. I also want to give an overview of how CRIU uses userfaultfd for lazy migration and dirty page tracking for pre-copy migration. I want to end this talk with an overview of how CRIU is integrated into different container runtimes to implement container live migration.
Transcript: English (auto-generated)
Welcome to my talk about container migration. My name is Adrian Reber. I have worked for Red Hat since 2015, and I have been working on process migration, which is the basis
for container migration, for the last ten years. I have been involved in CRIU, which is what we use to migrate processes, since at least 2012, and since 2015 I have been focusing on migrating containers.
Content similar to what I have here, especially the demo at the end, is also available in a blog post. It's a bit different because it's based on RHEL 8.1 beta, but it's pretty close to what I have here.
The first thing I want to do is define what container live migration is, because usually when I talk about container live migration there are different expectations of what it is. From my point of view it's more or less the same as virtual machine
migration. You somehow transfer a running container from one system to another, which could also be called stateful migration or live migration; multiple definitions would probably work. The first step is to serialize the container on your source system.
This is what we use CRIU for. We write everything to disk, transfer it to the destination system and restore it, and then the container keeps running with the same state it had on the source machine. As already mentioned, this is based on CRIU, Checkpoint/Restore In Userspace, and CRIU
is one of many checkpoint/restore implementations from the last 20 years or so. What is different about CRIU is that it tries to do most things in user space.
That's why it has the name. CRIU is integrated into multiple container engines, and later in my talk I will focus on the integration of CRIU into Podman, which is what I worked on. Before going to the container engine integration, I want to give some details about how CRIU works.
The first step is checkpointing the process or the container. CRIU uses ptrace to pause the process, and then it starts collecting information
about the process and writes it to disk. One of the main interfaces CRIU uses to collect the information is /proc/<PID>. That's also one of the reasons why it's called Checkpoint/Restore In Userspace: it queries the information about the process from user space.
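To make the /proc/<PID> interface a bit more concrete, here is a minimal Python sketch that parses the memory map lines found in /proc/<PID>/maps. It only illustrates the kind of information that is available from user space; it is not how CRIU itself is implemented (CRIU is written in C and reads many more files). The names `Mapping`, `parse_maps_line` and `read_maps` are made up for this example.

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    start: int   # first address of the mapping
    end: int     # first address after the mapping
    perms: str   # permission string, e.g. "r-xp"
    offset: int  # offset into the backing file
    path: str    # backing file, "[heap]", "[stack]", or "" if anonymous


def parse_maps_line(line: str) -> Mapping:
    """Parse one line in the format used by /proc/<PID>/maps."""
    # Fields: address range, permissions, offset, device, inode, optional path.
    fields = line.split(None, 5)
    start_s, end_s = fields[0].split("-")
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(int(start_s, 16), int(end_s, 16), fields[1],
                   int(fields[2], 16), path)


def read_maps(pid: int) -> List[Mapping]:
    """Read all mappings of a process from user space (Linux only)."""
    with open(f"/proc/{pid}/maps") as f:
        return [parse_maps_line(line) for line in f if line.strip()]
```

On a Linux system, `read_maps(os.getpid())` (after `import os`) would list the mappings of the calling process; reading another process's maps additionally requires sufficient privileges.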
When CRIU was initially developed, many interfaces already existed, but the CRIU developers added new interfaces to the kernel to get more information about running processes out of it. The interesting thing about this, at least for me, is that those interfaces are not checkpoint/restore only.
They are already used for other things. It was always important for CRIU that the things it gets into the kernel are not checkpoint/restore only. Once CRIU has collected all the information about the process using /proc/<PID>, the next
step is what CRIU calls its parasite code. This is one of my favorite parts of CRIU, maybe also because it's one of the craziest parts of CRIU: how to retrieve information from a running process. The parasite code is injected into the running process using ptrace.
Some existing code is replaced, and then the parasite code runs inside the process. The parasite code is basically a daemon: it connects to the main CRIU process and waits for commands to do things
for CRIU from within the address space of the process to be checkpointed. One of the main tasks of the parasite code right now is to extract all the memory pages from the process that are later needed to restore it. There are ways
of doing this with ptrace, but at the time CRIU was first developed, ptrace was really slow, and you want to get the memory pages out of the process and onto disk as fast as possible. If you actually migrate a process or a container from one system
to another now, getting the memory pages out of the process takes much less time than transferring them over the network. So this is in pretty good shape right now, thanks to the parasite code and how it works
with CRIU and the process it is injected into. Once the parasite code has dumped the memory and done the other things it does from within the address space of the process, it is removed from the process again. This is what CRIU calls curing the process. The process can then continue to run, and in
most cases the process will never know that it was under the control of CRIU and the parasite code. Maybe if it checks, I don't know, CLOCK_MONOTONIC or something, it will see that it was paused for some time, but usually the process does not know that it was under
the control of CRIU or the parasite code. At this point the checkpointing is finished: all relevant information has been written to disk, and the target process can be killed or can continue to run, depending on how you want to use checkpoint/restore or process migration.
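The "daemon waiting for commands" idea can be illustrated with a toy model in Python. This is only a sketch under heavy assumptions: a thread stands in for the parasite, a socket pair stands in for the connection to the main CRIU process, and the one-byte commands `CMD_DUMP_PAGES` and `CMD_CURE` are invented for this example. The real parasite is code injected into the target process via ptrace, and its actual protocol is not described in the talk.

```python
import socket
import threading

CMD_DUMP_PAGES = b"D"  # hypothetical command: send back the memory contents
CMD_CURE = b"C"        # hypothetical command: stop serving and go away


def parasite_daemon(sock: socket.socket, memory: bytes) -> None:
    """Stand-in for the parasite: serve commands until told to exit."""
    with sock:
        while True:
            cmd = sock.recv(1)
            if cmd == CMD_DUMP_PAGES:
                # Length prefix first, so the controller knows how much to read.
                sock.sendall(len(memory).to_bytes(8, "big") + memory)
            elif cmd in (CMD_CURE, b""):
                # "Curing": stop serving (also on a closed connection).
                break


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)


def checkpoint(memory: bytes) -> bytes:
    """Toy controller: start the 'parasite', request the pages, then cure."""
    ctl, para = socket.socketpair()
    worker = threading.Thread(target=parasite_daemon, args=(para, memory))
    worker.start()
    with ctl:
        ctl.sendall(CMD_DUMP_PAGES)
        size = int.from_bytes(recv_exact(ctl, 8), "big")
        dumped = recv_exact(ctl, size)
        ctl.sendall(CMD_CURE)
    worker.join()
    return dumped
```

Calling `checkpoint(data)` returns a copy of `data` that travelled through the command channel, which mirrors the sequence described above: connect, dump on request, then cure.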
Another interesting thing when talking about container live migration is SELinux. If you do this in a container, the container is running under its own SELinux label, and when the parasite code in the container tries to communicate with the outside of the container, this is something SELinux is not really
happy about. Also, during restore there are multiple steps where you have to restore the policies to what they were before checkpointing. I'm also giving a complete talk about CRIU and SELinux at the Linux Security Summit
this year in Lyon. Once everything is written to disk, the next and last step in process or container migration is restoring the process. The checkpoint images are all
read from disk into CRIU's memory. CRIU always operates on process trees: you point it to one PID, and it will checkpoint that process and all of its child processes, and
it will of course do the same during restore to recreate the process tree. I gave a talk at Linux Plumbers about CRIU and the PID dance: how CRIU tries to recreate the process tree with the same PIDs it had during checkpointing,
and there might be a new interface using clone3() to improve this in CRIU. The basic thing is that CRIU morphs itself into the process to be restored. First it forks all the processes, and then all the processes are recreated in the state they were in during
checkpointing. One example I always like to give is the file descriptor: CRIU records the file descriptor number, which file it points to and the position, and during restore it opens the file with the same file descriptor number and positions the file descriptor at the same
location. Once the process continues to run and writes to or reads from that file, the file descriptor will point to the same location. That's how CRIU tries to restore all the resources it can control. Another thing is that it maps the memory pages
back to the right location, and it loads the security settings, as already mentioned. CRIU can handle AppArmor, SELinux and seccomp, and it does this as late as possible because if it did it earlier, the security settings
would be problematic for parts of CRIU's restore process. That's why it's one of the very last steps before CRIU lets the process jump into the original code, and the code continues to run as it did before checkpointing. Now to container live migration, after the short 30-slide introduction. Container live
migration exists for multiple container engines right now. Maybe the first to use CRIU was OpenVZ; the company behind OpenVZ was also the company which invented
CRIU to make their containers live-migratable. Another interesting user of container live migration is Google: the last two years at Linux Plumbers they talked about how they actually live migrate their containers
in production. Everything which is not interactive, that is, long-running batch jobs, is live migrated using CRIU from one node to another if they are under resource pressure. CRIU has also been integrated into LXC and LXD for some time now. Then there is an integration into Docker, which I would call basically unmaintained
from what I have seen of the activity around it in the last few months. And then there is the Podman integration, which I have been working on for the last one and a half years to enable Podman to
live migrate containers. Some keywords about Podman: it runs containers without a daemon, unlike maybe LXD or Docker, and you can run it without root, so just as a user you can run your Docker containers. The checkpoint
restore implementation for Podman, which I did, started some time at the beginning of 2018. There was some code in May 2018 and it was merged in October 2018. At that time it was not yet live migration, it was only checkpoint
restore: you could checkpoint your container, reboot your system into a newer kernel and then restore the container with the same state, so it keeps running with the same memory and settings it had during checkpointing. This required many changes to runc, which is one of the container runtimes Podman can use,
and it required CRIU changes and Podman changes of course. Then a few months ago, in June, the changes to implement container live migration were merged into Podman. This again required changes in all of the involved packages
in the stack, from Podman down to CRIU. Now I want to give a short demo of how it works. The first thing I will do is start a WildFly
container. This is a Java application server, and it runs a really simple application: it basically returns an integer and then increases it, so it's stateful but really simple. Now I can say I want to
checkpoint this container. As for the flags: -R leaves the container running after checkpointing, -l tells Podman to work on the latest container, and the export option writes everything about this container's checkpoint into this file. The file contains metadata about the container, it contains
the actual checkpoint image, which is mainly memory pages, and it contains the file system changes relative to the layer the container was started with,
so all the files which were changed during the runtime of the container are also included in this checkpoint archive. Now I can copy the checkpoint archive to another machine. I'm using two virtual machines here; both are RHEL 8.1 beta something, and I'm using Podman from Git.
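The demo steps (checkpoint with export, copy, and the restore described next) can be sketched as a small Python helper that only builds the command lines and does not run anything. The archive path, the host name and the use of scp for the copy are assumptions made for illustration. The talk names `-R`, `-l` and the export option for the checkpoint and a `-n` for the restore; `--import` is my reading of the Podman CLI rather than something stated in the talk.

```python
import shlex
from typing import List


def checkpoint_cmd(archive: str) -> List[str]:
    """Checkpoint the latest container, keep it running, export to a file."""
    return ["podman", "container", "checkpoint",
            "-l",                    # work on the latest container
            "-R",                    # leave the container running afterwards
            f"--export={archive}"]   # write the checkpoint archive here


def copy_cmd(archive: str, host: str) -> List[str]:
    """Copy the archive to the destination (scp is an assumption)."""
    return ["scp", archive, f"{host}:{archive}"]


def restore_cmd(archive: str, name: str = "") -> List[str]:
    """Restore from an archive, optionally under a different name."""
    cmd = ["podman", "container", "restore", f"--import={archive}"]
    if name:
        cmd += ["-n", name]          # give the restored container a new name
    return cmd


def migration_plan(archive: str, host: str, name: str = "") -> List[str]:
    """Return the three steps as shell-quoted strings, without running them."""
    steps = [checkpoint_cmd(archive), copy_cmd(archive, host),
             restore_cmd(archive, name)]
    return [shlex.join(step) for step in steps]
```

For example, `migration_plan("/tmp/chkpt.tar.gz", "dest")` yields the checkpoint command for the source machine, the copy command, and the restore command to run on the destination. Building argument lists instead of shell strings keeps the sketch safe to pass to `subprocess.run` later.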
On the other machine I can now say restore: I say podman container restore and tell it to read from the checkpoint archive, and now Podman unpacks the thing
and tells runc and CRIU to recreate the container. As I can see here, on the last access to the container I got a two, and if I now access the migrated container here I should probably get a three, and I get a three,
so the container was live migrated with its state. I can also restore the container a second time: I give it a -n to give it another name, and if I now access that container I should again get the three,
and that's the live migration of the container. Another interesting thing about this feature is that you can not only live migrate containers like I did: in my case, the Java application server with this really simple application takes around eight seconds to start up and be able to answer requests, and it takes
about four seconds to restore it from the checkpoint. So in my really simple example I can reduce the startup time of the container by around 50% just by not having Java do all its initialization, but restoring it from the checkpoint, which already has all the libraries set up
and memory loaded the way Java wants it. On the slides the same example I just did in the live demo is written out, and with that I'm already
at the end of my presentation. There are lots of links to recordings, blog posts and articles all concerning this presentation. Thanks for your attention. Any questions?
Thank you for the talk. I got a question regarding uninterrupted live migration. It was a very good example of suspend and resume that you demonstrated
but if I have a live cluster and I start live migrating things, how well supported is that outside of Podman, in this case the service that runs the pod? Like, what happens to my open TCP connections or anything like that? Is that managed yet or is there still work to do?
So, from CRIU's point of view: if you can somehow migrate your IP address, then the open and established TCP connections will be migrated. I think the most interesting thing is established TCP connections,
and if the IP stays the same, established TCP connections will stay connected. You have to migrate within the TCP timeouts. If the machine you migrated to didn't already have the container images downloaded, would Podman download them?
Yes. I don't do this in my demos, but Podman sees from the metadata in the checkpoint image which container image this is based on, and Podman would then go out to the registry, download the container image and then do the restore based on the downloaded hash
or whatever it downloaded there. I'm not doing it in the demo because I don't know how long the download will take. Any further questions? Thank you, Adrian.