Process checkpointing consists of saving the state of a running process so that it can be restarted at any later time. Uses include fault tolerance, job suspension to free memory resources, and live migration of processes across physical machines. Checkpoint services range from those that checkpoint only single processes to those that checkpoint full operating systems with their processes, file systems, socket states, etc. This talk will present Kerrighed's application checkpoint/restart and show its advantages in flexibility over other checkpoint services.

Kerrighed is a Single System Image operating system for clusters. It offers the view of a single SMP machine on top of a cluster of standard PCs. Kerrighed is implemented as an extension to the Linux operating system (a set of modules and a patch to the kernel). The current development version is based on Linux 2.6.30. The main available features are:
◦ Cluster-wide process management with customizable load balancing over the cluster (through process migration and remote forking)
◦ Cluster-wide shared memory
◦ Application checkpointing
◦ Node addition/removal