Checkpointing and migrating parallel applications on clusters

Kovács József <smith@sztaki.hu>

MTA SZTAKI


Checkpointing computationally intensive parallel applications are defined as saving the state of the process and the communication among them in a consistent way. During the execution of the application it is necessary to save the state of the application periodically. After checkpointing the load of the processing elements can be balanced by restoring processes on different processing elements and the computation of the application can be resumed from a specific point to avoid loss of results computed before. The latter one is also required when some processing elements fail causing the application to halt. The design of a checkpointing system should focus on keeping the user application, the message-passing environment and the operating system unchanged.


The distributed checkpointing system must face three fundamental issues. It must be able to save and restore the state of the processes, the messages in transit in the network and the connections among communicating processes. Furthermore it must ensure to avoid message duplication, message loss and to deliver messages where target processes are already migrated to a different host.


Designing a checkpointing system shows quite a large complexity and this paper introduces a possible solution that contains all the features mentioned above. The checkpointing and migrating system described here is currently implemented for the P-GRADE environment.