Monday, December 3, 2007

How to Enable Rescheduling of Grid Engine Jobs after Machine Failures

By Sinisa Veseli

Checkpointing is one of the most useful features that Grid Engine (GE) offers. Because the state of a checkpointed job is periodically saved to disk, the job can be restarted from its latest checkpoint if it does not finish for some reason (e.g., because of a system crash). In this way, the possible loss of processing for long-running jobs is limited to a few minutes, as opposed to hours or even days.



When learning about Grid Engine checkpointing, I found the corresponding HowTo extremely useful. However, that document does not contain all the details needed to enable rescheduling of checkpointed jobs after a machine failure. If you'd like to enable that feature, do the following:



1) Configure your checkpointing environment using the “qconf -mckpt” command (use “qconf -ackpt” to add a new environment), and make sure that the environment’s “when” parameter includes the letter ‘r’ (for “reschedule”). Alternatively, if you are using the “qmon” GUI, make sure that the “Reschedule Job” box is checked in the checkpoint object dialog box.



2) Use the “qconf -mconf” command (or the “qmon” GUI) to edit the global cluster configuration and set the “reschedule_unknown” parameter to a non-zero time. This parameter determines whether jobs on hosts in an unknown state are rescheduled and thus sent to other hosts. The special (default) value of 00:00:00 means that jobs will not be rescheduled away from the host on which they were originally running.



3) Rescheduling is initiated only for jobs that have the rerun flag activated. Therefore, you must make sure that checkpointed jobs are submitted with the “-r y” option of the “qsub” command, in addition to the “-ckpt <ckpt_env_name>” option.
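
The three steps above can be sketched from the command line roughly as follows. This is only a sketch under stated assumptions: the environment name `my_ckpt`, the job script `job.sh`, and the five-minute `reschedule_unknown` value are made up for illustration, and the two helper functions are not part of Grid Engine. They simply read the output of `qconf -sckpt` and `qconf -sconf` on standard input and check the settings described in steps 1 and 2.

```shell
#!/bin/sh
# Step 1: edit the checkpointing environment so that "when" includes 'r', e.g.
#   qconf -mckpt my_ckpt          # my_ckpt is a hypothetical environment name
# and in the editor set a line such as:
#   when    xsr
#
# Helper (not part of Grid Engine): reads `qconf -sckpt <name>` output on
# stdin and succeeds only if the "when" value contains the letter 'r'.
ckpt_allows_reschedule() {
    awk '$1 == "when" { ok = ($2 ~ /r/) } END { exit(ok ? 0 : 1) }'
}

# Step 2: edit the global cluster configuration, e.g.
#   qconf -mconf global
# and set a non-zero time (the value here is only an illustration):
#   reschedule_unknown    00:05:00
#
# Helper (not part of Grid Engine): reads `qconf -sconf global` output on
# stdin and succeeds only if reschedule_unknown is present and non-zero.
reschedule_enabled() {
    awk '$1 == "reschedule_unknown" {
             n = split($2, t, ":"); s = 0
             for (i = 1; i <= n; i++) s += t[i]
             ok = (s > 0)
         }
         END { exit(ok ? 0 : 1) }'
}

# Step 3: submit the job with both the checkpointing environment and the
# rerun flag (job.sh is a hypothetical job script):
#   qsub -ckpt my_ckpt -r y job.sh
```

On a configured cluster, the helpers could be used as `qconf -sckpt my_ckpt | ckpt_allows_reschedule && echo "when includes r"` and `qconf -sconf global | reschedule_enabled && echo "rescheduling enabled"`.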



Note that jobs that do not use checkpointing will be rescheduled only if they are running in queues that have the “rerun” option set to true, in addition to being submitted with the “-r y” option. Parallel jobs are rescheduled only if the host on which their master task executes gets into an unknown state.
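
For jobs without checkpointing, the queue setting can be checked in a similar way. Again, this is a minimal sketch: the queue name `all.q` and the script `job.sh` are assumptions, and the helper is not part of Grid Engine. It only looks at the default value on the “rerun” line of `qconf -sq` output and ignores any per-host overrides.

```shell
#!/bin/sh
# Set the queue's rerun attribute to TRUE, e.g. by editing the queue
# (all.q is an assumed queue name):
#   qconf -mq all.q
# and changing the line to:
#   rerun    TRUE
#
# Helper (not part of Grid Engine): reads `qconf -sq <queue>` output on
# stdin and succeeds only if the rerun value starts with TRUE.
queue_rerun_enabled() {
    awk '$1 == "rerun" { ok = (toupper($2) ~ /^TRUE/) } END { exit(ok ? 0 : 1) }'
}

# Then submit the non-checkpointed job with the rerun flag
# (job.sh is a hypothetical job script):
#   qsub -r y job.sh
```

A possible check before submitting would be `qconf -sq all.q | queue_rerun_enabled || echo "rerun is not enabled on all.q"`.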
