Tuesday, October 30, 2007

How to Decipher Grid Engine Statuses – Part I

By Sinisa Veseli

In all likelihood, most Grid Engine (GE) end users and administrators have at some point invoked the qstat command and found themselves wondering what some of the resulting queue and job status letters mean. While some of those letters are fairly intuitive (e.g., ‘E’ stands for error), others are not trivial to decipher. Unfortunately, explanations of these statuses are not easy to find: one usually has to dig through the qstat man pages or through the various GE software manuals available on the web. So, I’ve compiled below information about the possible queue statuses:



• a (alarm) – At least one of the load thresholds defined in the load_thresholds list of the queue configuration is currently exceeded. This state prevents GE from scheduling further jobs to that queue. You can find the reason for the alarm state using the qstat command with the “-explain a” option.



• A (Alarm) – At least one of the suspend thresholds of the queue is currently exceeded. This state causes jobs running in that queue to be successively suspended until no threshold is violated. You can see the reason for this state using the qstat command with the “-explain A” option.



• c (configuration ambiguous) – The queue instance configuration (specified in GE configuration files) is ambiguous. The state resolves when the configuration becomes unambiguous again. This state prevents you from scheduling further jobs to that queue instance. You can find detailed reasons why a queue instance entered this state in the sge_qmaster messages file, or by using the qstat command with the “-explain c” option. For queue instances in this state, the cluster queue's default settings are used for the ambiguous attribute.



• C (Calendar suspended) – The queue has been suspended automatically using the GE calendar facility.



• d (disabled) – Queues are disabled and enabled again using the qmod command. Disabling a queue prevents new jobs from being scheduled for execution in that queue, but it does not affect jobs that are already running there.



• D (Disabled) – The queue has been disabled automatically using the GE calendar facility.



• E (Error) – The queue is in the error state. You can find the reason for this state using the qstat command with the “-explain E” option. Check the error log of the execution daemon (sge_execd) on the affected host for information on how to resolve the problem, and clear the queue state afterwards using the qmod command with the -cq option.



• o (orphaned) – The current cluster queue configuration and host group configuration no longer need this queue instance. The queue instance is kept because unfinished jobs are still associated with it. The orphaned state prevents you from scheduling further jobs to that queue instance. The queue instance disappears from qstat output when these jobs finish. To clear an orphaned queue instance sooner, you can delete the jobs associated with it using the qdel command. You can revive an orphaned queue instance by changing the cluster queue configuration so that the configuration covers that queue instance again.



• s (suspended) – Queues are suspended and un-suspended using the qmod command. Suspending a queue suspends all jobs executing in that queue.



• S (Subordinate) – The queue has been suspended due to subordination to another queue. When a queue is suspended, regardless of the cause, all jobs executing in that queue are suspended too.



• u (unknown) – The corresponding GE execution daemon (sge_execd) cannot be contacted.
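
Since qstat can show several of these letters together in a queue's state column (for example “au”), a small lookup table can help when reading its output. Below is a minimal Python sketch of such a table, built only from the descriptions in the list above. The names QUEUE_STATES and decode_queue_state are my own illustration and are not part of Grid Engine; the sketch also assumes you already have the state string in hand, since it does not run or parse qstat itself.

```python
# Queue state letters, summarized from the list above.
# QUEUE_STATES and decode_queue_state are illustrative names,
# not part of Grid Engine.
QUEUE_STATES = {
    "a": "alarm: a load threshold is exceeded",
    "A": "Alarm: a suspend threshold is exceeded",
    "c": "configuration ambiguous",
    "C": "suspended by the calendar facility",
    "d": "disabled with qmod",
    "D": "disabled by the calendar facility",
    "E": "error",
    "o": "orphaned",
    "s": "suspended with qmod",
    "S": "suspended by subordination to another queue",
    "u": "unknown: sge_execd cannot be contacted",
}


def decode_queue_state(state):
    """Return one description per letter of a queue state string."""
    return [
        QUEUE_STATES.get(letter, "unrecognized state letter %r" % letter)
        for letter in state
    ]


if __name__ == "__main__":
    # Example: a queue whose state column reads "au".
    for description in decode_queue_state("au"):
        print(description)
```

Letters that are not in the table are reported as unrecognized rather than silently dropped, so an unfamiliar state still stands out.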



I hope that those who are new to Grid Engine find the above descriptions useful. In Part II of this article I will cover possible job statuses.

1 comment:

  1. Thanks for the list.
    You would think there would be a simple list like this in the man pages somewhere. Or I have not found it yet.
