By Sinisa Veseli
When visiting client sites, I often notice problems with existing distributed resource management software installations. They range from configuration issues to queues in an error state. Inadequate resources or a poor queue structure usually call for more analysis and better design, but problems such as queues in an error state are easy to detect. Cluster administrators, who are often busy with many other duties, should therefore automate monitoring tasks as much as they can. For example, if you are using Grid Engine, you can easily come up with scripts like the one below, which looks for several different kinds of problems in your SGE installation:
#!/bin/sh

. /usr/local/unicluster/unicluster-user-env.sh

explainProblem() {
  qHost=$1  # queue where the problem is found
  msg=`qstat -f -q $qHost -explain aAEc | tail -1 | sed 's?-??g' | sed '/^$/d'`
  echo $msg
}

checkProblem() {
  description=$1  # problem description
  signature=$2    # problem signature
  for q in `qconf -sql`; do
    cmd="qstat -f -q $q | grep $q | awk '{if(NF>5 && index(\$NF, \"$signature\")>0) print \$1}'"
    qHostList=`eval $cmd`
    if [ "$qHostList" != "" ]; then
      for qHost in $qHostList; do
        msg=`explainProblem $qHost`
        echo "$description on $qHost:"
        echo "  $msg"
        echo ""
      done
    fi
  done
}

echo "Grid Engine Issue Summary"
echo "========================="
echo ""
checkProblem Error E
checkProblem SuspendThreshold A
checkProblem Alarm a
checkProblem ConfigProblem c
Note that the above script should work with Unicluster Express 3.2 installed in the default location (/usr/local/unicluster). It can easily be modified to, for example, send email to administrators when problems that need attention are found. Although simple, such scripts usually go a long way towards ensuring that your Grid Engine installation operates smoothly.
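As a rough illustration of the email idea, here is a minimal wrapper sketch. It assumes the summary script has been saved at a hypothetical path (/usr/local/bin/sge-issue-summary.sh), that a mailx-style `mail` command accepting `-s subject recipient` is available on the host, and that reports should go to `root`. None of these names come from Unicluster itself, so adjust them for your site. The wrapper sends mail only when the report contains at least one "<description> on <queue instance>:" line.

```shell
#!/bin/sh
# Hypothetical wrapper around the summary script shown earlier.
# All three settings are assumptions; override them from the environment.
ADMIN_EMAIL=${ADMIN_EMAIL:-root}
MAILER=${MAILER:-mail}
REPORT_SCRIPT=${REPORT_SCRIPT:-/usr/local/bin/sge-issue-summary.sh}

# Succeeds if the report has at least one problem entry, i.e. a line
# of the form "<description> on <queue instance>:".
hasProblems() {
  printf '%s\n' "$1" | grep -q ' on .*:$'
}

# Mails the report to the administrators, but only if it lists problems.
notifyAdmins() {
  report=$1
  if hasProblems "$report"; then
    printf '%s\n' "$report" | $MAILER -s "Grid Engine issues on `uname -n`" "$ADMIN_EMAIL"
  fi
}

# Run the summary script only if it is actually installed.
if [ -x "$REPORT_SCRIPT" ]; then
  notifyAdmins "`$REPORT_SCRIPT`"
fi
```

Running such a wrapper periodically, for example from cron, means that a clean cluster generates no mail at all, while any queue in an error, alarm, or misconfigured state produces a single message with the explanation attached.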