Monday, November 17, 2008

Automating Grid Engine Monitoring

By Sinisa Veseli

When visiting client sites I often notice various issues with the existing distributed resource management software installations. The problems usually vary from configuration issues to queues in an error state. While things like inadequate resources and queue structure usually require more analysis and better design, problems like queues in an error state are easily detectable. So, cluster administrators, who are often busy with many other duties, should try to automate monitoring tasks as much as they can. For example, if you are using Grid Engine, you can easily come up with scripts like the one below, which looks for several different kinds of problems in your SGE installation:

#!/bin/sh

. /usr/local/unicluster/unicluster-user-env.sh

explainProblem() {
qHost=$1   # queue where the problem is found
msg=`qstat -f -q $qHost -explain aAEc | tail -1 | sed 's?-??g' | sed '/^$/d'`
echo $msg
}

checkProblem() {
description=$1  # problem description
signature=$2    # problem signature
for q in `qconf -sql`; do
cmd="qstat -f -q $q | grep $q | awk '{if(NF>5 && index(\$NF, \"$signature\")>0) print \$1}'"
qHostList=`eval $cmd`
if [ "$qHostList" != "" ]; then
for qHost in $qHostList; do
msg=`explainProblem $qHost`
echo "$description on $qHost:"
echo "  $msg"
echo ""
done
fi
done
}

echo "Grid Engine Issue Summary"
echo "========================="
echo ""
checkProblem Error E
checkProblem SuspendThreshold A
checkProblem Alarm a
checkProblem ConfigProblem c


Note that the above script should work with Unicluster Express 3.2 installed in the default (/usr/local/unicluster) location. It can be easily modified to, for example, send email to administrators in case problems are found that need attention. Although simple, such scripts usually go long way towards ensuring that your Grid Engine installation operates smoothly.

No comments:

Post a Comment