By Sinisa Veseli
You have built and installed your shiny new cluster, installed the Grid Engine software, configured the queues, and announced to the world that your new system is ready to be used. What next? Well, think about your monitoring options…
As users start submitting jobs and hammering the system in every possible way, things will inevitably break on occasion. When something goes wrong in the system, you will want to know about the problem before you start receiving help desk calls and user emails.
The first step in developing an effective strategy for monitoring Grid Engine is learning how to use the available command line tools and how to look for possible issues in the system. Some of the things that you should always pay attention to include:
• queues in the unknown state; instance queue in an unknown state usually means that execution daemon is down on that particular host
• queues and jobs in the error state
• configuration inconsistencies
• load alarms
All of the above information can be easily obtained using the qstat command (e.g., try something like “qstat -f -qs uaAcE -explain aAcE”). It is also not difficult to script basic GE monitoring tasks and come up with a simple infrastructure that is able to alert system administrators to any new or outstanding problems in the system.
As your user base grows, so will your monitoring needs, and you will likely want to extend your monitoring tools. You should consider looking into existing software packages like xml-qstat, which uses XSLT transformations to render Grid Engine command line XML output into different output formats. Alternatively, you can also develop set of your own XSL stylesheets that are customized to your needs, and use widely available command line tools such as xsltproc to generate monitoring web pages from the “qstat -xml” output.
Another interesting Grid Engine monitoring option is the Monitoring Console that comes with Cluster Express (CE). Its main advantage is that it integrates monitoring data from several different sources: Ganglia (system data), Grid Engine Qmaster and ARCo database (job data). However, even though the Cluster Express by itself is easy to install, at the moment integrating the CE Monitoring Console with existing Grid Engine installation requires a little bit of work. I am told that this will be much simplified in the upcoming CE release. In the meantime, if you are really anxious to try the CE Monitoring GUI on your Grid Engine cluster, do not hesitate to send me an email…
You can find some info (and issues I ran into...) on integrating xml-qstat with Grid Engine (in my case the one bundled with Cluster Express) at this forum post:
ReplyDeletehttp://www.grid.org/forum/showthread.php?t=102