Wednesday, May 6, 2009

Parsing the SGE Accounting File

By Rich Wellner

Anyone managing an HPC cluster has probably wondered at some point about the overall performance and usage of their cluster. How many jobs completed last month? What was the average job duration? How long did jobs sit pending in the queue? How many CPU slots did they require? These are all good questions, and the answers are buried somewhere in your DRM’s accounting files.

If you are using Grid Engine, and assuming you have the usual “default cell” installation, the relevant file is $SGE_ROOT/default/common/accounting. The command that extracts information from this file is “qacct”. If you type “man qacct”, you will see that qacct summarizes wall-clock, CPU and system time, broken down by categories such as hostname, queue name, owner name, etc., so there is a good chance that the information you are looking for is readily available. If, however, you need something that qacct does not provide, the accounting file is formatted for easy parsing. Each line in the file corresponds to one computing task, and each line holds more than 40 different accounting fields, separated by the ‘:’ character. The meaning of each field is documented in the man pages (“man accounting”), so getting the information you need with standard UNIX tools should not be difficult.
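As a rough illustration of how little code such parsing takes, here is a minimal Python sketch that computes the number of tasks, the average time spent pending, and the average run time. It rests on several assumptions that you should check against “man accounting” for your own version: that owner, submission time, start time and end time sit at 0-based positions 3, 8, 9 and 10; that the timestamps are whole seconds since the epoch; that lines beginning with ‘#’ are comments; and that a start time of 0 marks a task that never ran. The names JobRecord, parse_accounting and summarize are made up for this example and are not part of Grid Engine.

```python
from collections import namedtuple

# Assumed 0-based field positions; verify with "man accounting" on your system.
OWNER = 3
SUBMISSION_TIME = 8
START_TIME = 9
END_TIME = 10

# Hypothetical record type holding only the fields this sketch needs.
JobRecord = namedtuple("JobRecord", ["owner", "submitted", "started", "ended"])


def parse_accounting(lines):
    """Turn raw accounting lines into JobRecord tuples.

    Blank lines, '#' comment lines, and lines with too few fields are skipped.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) <= END_TIME:
            continue
        records.append(JobRecord(
            owner=fields[OWNER],
            submitted=int(fields[SUBMISSION_TIME]),
            started=int(fields[START_TIME]),
            ended=int(fields[END_TIME]),
        ))
    return records


def summarize(records):
    """Return the task count plus average wait and run times in seconds."""
    # Assumption: a start time of 0 means the task never ran, so leave it out.
    ran = [r for r in records if r.started > 0]
    if not ran:
        return {"tasks": 0, "avg_wait": 0.0, "avg_run": 0.0}
    return {
        "tasks": len(ran),
        "avg_wait": sum(r.started - r.submitted for r in ran) / len(ran),
        "avg_run": sum(r.ended - r.started for r in ran) / len(ran),
    }
```

To try it, open $SGE_ROOT/default/common/accounting, pass the open file object to parse_accounting, and hand the result to summarize. Filtering the records by owner or by a time window before summarizing would answer questions such as “how many jobs did this user complete last month?”.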