Wednesday, May 14, 2008

Grid Engine 6.2 Beta Release

By Sinisa Veseli

Grid Engine 6.2 will come with some interesting new features. In addition to advance resource reservations and array job interdependencies, this release will also contain a new Service Domain Manager (SDM) module, which will allow distributing computational resources among different services, such as different Grid Engine clusters or application servers. For example, SDM will be able to withdraw unneeded machines from one cluster (or application server) and assign them to a different one, or keep them in its “spare resource pool”.

It is also worth mentioning that Grid Engine (and SDM) documentation is moving to Sun’s wiki.
The 6.2 beta release is available for download here.

Sunday, May 4, 2008

About Parallel Environments in Grid Engine

By Sinisa Veseli

Support for parallel jobs in distributed resource management software is probably one of those features that most people do not use, but those who do appreciate it a lot. Grid Engine supports parallel jobs via parallel environments (PE) that can be associated with cluster queues.

A new parallel environment is created using the qconf -ap <environment name> command and editing the configuration file that pops up. Here is an example of a PE slightly modified from the default configuration:

$ qconf -sp simple_pe
pe_name           simple_pe
slots             4
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $round_robin
control_slaves    FALSE
job_is_first_task FALSE
urgency_slots     min


In the above example, “slots” defines the number of parallel tasks that can be run concurrently. The “user_lists” (“xuser_lists”) parameter should be a comma-separated list of user names that are allowed (denied) use of the given PE. If “user_lists” is set to NONE, any user that is not explicitly disallowed via the “xuser_lists” parameter can use the PE.
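
For instance, to restrict the PE to a particular group of users, one could put those users on a Grid Engine access list and reference it in “user_lists”. A minimal sketch (the user “jdoe” and the list “power_users” are made-up names):

# add user jdoe to the access list power_users (the list is created if it does not exist)
qconf -au jdoe power_users
# then edit the PE and set:  user_lists  power_users
qconf -mp simple_pe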

The “start_proc_args” and “stop_proc_args” parameters represent the command lines of the startup and shutdown procedures for the parallel environment. These commands are usually scripts customized for the specific parallel library intended for a given PE. They get executed for each parallel job, and are used, for example, to start any daemons that enable parallel job execution. The standard output (error) of these commands is redirected into <job name>.po<job id> (<job name>.pe<job id>) files in the job’s working directory, which is usually the user’s home directory. It is worth noting that the customized PE startup and shutdown scripts can make use of several internal variables, such as $pe_hostfile and $job_id, that are relevant for the parallel job. The $pe_hostfile variable in particular points to a temporary file that contains the list of machines and parallel slots allocated for the given job. For example, setting “start_proc_args” to “/bin/cp $pe_hostfile /tmp/machines.$job_id” would copy $pe_hostfile to the /tmp directory. Some of these internal variables are also available to job scripts as environment variables: in particular, the $PE_HOSTFILE and $JOB_ID environment variables will be set and will correspond to $pe_hostfile and $job_id, respectively.
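
To give an idea of what that file looks like, each line of $pe_hostfile describes one host allocated to the job: the host name, the number of slots granted on that host, the queue instance, and a processor range. A hypothetical file for a 3-slot job spread over two hosts (host and queue names are made up) might contain:

node01 2 all.q@node01 UNDEFINED
node02 1 all.q@node02 UNDEFINED

This is the same format that the master script at the end of this post reads with “read host slots q procs”.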

The “allocation_rule” parameter helps the scheduler decide how to distribute parallel processes among the available machines. It can take an integer that fixes the number of processes per host, or one of the special rules $pe_slots (all processes have to be allocated on a single host), $fill_up (start filling up slots on the best suitable host, and continue until all slots are allocated), and $round_robin (allocate slots one by one on each allocated host in a round-robin fashion until all slots are filled).
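
As a rough illustration (made-up host names, assuming two otherwise idle hosts with 4 slots each), a 4-slot job might be placed like this under the different rules:

$pe_slots      node01 gets all 4 slots
$fill_up       node01 gets all 4 slots (best suitable host filled first)
$round_robin   node01 gets 2 slots, node02 gets 2 slots
2              node01 gets 2 slots, node02 gets 2 slots (fixed 2 per host)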

The “control_slaves” parameter is slightly confusing. It indicates whether or not the Grid Engine execution daemon creates the parallel tasks for a given application. In most cases (e.g., for MPI or PVM) this parameter should be set to FALSE, as custom Grid Engine PE interfaces are required for getting control of parallel tasks to work. Similarly, the “job_is_first_task” parameter is only relevant if “control_slaves” is set to TRUE. It indicates whether or not the original job script submitted for execution is part of the parallel program.

The “urgency_slots” parameter is used for jobs that request a range of parallel slots. If an integer value is specified, that number is used as the prospective slot amount. If “min”, “max”, or “avg” is specified, the prospective slot amount will be determined as the minimum, maximum, or average of the requested slot range, respectively.
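
For example, a job can ask for a slot range at submission time (a hypothetical submission; “myjob.sh” is a made-up script name):

qsub -pe simple_pe 2-8 myjob.sh

With “urgency_slots min”, as in the configuration above, the scheduler would assume 2 slots when computing the job’s resource-based urgency before the actual allocation is known.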

After a parallel environment is configured and added to the system, it can be associated with any existing queue by setting the “pe_list” parameter in the queue configuration (a qconf sketch for this is given at the end of this post); at this point users should be able to submit parallel jobs. On the GE project site one can find a number of nice How-To documents related to integrating various parallel libraries. If you do not have the patience to build and configure one of those, but you would still like to see how things work, you can try adding a simple PE (like the one shown above) to one of your queues, and use a simple ssh-based master script to spawn and wait on the slave tasks:

#!/bin/sh
#$ -S /bin/sh
# Spawn one ssh-based "slave" task per slot listed in the PE host file.
slaveCnt=0
while read host slots q procs; do
  slotCnt=0
  while [ $slotCnt -lt $slots ]; do
    slotCnt=`expr $slotCnt + 1`
    slaveCnt=`expr $slaveCnt + 1`
    ssh $host "/bin/hostname; sleep 10" > /tmp/slave.$slaveCnt.out 2>&1 &
  done
done < $PE_HOSTFILE
# Wait for all background slave tasks to finish.
while [ $slaveCnt -gt 0 ]; do
  wait
  slaveCnt=`expr $slaveCnt - 1`
done
echo "All done!"

After saving this script as "master.sh" and submitting your job using something like "qsub -pe simple_pe 3 master.sh" (where 3 is the number of parallel slots requested), you should be able to see your "slave" tasks running on the allocated machines. Note, however, that you must have password-less ssh access to the designated parallel compute hosts in order for the above script to work.
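
Finally, here is a rough sketch of the queue side mentioned above, i.e., attaching the PE to a queue and submitting against it from the command line (“all.q” is just an example queue name; adjust it to whatever queues exist in your cluster):

# add simple_pe to the queue's pe_list attribute
qconf -aattr queue pe_list simple_pe all.q
# submit the master script, requesting 3 parallel slots
qsub -pe simple_pe 3 master.sh
# watch the queue instances and the slots granted to the job
qstat -f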