Friday, December 5, 2008

Using Grid Engine Subordinate Queues

By Sinisa Veseli

It is often the case that certain classes of computing jobs have high priority and require immediate access to cluster resources in order to complete on time. For such jobs one could reserve a special queue with an assigned set of nodes, but this solution wastes resources whenever no high priority jobs are running. An alternative approach in Grid Engine is to use subordinate queues, which allow preemption of job slots. This works as follows:

  1. A higher priority queue is configured with one or more subordinate queues.
  2. Jobs running in subordinate queues are suspended when the higher priority queue becomes busy, and they are resumed when the higher priority queue is no longer busy.
  3. For any subordinate queue one can configure the number of job slots (the “Max Slots” parameter in the qmon “Subordinates” tab of the higher priority queue configuration) that must be filled in the higher priority queue to trigger suspension. If “max slots” is not specified, then all job slots in the higher priority queue must be filled to trigger suspension of the subordinate queue. A configuration sketch is shown below.
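The slot threshold can also be set directly in the queue definition on the command line. Here is a minimal sketch, assuming the queue names used in this article (the threshold value of 1 is purely illustrative):

user@sgetest> qconf -mq high    # opens the queue definition in an editor
subordinate_list      low=1 medium

With this setting, “low=1” suspends the “low” queue as soon as a single slot in “high” is occupied, while “medium”, having no explicit threshold, is suspended only when all slots in “high” are filled.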
In order to illustrate this, on my test machine I’ve set up three queues (“low”, “medium” and “high”) intended to run jobs with different priorities. The “high” queue has both “low” and “medium” queues as subordinates, while the “medium” queue has “low” as its subordinate:
user@sgetest> qconf -sq high | grep subordinate
subordinate_list      low medium
user@sgetest> qconf -sq medium | grep subordinate
subordinate_list      low
user@sgetest> qconf -sq low | grep subordinate
subordinate_list      NONE
After submitting a low priority array job to the “low” queue, qstat returns the following information:
user@sgetest> qsub -t 1-10 -q low low_priority_job.sh
Your job-array 19.1-10:1 ("low_priority_job.sh") has been submitted
user@sgetest> qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
high@sgetest.univaud.com    BIP   0/2       0.07     lx24-amd64
----------------------------------------------------------------------------
medium@sgetest.univaud.com    BIP   0/2       0.07     lx24-amd64
----------------------------------------------------------------------------
low@sgetest.univaud.com    BIP   2/2       0.07     lx24-amd64
19 0.55500 low_priori user       r     11/24/2008 17:05:02     1 1
19 0.55500 low_priori user       r     11/24/2008 17:05:02     1 2

############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
19 0.55500 low_priori user       qw    11/24/2008 17:04:50     1 3-10:1
Note that all available job slots on my test machine are full. Submission of the medium priority array job to the “medium” queue results in suspension of the previously running low priority tasks (this is indicated by the letter “S” next to the task listing in the qstat output):
user@sgetest> qsub -t 1-10 -q medium medium_priority_job.sh
Your job-array 20.1-10:1 ("medium_priority_job.sh") has been submitted
user@sgetest> qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
high@sgetest.univaud.com    BIP   0/2       0.06     lx24-amd64
----------------------------------------------------------------------------
medium@sgetest.univaud.com    BIP   2/2       0.06     lx24-amd64
20 0.55500 medium_pri user       r     11/24/2008 17:05:17     1 1
20 0.55500 medium_pri user       r     11/24/2008 17:05:17     1 2
----------------------------------------------------------------------------
low@sgetest.univaud.com    BIP   2/2       0.06     lx24-amd64    S
19 0.55500 low_priori user       S     11/24/2008 17:05:02     1 1
19 0.55500 low_priori user       S     11/24/2008 17:05:02     1 2

############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
19 0.55500 low_priori user       qw    11/24/2008 17:04:50     1 3-10:1
20 0.55500 medium_pri user       qw    11/24/2008 17:05:15     1 3-10:1
Finally, submission of a high priority array job to the “high” queue results in the previously running medium priority tasks being suspended:
user@sgetest> qsub -t 1-10 -q high high_priority_job.sh
Your job-array 21.1-10:1 ("high_priority_job.sh") has been submitted
user@sgetest> qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
high@sgetest.univaud.com    BIP   2/2       0.06     lx24-amd64
21 0.55500 high_prior user       r     11/24/2008 17:06:02     1 1
21 0.55500 high_prior user       r     11/24/2008 17:06:02     1 2
----------------------------------------------------------------------------
medium@sgetest.univaud.com    BIP   2/2       0.06     lx24-amd64    S
20 0.55500 medium_pri user       S     11/24/2008 17:05:17     1 1
20 0.55500 medium_pri user       S     11/24/2008 17:05:17     1 2
----------------------------------------------------------------------------
low@sgetest.univaud.com    BIP   2/2       0.06     lx24-amd64    S
19 0.55500 low_priori user       S     11/24/2008 17:05:02     1 1
19 0.55500 low_priori user       S     11/24/2008 17:05:02     1 2

############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
19 0.55500 low_priori user       qw    11/24/2008 17:04:50     1 3-10:1
20 0.55500 medium_pri user       qw    11/24/2008 17:05:15     1 3-10:1
21 0.00000 high_prior user       qw    11/24/2008 17:05:52     1 3-10:1


Medium priority tasks will be resumed after all high priority tasks are done, and low priority tasks will run after the medium priority job finishes.

One thing worth pointing out is that Grid Engine queue subordination is implemented at the queue instance level. In other words, if I had machine "A" associated with my “low” queue, but not with the “high” or “medium” queues, jobs running on machine "A" would not be suspended even if there were higher priority jobs waiting to be scheduled.

Tuesday, December 2, 2008

Managing Resource Quotas in Grid Engine

By Sinisa Veseli

It is often the case that cluster administrators must impose limits on the use of certain resources. A good example is preventing a particular user (or a set of users) from utilizing an entire queue (or cluster) at any point. If you’ve ever tried doing something like that in Grid Engine (SGE), then you know that it is not immediately obvious how to impose limits on resource usage.

SGE has a concept of “resource quota sets” (RQS), which can be used to limit maximum resource consumption by any job. The relevant qconf command line switches for manipulating resource quota sets are “-srqs” and “-srqsl” (show), “-arqs” (add), “-mrqs” (modify) and “-drqs” (delete).

Each RQS must have the following parameters: name, description, enabled and limit. The RQS name cannot contain spaces, but its description can be an arbitrary string. The boolean “enabled” flag specifies whether the RQS is enabled or not, while each “limit” field denotes a resource quota rule consisting of an optional name, filters for a specific job request, and the resource quota limit. Note that one can have multiple “limit” fields associated with a given RQS. For example, the following RQS prevents user “ahogger” from occupying more than 1 job slot in general, and also prevents the same user from running jobs in the headnodes.q queue:

$ qconf -srqs ahogger_job_limit
{
name         ahogger_job_limit
description  "limit ahogger jobs"
enabled      TRUE
limit        users ahogger to slots=1
limit        users {ahogger} queues {headnodes.q} to slots=0
}
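As another illustration, here is a hedged sketch of a quota (the rule set name and the limit value are made up for this example) that caps every user at 32 slots across the entire cluster; the {*} filter expands the rule per user instead of applying one shared total:

$ qconf -srqs max_slots_per_user
{
name         max_slots_per_user
description  "cap each user at 32 slots cluster-wide"
enabled      TRUE
limit        users {*} to slots=32
}

Such a rule set would be created with “qconf -arqs max_slots_per_user”, which opens the definition in an editor.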


The exact format in which RQS have to be specified is, like everything else, well documented in SGE man pages (“man sge_resource_quota”).

Monday, November 17, 2008

Automating Grid Engine Monitoring

By Sinisa Veseli

When visiting client sites I often notice various issues with the existing distributed resource management software installations. The problems usually vary from configuration issues to queues in an error state. While things like inadequate resources and queue structure usually require more analysis and better design, problems like queues in an error state are easily detectable. So, cluster administrators, who are often busy with many other duties, should try to automate monitoring tasks as much as they can. For example, if you are using Grid Engine, you can easily come up with scripts like the one below, which looks for several different kinds of problems in your SGE installation:

#!/bin/sh

. /usr/local/unicluster/unicluster-user-env.sh

# Ask qstat to explain the state of a given queue instance.
explainProblem() {
    qHost=$1   # queue instance (queue@host) where the problem was found
    msg=`qstat -f -q $qHost -explain aAEc | tail -1 | sed 's?-??g' | sed '/^$/d'`
    echo $msg
}

# Scan all queues for instances whose state column contains the given
# signature letter, and print an explanation for each match.
checkProblem() {
    description=$1  # problem description
    signature=$2    # problem signature (state letter in qstat output)
    for q in `qconf -sql`; do
        cmd="qstat -f -q $q | grep $q | awk '{if(NF>5 && index(\$NF, \"$signature\")>0) print \$1}'"
        qHostList=`eval $cmd`
        if [ "$qHostList" != "" ]; then
            for qHost in $qHostList; do
                msg=`explainProblem $qHost`
                echo "$description on $qHost:"
                echo "  $msg"
                echo ""
            done
        fi
    done
}

echo "Grid Engine Issue Summary"
echo "========================="
echo ""
checkProblem Error E
checkProblem SuspendThreshold A
checkProblem Alarm a
checkProblem ConfigProblem c


Note that the above script should work with Unicluster Express 3.2 installed in the default (/usr/local/unicluster) location. It can easily be modified to, for example, send email to administrators when problems that need attention are found. Although simple, such scripts usually go a long way towards ensuring that your Grid Engine installation operates smoothly.
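For instance, assuming the script is saved as /usr/local/sbin/check_sge.sh (the path and the mail alias here are assumptions for this sketch), a crontab entry along these lines would run the check every 15 minutes and mail the report:

*/15 * * * * /usr/local/sbin/check_sge.sh 2>&1 | mail -s "Grid Engine issue summary" sge-admin@example.com

Since the script always prints its header, a small wrapper that suppresses empty reports would avoid mailing when there is nothing to act on.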

Thursday, November 6, 2008

Who Cares What's inside a Cloud?

By Roderick Flores

When I consider my microwave, telephone, or television, I see fairly sophisticated applications that I simply plug into service providers to get useful results. If I choose to switch between individual service providers I can do so easily (assuming certain levels of deregulation of utility monopolies, of course). Most importantly, while I understand how these appliances work, I would never want to build one myself. Yet I am not required to, because the providers use standardized interfaces that appliance manufacturers can easily offer: I buy my appliances as I might any other tool. Consequently, I can switch out the manufacturer or model for each of the services I use without interacting with the provider. I use these tools in a way that makes my work and life more efficient.



Nobody listens in on my conversations, nor do they receive services at my expense, I can use these services how I wish, and because of competition, I can expect an outstanding quality of service. At the end of the month, I get a bill from my providers for the services I used. These monetary costs are far outweighed by the convenience these services offer.



It is this sort of operational simplicity that motivated the first call for computational power as a utility in 1965. Like the electrical grid, a consumer would simply plug in their favorite application and use the compute power offered by a provider. Beginning in the 1990s, this effort centered around the concept of Grid computing.



Just like the early days of electricity services, there were many issues with providing Grid computing. The very first offerings were proprietary or narrowly focused. The parallels with the electric industry are easily recognized. Some might provide street lighting, whereas others would provide power for home lighting, still others for transportation, and yet another group for industrial applications. Moreover, each provider used different interfaces to get the power. Thus switching between providers, not a rare occurrence in a volatile industry, was no small undertaking. This was clearly very costly for the consumer.



It took an entrepreneur to come to the industry and unify electrical services for all applications while also creating a standardized product (see http://www.eei.org/industry_issues/industry_overview_and_statistics/history for a quick overview). Similarly several visionaries had to step in and define what a Grid computer needed to do in order to create a widely consumable product. While these goals were largely met and several offerings became very successful, Grid computing never really became the firmly rooted utility-like service that we hoped for. Rather, it seems to have become an offering for specialized high-performance computing users.



This market is not the realm of service that I started thinking about early in this post. Take television service: this level of service is neither for a single viewer nor a small-business who might want to repackage a set of programs to its customers (say a sports bar). Rather it is for large-scale industries whose service requirements are unimaginable by all but a few people. I cannot even draw a parallel to television service. In telecommunication it would be the realm of a CLEC.



Furthermore, unlike my microwave, I am expected to customize my application to work well on a grid. I cannot simply plug it in and get better service than I can from my own PC. It would be the equivalent of choosing to reheat my food on my stove or building my own microwave. You see, my microwave, television service, and phone services are not just basic offerings of food preparation, entertainment, and communication. Instead, these are sophisticated systems that make my work and life easier. Grid computing, while very useful, does not simplify program implementation.



So in steps cloud computing: an emerging technology that seems to have significant overlap with grid computing while also providing simplifying services (something as a service). I may still have to assemble a microwave from pre-built pieces but everything is ready for me to use. I only have to add my personal touches to assemble a meal. It really isn't relevant whether the microwave is central to the task or just one piece of many.



When I approach a task that I hope to solve using a program, how might I plug that in just as easily? Let's quickly consider how services are provided for television. When I plug my application (a TV) into an electricity provider as well as a broadcaster of some sort, it just works. I can change the channel to the streams that I like. I can buy packages that provide me the best set of streams. In addition, some providers will offer me on-demand programming as well as internet and telephone services. If anything breaks, I call a number and they deal with it. None of this requires anything of me. I pay my bill and I get services.



Okay, how would that work for a computation? Say I want to find the inverse for a matrix. I would send out my data to the channel that inverted matrices the way I like them. The provider will worry about attaining the advertised performance, reliability, scalability, security, sustainability, device/location independence, tenancy, and capital expenditure: those characteristics of the cloud that I could not care less about. Additionally, the cloud properties that Rich Wellner assembled don't interest me much either. Certainly they may be differentiators, but the actual implementation is somebody else's problem in the same way that continuous electrical service provision is not my chief concern when I turn on the TV. What I want and will get is an inverse to the matrix I submitted in the time frame I requested deposited where I requested it to be put. I may use the inverted matrix to simultaneously solve for earthquake locations and earth properties or for material stresses and strains in a two-dimensional plate. That is my recipe and my problem.



After all, I should get services "without knowledge of, expertise with, or control over the technology infrastructure that supports them," as the cloud computing wiki page claims. Essentially, the aforementioned cloud characteristics are directed towards service providers rather than to the non-expert consumer highlighted in the wiki definition. Isn't the differentiator between the Cloud and the Grid the concealment of the complex infrastructure underneath? If the non-expert consumer is expected to worry about algorithm scalability, distributing data, starting and stopping resources, and all of that, they will certainly need to gain some expertise quickly. Further, once they have that skill, why wouldn't they just use a mature Grid offering rather than deal with the non-standardized and chaotic clouds? Are these provider-specific characteristics not just a total rebranding of Grid?



As such, I suggest that several consumer-based characteristics should replace the rather inconsequential provider-internal ones that currently exist.



A cloud is characterized by services that:



  • use a specified algorithm to solve a particular problem;
  • can be purchased for one-time, infrequent use, or regular use;
  • state their peak, expected, and minimum performances;
  • state the expected response time;
  • can be queried for changes to expected response time;
  • support asynchronous messaging. A consumer must be able to discover when things are finished;
  • use standard, open, general-purpose protocols and interfaces (clearly);
  • have specified entry-points;
  • can interact with other cloud service providers. In particular, a service should be able to send output to long-term cloud-storage providers.


Now that sounds more like Computation-as-a-Service.

Monday, November 3, 2008

Cloud Computing: Commodity or Value Sale?

By Rich Wellner

There is a controversy in the cloud community today about whether the market is going to be one based on value or price. Rephrased: will cloud computing be a commodity or an enablement technology?



A poster on one of the cloud computing lists asserted that electricity would be a key component of pricing. He was then jumped on by people saying that value would be the key.



It seems like folks are talking past one another.



His assertion is true if cloud computing is a commodity.



Now that said, there are precious few commodities in IT. Maybe internet connectivity is one. Monitors might be another. Maybe there are a few more.



But very quickly you get past swappable components that do very nearly the same job and into the realm of 'stuff' that is not easily replaceable. Then the discussion turns to one of value.



Amazon recognized that books were a commodity and won the war against people who were trying to sell value. They appear to be attempting the same with computer time, which makes the battle they will fight over the next few years with Microsoft (and an increasing number of smaller players) all the more interesting.



There is also the problem of making sweeping statements like "the market will figure things out". There is no "the market". Even on Wall Street. The reason things happen is because different people and institutions have different investment goals. Those goals vary over time and create growing or shrinking windows of opportunity for other people and institutions.



I've made my bet on how "the market" for cloud computing will shake out in the short to medium term. Now I'm just hoping that there are enough of the people and institutions my bet is predicated on in existence.

Wednesday, October 29, 2008

Elastic Management of Computing Clusters

By Ignacio Martin Llorente

Besides all the hype, clouds (i.e. a service for the on-demand provision of virtual machines; others would say IaaS) are making utility computing a reality; see for example the Amazon EC2 case studies. This new model, and virtualization technologies in general, is also being actively explored by the scientific community. There are quite a few initiatives that integrate virtualization with a range of computing platforms, from clusters to Grid infrastructures. Once this integration is achieved, the next step is natural: jump to the clouds and provision the VMs from an external site. For example, recent work from UNIVA UD has demonstrated the feasibility of supplementing a UNIVA Express cluster with EC2 resources (you can download the whitepaper to learn more).


OpenNebula virtual infrastructure engine components and its integration with Amazon EC2


This cloud provision model can be further integrated with the in-house physical infrastructure when it is combined with a virtual machine (VM) management system, like OpenNebula. A VM manager is responsible for the efficient management of the virtual infrastructure as a whole, by providing basic functionality for the deployment, control and monitoring of VMs on a distributed pool of resources. The use of this new virtualization layer decouples the computing cluster from the physical infrastructure, and so extends the classical benefits of VMs to the cluster level (i.e. cluster consolidation, cluster isolation, cluster partitioning and elastic cluster capacity).


Architecture of an Elastic Cluster

A computing cluster can be easily virtualized by putting the front-end and worker nodes into VMs. In our case, the virtual cluster front-end (the SGE master host) is deployed on the local resources with Internet connectivity so that it can communicate with the Amazon EC2 VMs. This cluster front-end also acts as the NFS and NIS server for every worker node in the virtual cluster.


The virtual worker nodes communicate with the front-end through a private local area network. The local worker nodes are connected to this vLAN through a virtual bridge configured on every physical host. The EC2 worker nodes are connected to the vLAN with an OpenVPN tunnel, which is established between each remote node (OpenVPN clients) and the cluster front-end (OpenVPN server). With this configuration, every worker node (either local or remote) can communicate with the front-end and use the common network services transparently. The architecture of the cluster is shown in the following figure:


Virtual Cluster Architecture

Figure courtesy of Prof. Rafael Moreno
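To make the tunnel setup concrete, here is a minimal bridged OpenVPN configuration sketch along the lines described above (the device names, addresses, and certificate file names are assumptions for illustration, not the exact configuration used in this work):

# server.conf on the cluster front-end; tap0 is attached to the cluster bridge
port 1194
proto udp
dev tap0
server-bridge 10.1.0.1 255.255.255.0 10.1.0.128 10.1.0.254
ca ca.crt
cert server.crt
key server.key
dh dh1024.pem
keepalive 10 60

# client.conf on each EC2 worker node
client
proto udp
dev tap
remote frontend.example.com 1194
ca ca.crt
cert worker.crt
key worker.key

Because the tunnel is bridged (tap rather than tun), the remote workers appear on the same layer-2 network as the local ones, which is what allows NFS and NIS to work transparently.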


Deploying an SGE cluster with OpenNebula and Amazon EC2

The latest release of OpenNebula includes a driver to deploy VMs in the EC2 cloud, and so it integrates the Amazon infrastructure with your local resources. EC2 is managed by OpenNebula just like another local resource, with a configurable pre-fixed size to limit the cluster capacity (i.e. SGE worker nodes) that can be allocated in the cloud. In this set-up, your local resources would look as follows:


>onehost list
HID NAME     RVM      TCPU   FCPU   ACPU    TMEM    FMEM STAT
   0 ursa01     0       800    798    800 8387584 7663616  off
   1 ursa02     0       800    798    800 8387584 7663616  off
   2 ursa03     0       800    798    800 8387584 7663616  on
   3 ursa04     2       800    798    600 8387584 6290432  on
   4 ursa05     1       800    799    700 8387584 7339008  on
   5 ec2        0       500    500    500 8912896 8912896  on

The last line corresponds to EC2, currently configured to host up to 5 m1.small instances.


The OpenNebula EC2 driver translates a general VM deployment file into an EC2 instance description. The driver assumes that a suitable Amazon machine image (AMI) has been previously packed and registered in the S3 storage service, so when a given VM is to be deployed in EC2 its AMI counterpart is instantiated. A typical SGE worker node VM template would look like this:


NAME   = sge_workernode
CPU    = 1
MEMORY = 128                                                            

#Xen or KVM template machine, used when deploying in the local resources
OS   = [kernel="/vmlinuz",initrd="/initrd.img",root="sda1" ]
DISK = [source="/images/sge/workernode.img",target="sda",readonly="no"]
DISK = [source="/images/sge/workernode.swap",target="sdb",readonly="no"]
NIC  = [bridge="eth0"]

#EC2 template machine, this will be used when submitting this VM to EC2
EC2 = [ AMI="ami-d5c226bc",
        KEYPAIR="gsg-keypair",
        AUTHORIZED_PORTS="22",
        INSTANCETYPE=m1.small]
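For reference, the EC2 section of this template corresponds roughly to the following calls with Amazon's command line EC2 API tools; this is a hedged sketch of the equivalent manual steps, not the driver's actual implementation:

# launch one m1.small instance of the registered AMI with the given keypair
ec2-run-instances ami-d5c226bc -k gsg-keypair -t m1.small
# open the authorized port (SSH) in the default security group
ec2-authorize default -p 22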

Once deployed, the cluster would look like this (an SGE master, 2 local worker nodes and 2 EC2 worker nodes):


>onevm list
  ID      NAME STAT CPU     MEM        HOSTNAME        TIME
  27  sgemast runn 100 1232896          ursa05 00 00:41:57
  28  sgework runn 100 1232896          ursa04 00 00:31:45
  29  sgework runn 100 1232896          ursa04 00 00:32:33
  30  sgework runn   0       0             ec2 00 00:23:12
  31  sgework runn   0       0             ec2 00 00:21:02

You can get additional info from your EC2 VMs, such as the IP address, using the onevm show command.


So, it is easy to manage your virtual cluster with OpenNebula and EC2, but what about efficiency? Besides the inherent overhead induced by virtualization (around 10% for processing), the average deployment time of a remote EC2 worker node is 23.6s, while a local one takes only 3.3s. Moreover, when executing an HTC workload, the overhead induced by using EC2 (the VPN and a slower network connection) is negligible.


Ruben S. Montero


This is a joint work with Rafael Moreno and Ignacio M. Llorente


Reprinted from blog.dsa-research.org 

Monday, October 20, 2008

Auditing the Cloud

By Rich Wellner

I've written here about the importance of SLAs for useful cloud computing platforms on a few occasions in the past. The idea behind clouds, that you can get access to resources on demand, is an appealing one. However, it is only part of the total picture. Without an ability to state what you want and go to bed, there isn't much value in the cloud.



Think about that for a minute. With the cloud computing offerings currently available there are no meaningful SLAs written down anywhere. Yet people, every day, run their production applications on an implicit SLA internalized as something like "Amazon is going to give me N units of work for M price".



There are two problems with this.



  • Amazon doesn't scale your resources. Your demand may have spiked and you are still running on the resource you signed up for.
  • There is no audit capability on EC2.
In the Cloud Computing Bill of Rights we wrote about three important attributes that need to be available to do an audit.
  • Events -- The state changes and other factors that affected your system availability.
  • Logs -- Comprehensive information about your application and its runtime environment.
  • Monitoring -- Should not be intrusive and must be limited to what the cloud provider reasonably needs in order to run their facility.

The idea here is that rather than just accepting whatever bill your cloud provider sends you at the end of the month, the world of cloud computing is complex enough that a reasonable set of runtime information must be made available to substantiate the provider's claim for compensation.

This is particularly true in the world of SLAs. If my infrastructure is regularly scaling up, out, down, or in to meet demand, it is essential to be able to verify that the infrastructure is reacting the way that was contracted. Without that, it will be very hard to get people to trust the cloud.