Technical and Scientific Computing with Grid Engine

Monday, November 17, 2008

Automating Grid Engine Monitoring

By Sinisa Veseli

When visiting client sites I often notice various issues with the existing distributed resource management software installations. The problems usually vary from configuration issues to queues in an error state. While things like inadequate resources and queue structure usually require more analysis and better design, problems like queues in an error state are easily detectable. So, cluster administrators, who are often busy with many other duties, should try to automate monitoring tasks as much as they can. For example, if you are using Grid Engine, you can easily come up with scripts like the one below, which looks for several different kinds of problems in your SGE installation:

#!/bin/sh

. /usr/local/unicluster/unicluster-user-env.sh

explainProblem() {
qHost=$1   # queue where the problem is found
msg=`qstat -f -q $qHost -explain aAEc | tail -1 | sed 's?-??g' | sed '/^$/d'`
echo $msg
}

checkProblem() {
description=$1  # problem description
signature=$2    # problem signature
for q in `qconf -sql`; do
cmd="qstat -f -q $q | grep $q | awk '{if(NF>5 && index(\$NF, \"$signature\")>0) print \$1}'"
qHostList=`eval $cmd`
if [ "$qHostList" != "" ]; then
for qHost in $qHostList; do
msg=`explainProblem $qHost`
echo "$description on $qHost:"
echo "  $msg"
echo ""
done
fi
done
}

echo "Grid Engine Issue Summary"
echo "========================="
echo ""
checkProblem Error E
checkProblem SuspendThreshold A
checkProblem Alarm a
checkProblem ConfigProblem c

Note that the above script should work with Unicluster Express 3.2 installed in the default (/usr/local/unicluster) location. It can be easily modified to, for example, send email to administrators in case problems are found that need attention. Although simple, such scripts usually go long way towards ensuring that your Grid Engine installation operates smoothly.

Thursday, November 6, 2008

Who Cares What's inside a Cloud?

By Roderick Flores

When I consider my microwave, telephone, or television I see fairly sophisticated applications that I simply plug into service providers and get useful results. If I choose to switch between individual service providers I can do so easily (assuming certain levels of deregulation of utility monopolies of course). Most importantly, while I understand how these appliances work, I would never want to build one myself. Yet I am not required to do so because the providers use standardized interfaces that appliance manufactures can easily offer: I buy my appliances as I might any other tool. Consequently, I can switch out the manufacturer or models for each of the services I use without interacting with the provider. I use these tools in a way that makes my work and life more efficient.

Nobody listens in on my conversations, nor do they receive services at my expense, I can use these services how I wish, and because of competition, I can expect an outstanding quality of service. At the end of the month, I get a bill from my providers for the services I used. These monetary costs are far outweighed by the convenience these services offer.

It is this sort of operational simplicity that motivated the first call for computational power as a utility in 1965. Like the electrical grid, a consumer would simply plug in their favorite application and use the compute power offered by a provider. Beginning in the 1990s, this effort centered around the concept of Grid computing.

Just like the early-days of electricity services, there were many issues with providing Grid computing. The very first offerings were proprietary or narrowly focused. The parallels with the electric industry are easily recognized. Some might provide street lighting whereas others would provide power for home lighting and still others for transportation and yet another group industrial applications. Moreover, each provider used different interfaces to get the power. Thus switching between providers, not a rare occurrence in a volatile industry, was no small undertaking. This, clearly was very costly for the consumer.

It took an entrepreneur to come to the industry and unify electrical services for all applications while also creating a standardized product (see http://www.eei.org/industry_issues/industry_overview_and_statistics/history for a quick overview). Similarly several visionaries had to step in and define what a Grid computer needed to do in order to create a widely consumable product. While these goals were largely met and several offerings became very successful, Grid computing never really became the firmly rooted utility-like service that we hoped for. Rather, it seems to have become an offering for specialized high-performance computing users.

This market is not the realm of service that I started thinking about early in this post. Take television service: this level of service is neither for a single viewer nor a small-business who might want to repackage a set of programs to its customers (say a sports bar). Rather it is for large-scale industries whose service requirements are unimaginable by all but a few people. I cannot even draw a parallel to television service. In telecommunication it would be the realm of a CLEC.

Furthermore, unlike my microwave, I am expected to customize my application to work well on a grid. I cannot simply plug it in and get better service than I can from my own PC. It would be the equivalent of choosing to reheat my food on my stove or building my own microwave. You see, my microwave, television service, and phone services are not just basic offerings of food preparation, entertainment, and communication. Instead, these are sophisticated systems that make my work and life easier. Grid computing, while very useful, does not simplify program implementation.

So in steps cloud computing: an emerging technology that seems to have significant overlap with grid computing while also providing simplifying services (something as a service). I may still have to assemble a microwave from pre-built pieces but everything is ready for me to use. I only have to add my personal touches to assemble a meal. It really isn't relevant whether the microwave is central to the task or just one piece of many.

When I approach a task that I hope to solve using a program, how might I plug that in just as easily? Let's quickly consider how services are provided for television. When I plug my application(TV) in to the electricity provider as well as a broadcaster of some sort, it just works. I can change the channel to the streams that I like. I can buy packages that provide me the best set of streams. In addition, some providers will offer me on-demand programming as well as internet and telephone services. If anything breaks, I call a number and they deal with it. None of this requires anything of me. I pay my bill and I get services.

Okay, how would that work for a computation? Say I want to find the inverse for a matrix. I would send out my data to the channel that inverted matrices the way I like them. The provider will worry about attaining the advertised performance, reliability, scalability, security, sustainability, device/location independence, tenancy, and capital expenditure: those characteristics of the cloud that I could not care less about. Additionally, the cloud properties that Rich Wellner assembled don't interest me much either. Certainly they may be differentiators, but the actual implementation is somebody else's problem in the same way that continuous electrical service provision is not my chief concern when I turn on the TV. What I want and will get is an inverse to the matrix I submitted in the time frame I requested deposited where I requested it to be put. I may use the inverted matrix to simultaneously solve for earthquake locations and earth properties or for material stresses and strains in a two-dimensional plate. That is my recipe and my problem.

After all, I should get services "without knowledge of, expertise with, or control over the technology infrastructure that supports them," as the cloud computing wiki page claims. Essentially the aforementioned cloud characteristics are directed towards service providers rather than to the non-expert consumer that highlights the wiki definition. Isn't the differentiator between the Cloud and the Grid the concealment of the complex infrastructure underneath? If the non-expert consumer is expected to worry about algorithm scalability, distributing data, starting and stopping resources and all of that, they certainly will need to gain some expertise quickly. Further, once they have that skill, why wouldn't they just use a mature Grid offering rather than deal with the non-standardized and chaotic clouds? Are these provider-specific characteristics not just a total rebranding of Grid?

As such, I suggest that several consumer-based characteristics should replace the rather inconsequential provider-internal ones that currently exist.

A cloud is characterized by services that:

use a specified algorithm to solve a particular problem;
can be purchased for one-time, infrequent use, or regular use;
state their peak, expected, and minimum performances;
state the expected response time;
can be queried for changes to expected response time;
support asynchronous messaging. A consumer must be able to discover when things are finished;
use standard, open, general-purpose protocols and interfaces (clearly);
have specified entry-points;
can interact with other cloud service providers. In particular, a service should be able to send output to long-term cloud-storage providers;

Now that sounds more like Computation-as-a-Service.

Monday, November 3, 2008

Cloud Computing: Commodity or Value Sale?

By Rich Wellner

There is a controversy in the cloud community today about whether the market is going to be one based on value or price. Rephrased, will cloud computing be a commodity or an enablement technology.

A poster on one of the cloud computing lists asserted that electricity would be a key component of pricing. He was then jumped on by people saying that value would be the key.

It seems like folks are talking past one another.

His assertion is true if CC is a commodity.

Now that said, there are precious few commodities in IT. Maybe internet connectivity is one. Monitors might be another. Maybe there are a few more.

But very quickly you get past swappable components that do very nearly the same job and into the realm of 'stuff' that is not easily replaceable. Then the discussion turns to one of value.

Amazon recognized the commodity of books and won the war over people who were trying to sell value. They appear to be attempting to do the same with computer time, which makes the battle they will fight over the next few years with Microsoft (and the increasing number of smaller players) extra interesting.

There is also the problem of making sweeping statements like "the market will figure things out". There is no "the market". Even on Wall Street. The reason things happen is because different people and institutions have different investment goals. Those goals vary over time and create growing or shrinking windows of opportunity for other people and institutions.

I've made my bet on how "the market" for cloud computing will shake out in the short to medium term. Now I'm just hoping that there are enough of the people and institutions my bet is predicated on in existence.

Wednesday, October 29, 2008

Elastic Management of Computing Clusters

By Ignacio Martin Llorente

Besides all the hype, clouds (i.e. a service for the on-demand
provision of virtual machines, others would say IaaS) are making
utility computing a reality, check for example the the Amazon EC2 case studies .
This new model, and virtualization technologies in general, is also
being actively explored by the scientific community. There are quite a
few initiatives that integrates virtualization with a range of
computing platforms, from clusters to Grid infrastructures.
Once this integration is achieved the next step is natural, jump to the
clouds and provision the VMs from an external site. For example, a
recent work from UNIVA UD has demonstrated the feasibility of supplementing a UNIVA Express cluster with EC2 resources (you can download the whitepaper to learn more).

This cloud provision model can be further integrated with the
in-house physical infrastructure when it is combined with a virtual
machine (VM) management system, like OpenNebula.
A VM manager is responsible for the efficient management of the virtual
infrastructure as a whole, by providing basic functionality for the
deployment, control and monitoring of VMs on a distributed pool of
resources. The use of this new virtualization layer decouples the
computing cluster from the physical infrastructure, and so extends the
classical benefits of VMs to the cluster level (i.e. cluster
consolidation, cluster isolation, cluster partitioning and elastic
cluster capacity).

Architecture of an Elastic Cluster

A computing cluster can be easily virtualized by putting the front-end
and worker nodes into VMs. In our case, the virtual cluster front-end
(SGE master host) is deployed in the local resources with Internet
connectivity to be able to communicate with Amazon EC2 VMs. This
cluster front-end acts also as NFS and NIS server for every worker node
in the virtual cluster.

The virtual worker nodes communicate with the front-end through a private local area network. The local worker nodes are connected to this vLAN through a virtual bridge configured in every physical host. The EC2 worker nodes
are connected to the vLAN with an OpenVPN tunnel, which is established
between each remote node (OpenVPN clients) and the cluster front-end
(OpenVPN server). With this configuration, every worker node (either
local or remote) can communicate with the front-end and can use the
common network services transparently. The architecture of the cluster
is shown in the following figure:

Figure courtesy of Prof. Rafael Moreno

Deploying a SGE cluster with OpenNebula and Amazon EC2

The last release of OpenNebula includes a driver to deploy VMs in the
EC2 cloud, and so it integrates the Amazon infrastructure with your
local resources. The EC2 is managed by OpenNebula just as another local
resource with a configurable pre-fixed size,
to limit the cluster capacity (i.e. SGE workernodes) that can be
allocated in the cloud. In this set-up, your local resources would look
like as follows:

>onehost list
HID NAME     RVM      TCPU   FCPU   ACPU    TMEM    FMEM STAT
   0 ursa01     0       800    798    800 8387584 7663616  off
   1 ursa02     0       800    798    800 8387584 7663616  off
   2 ursa03     0       800    798    800 8387584 7663616  on
   3 ursa04     2       800    798    600 8387584 6290432  on
   4 ursa05     1       800    799    700 8387584 7339008  on
   5 ec2        0       500    500    500 8912896 8912896  on

The last line corresponds to EC2, currently configured to host up to 5 m1.small instances.

The OpenNebula EC2 driver translates a general VM deployment file in
an EC2 instance description. The driver assumes that a suitable Amazon
machine image (AMI) has been previously packed and registered in the S3
storage service. So when a given VM is to be deployed in EC2 its AMI
counterpart is instantiated. A typical SGE worker node VM template
would be like this:

NAME   = sge_workernode
CPU    = 1
MEMORY = 128                                                            

#Xen or KVM template machine, used when deploying in the local resources
OS   = [kernel="/vmlinuz",initrd= "/initrd.img",root="sda1" ]
DISK = [source="/imges/sge/workernode.img",target="sda",readonly="no"]
DISK = [source="/imges/sge/workernode.swap",target="sdb",readonly="no"]
NIC  = [bridge="eth0"]

#EC2 template machine, this will be use wen submitting this VM to EC2
EC2 = [ AMI="ami-d5c226bc",
        KEYPAIR="gsg-keypair",
        AUTHORIZED_PORTS="22",
        INSTANCETYPE=m1.small]

Once deployed, the cluster would look like this (sge master, 2 local worker nodes and 2 ec2 worker nodes:

>onevm list
  ID      NAME STAT CPU     MEM        HOSTNAME        TIME
  27  sgemast runn 100 1232896          ursa05 00 00:41:57
  28  sgework runn 100 1232896          ursa04 00 00:31:45
  29  sgework runn 100 1232896          ursa04 00 00:32:33
  30  sgework runn   0       0             ec2 00 00:23:12
  31  sgework runn   0       0             ec2 00 00:21:02

You can get additional info from your ec2 VMs, like the IP, using the onvm show command

So, it is easy to manage your virtual cluster with OpenNebula and
EC2, but what about efficiency?. Besides the inherent overhead induced
by virtualization (around a 10% for processing), the average deployment
time of a remote EC2 worker node is 23.6s while a local one takes only
3.3s. Moreover, when executing a HTC workload, the overhead induced by
using EC2 (vpn, and a slower network connection) can be neglected.

Ruben S. Montero

This is a joint work with Rafael Moreno and Ignacio M. Llorente

Reprinted from blog.dsa-research.org

Monday, October 20, 2008

Auditing the Cloud

By Rich Wellner

I've written here about the importance of SLAs for useful cloud computing platforms on a few occasions in the past. The idea behind clouds, that you can get access to resources on demand, is an appealing one. However, it is only part of the total picture. Without an ability to state what you want and go to bed, there isn't much value in the cloud.

Think about that for a minute. With the cloud computing offerings currently available there are no meaningful SLAs written down anywhere. Yet people, every day, run their production applications on an implicit SLA that is internalized something like "amazon is going to give me N units of work for M price".

There are two problems with this.

Amazon doesn't scale your resources. Your demand may have spiked and you are still running on the resource you signed up for.
There is no audit capability on EC2.

In the Cloud Computing Bill of Rights we wrote about three important attributes that need to be available to do an audit.

Events -- The state changes and other factors that effected your system availability.
Logs -- Comprehensive information about your application and its runtime environment.
Monitoring -- Should not be intrusive and must be limited to what the cloud provider reasonably needs in order to run their facility.

The idea here is that rather than just accepting what your cloud provider sends you at the end of the month as a bill, the world of cloud computing is complex enough that a reasonable set of runtime information must be made available to substantiate the providers claim for compensation.

This is particularly true in the world of SLAs. If my infrastructure is regularly scaling up, out, down or in to meet demands it is essential to be able to verify that the infrastructure is reacting the way that was contracted. Without that, it will be very hard to get people to trust the cloud.

Monday, October 13, 2008

Cloud and Grid are Complementary Technologies

By Ignacio Martin Llorente

There is a growing number of posts and articles trying to show how
cloud computing is a new paradigm that supersedes Grid computing by
extending its functionality and simplifying its exploitation, even
announcing that Grid computing is dead.
It seems that new technologies and paradigms have always the mission
objective to substitute existing ones. Some of these contributions do
not fully understand what grid computing is, focusing their comparative
analysis on simplicity of interfaces, implementation details or basic computing aspects. Others posts define Cloud in the same terms as Grid or create a taxonomy which includes Grid and cluster computing technologies.

Grid is as an interoperability technology, enabling
the integration and management of services and resources in a
distributed, heterogeneous environment. The technology provides support
for the deployment of different kinds of infrastructures joining
resources which belong to different administrative domains. In the
special case of a Compute Grid infrastructure, such as EGEE or TeraGrid,
Grid technology is used to federate computing resources spanning
multiple sites for job execution and data processing. There are many
success cases demonstrating that Grid technology provides the support
required to fulfill the demands of several collaborative scientific and
business processes.

On the other hand, I do not think there is a single definition for cloud computing as it denotes multiples meanings for different communities (SaaS, PaaS, IaaS...). From my view, the only new feature offered by cloud systems is the provision of virtualized resources as a service, being virtualization the enabling technology. In other words, the relevant contribution of cloud computing is the Infrastructure as a Service (IaaS) model.
Virtualization rather than other non significant issues, such as the
interfaces, is the key advance. At this point, I should remark that virtualization has been used by the Grid community before the arrival of the "Cloud".

Once I have clearly stated my position about Cloud and Grid, let me
show how I see Cloud (and virtualization as enabling technology) and
Grid as complementary technologies that will coexist and cooperate at
different levels of abstraction in future infrastructures.

There will be a Grid on top of the Cloud

Before explaining the role of cloud computing as resource provider
for Grid sites, we should understand the benefits of the virtualization
of the local infrastructure (Enterprise or Local Cloud?). How can I access on demand to a cloud provider if I have not previously virtualized my local infrastructure?.

Existing virtualization technologies allow a full separation of resource provisioning from service management.
A new virtualization layer between the service and the infrastructure
layers decouples a server not only from the underlying physical
resource but also from its physical location, without requiring any modification within service layers from both the service administrator and the end-user perspectives. Such decoupling is the key to support
the scale-out of a infrastructure in order to supplement local
resources with cloud resources to satisfy peak or fluctuating demands.

Getting back to the Grid computing case, the virtualization of a Grid site provides several benefits, which overcome many of the technical barriers for Grid adoption:

Easy support for VO-specific worker nodes
Reduce gridification cycles
Dynamic balance of resources between VO’s
Fault tolerance of key infrastructure components
Easier deployment and testing of new middleware distributions
Distribution of pre-configured components
Cheaper development nodes
Simplified training machines deployment
Performance partitioning between local and grid services
On-demand access to cloud providers

If you are interested in more details about how virtualization
and cloud computing can support compute Grid infrastructures you can
have a look at my presentation "An Introduction to Virtualization and Cloud Technologies to Support Grid Computing" (EGEE08). I also recommend the report "An EGEE Comparative study: Clouds and grids - evolution or revolution?".

There exist technology which supports the above use case. The OpenNebula engine
enables the dynamic deployment and re-allocation of virtual machines on
a pool of physical resources, providing support to access on-demand to Amazon EC2 resources. On the other hand, Globus Nimbus
provides a free, open source infrastructure for remote deployment and
management of virtual machines, allowing you to create compute clouds.

There will be a Grid under the Cloud

There is a growing interest in the federation of cloud sites. Cloud providers are opening new infrastructure centers at different geographical locations (see IBM or Amazon Availability Zones)
and it is clear that no single facility/provider can create a seemingly
infinite infrastructure capable of serving massive amounts of users at
all times, from all locations. David Wheeler once said, "Any problem in computer science can be solved with another layer of indirection… But that usually will create another problem“,
in the same line, federation of cloud sites involves many technological
and research challenges, but the good news is that some of them are not
new, and have been already studied and solved by the Grid community.

As stated above Grid is not only about computing. Grid is a technology for federation.
In the last years, there has been a huge investment in research and
development of technological components for sharing of resources across
sites. Several middleware components for file transferring, SLA
negotiation, QoS, accounting, monitoring... are available, most of them
are open-source. As also predicted by Ian Foster in his post "There's Grid in them thar Clouds",
those will be the components that could enable the federation of cloud
sites. On the other hand, other components have to be defined and
developed from scratch, mainly those related to the efficient
management of virtual machines and services within and across
administrative domains. That is exactly the aim of the Reservoir project, the European initiative in Cloud Computing.

Conclusions

In order to conclude this post let me venture some predictions about the coexistence of Grid and Cloud computing in future infrastructures:

Virtualization, cloud, grid and cluster are complementary
technologies that will coexist and cooperate at different levels of
abstraction
Although there are early adopters of virtualization in the
Grid/cluster/HPC community, its full potential has not been exploited
yet
In few years, the separation of job management from resource
management through a virtualized infrastructure will be a common
practice
Emerging open-source VM managers, such as OpenNebula, will contribute to speed up the adoption
Grid/cluster/HPC infrastructures will maintain a resource base
scaled to meet the average workload demand and will transparently
access to cloud providers to meet peak demands
Grid technology will be used for the federation of clouds

In summary, let's try to forget about hypes and concentrate on the
complementary functionality provided by both paradigms. My message to
the user community, the relevant issue is to evaluate which technology
meets your requirements. It is unlikely that a single technology will meet all
needs. My message to the Grid community, please do not see Cloud as a
threat. Virtualization and Cloud are needed to solve many of the
technical barriers for wider Grid adoption. My message to the Cloud
community, please try to take advantage of the research and development
performed by the Grid community in the last decade.

Ignacio Martín Llorente

Reprinted from blog.dsa-research.org

Wednesday, September 17, 2008

The OpenNebula Engine for Data Center Virtualization and Cloud Solutions

By Ignacio Martin Llorente

Virtualization has opened up avenues for new resource management
techniques within the data center. Probably, the most important
characteristic is its ability to dynamically shape a given hardware
infrastructure to support different services with varying workloads.
Therefore, effectively decoupling the management of the service (for
example a web server or a computing cluster) from the management of the
infrastructure (e.g. the resources allocated to each service or the
interconnection network).

A
key component in this scenario is the virtual machine manager. A VM
manager is responsible for the efficient management of the virtual
infrastructure as a whole, by providing basic functionality for the
deployment, control and monitoring of VMs on a distributed pool of
resources. Usually, these VM managers also offer high availability
capabilities and scheduling policies for VM placement and physical
resource selection. Taking advantage of the underlying virtualization
technologies and according to a set of predefined policies, the VM
manager is able to adapt the physical infrastructure to the services it
supports and their current load. This adaptation usually involves the
deployment of new VMs or the migration of running VMs to optimize their
placement.

The dsa-research group
at the Universidad Complutense de Madrid has released under the terms
of the Apache License, Version 2.0, the first stable version of the OpenNebula Virtual Infrastructure Engine.
OpenNebula enables the dynamic allocation of virtual machines on a pool
of physical resources, so extending the benefits of existing
virtualization platforms from a single physical resource to a pool of
resources, decoupling the server not only from the physical
infrastructure but also from the physical location. OpenNebula is a
component being enhanced within the context of the RESERVOIR European Project.

The new VM manger differentiates from existing VM managers in its
highly modular and open architecture designed to meet the requirements
of cluster administrators. OpenNebula 1.0 supports Xen and KVM
virtualization platforms to provide several features and capabilities
for VM dynamic management, such as centralized management, efficient
resource management, powerful API and CLI interfaces for monitoring and
controlling VMs and physical resources, fault tolerant design... Two of
the outstanding new features are its support for advance reservation
leases and on-demand access to remote cloud provider

Support for Advance Reservation Leases

Haizea
is an open source lease management architecture that OpenNebula can use
as a scheduling backend. Haizea uses leases as a fundamental resource
provisioning abstraction, and implements those leases as virtual
machines, taking into account the overhead of using virtual machines
(e.g., deploying a disk image for a VM) when scheduling leases. Using
OpenNebula with Haizea allows resource providers to lease their
resources, using potentially complex lease terms, instead of only
allowing users to request VMs that must start immediately.

Support to Access on-Demand to Amazon EC2 resources

Recently, virtualization has also brought about a new utility
computing model, called cloud computing, for the on-demand provision of
virtualized resources as a service. The Amazon Elastic Compute Cloudi
s probably the best example of this new paradigm for the elastic
capacity providing. Thanks to virtualization, the clouds can be used
efficiently to supplement local capacity with outsourced resources. The
joint use of these two technologies, VM managers and clouds, will
change arguably the structure and economics of current data centers.
OpenNebula provides support to access Amazon EC2 resources to
supplement local resources with cloud resources to satisfy peak or
fluctuating demands.

Scale-out of Computing Clusters with OpenNebula and Amazon EC2

As use case to illustrate the new capabilities provided by OpenNebula, the release includes documentation
about the application of this new paradigm (i.e. the combination of VM
managers and cloud computing) to a computing cluster, a typical data
center service. The use of a new virtualization layer between the
computing cluster and the physical infrastructure extends the classical
benefits of VMs to the computing cluster, so providing cluster
consolidation, cluster partitioning and support for heterogeneous
workloads. Moreover, the integration of the cloud in this layer allows
the cluster to grow on-demand with additional computational resources
to satisfy peak demands.

I gnacio Martín Llorente

Reprinted from blog.dsa-research.org