Wednesday, October 29, 2008

Elastic Management of Computing Clusters

By Ignacio Martin Llorente

Hype aside, clouds (i.e. services for the on-demand provisioning
of virtual machines; others would say IaaS) are making
utility computing a reality; see, for example, the Amazon EC2 case studies.
This new model, and virtualization technologies in general, is also
being actively explored by the scientific community. There are quite a
few initiatives that integrate virtualization with a range of
computing platforms, from clusters to Grid infrastructures.
Once this integration is achieved, the next step is natural: jump to the
cloud and provision the VMs from an external site. For example,
recent work from UNIVA UD has demonstrated the feasibility of supplementing a UNIVA Express cluster with EC2 resources (you can download the whitepaper to learn more).


OpenNebula virtual infrastructure engine components and their integration with Amazon EC2


This cloud provision model can be further integrated with the
in-house physical infrastructure when it is combined with a virtual
machine (VM) management system, like OpenNebula.
A VM manager is responsible for the efficient management of the virtual
infrastructure as a whole, by providing basic functionality for the
deployment, control and monitoring of VMs on a distributed pool of
resources. The use of this new virtualization layer decouples the
computing cluster from the physical infrastructure, and so extends the
classical benefits of VMs to the cluster level (i.e. cluster
consolidation, cluster isolation, cluster partitioning and elastic
cluster capacity).


Architecture of an Elastic Cluster

A computing cluster can be easily virtualized by putting the front-end
and worker nodes into VMs. In our case, the virtual cluster front-end
(the SGE master host) is deployed on the local resources, with Internet
connectivity so that it can communicate with the Amazon EC2 VMs. This
cluster front-end also acts as the NFS and NIS server for every worker node
in the virtual cluster.
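
As a rough sketch, the NFS side of that front-end configuration might look like the following /etc/exports; the exported paths and the 10.0.0.0/24 vLAN address range are assumptions for illustration, not the settings of the actual deployment:

>cat /etc/exports
# Hypothetical exports on the virtual cluster front-end
/home      10.0.0.0/255.255.255.0(rw,sync,no_subtree_check)
/opt/sge   10.0.0.0/255.255.255.0(rw,sync,no_subtree_check)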


The virtual worker nodes communicate with the front-end through a private local area network. The local worker nodes are connected to this vLAN through a virtual bridge configured on every physical host. The EC2 worker nodes
are connected to the vLAN with an OpenVPN tunnel, established
between each remote node (an OpenVPN client) and the cluster front-end
(the OpenVPN server). With this configuration, every worker node (either
local or remote) can communicate with the front-end and use the
common network services transparently. The architecture of the cluster
is shown in the following figure:


Virtual Cluster Architecture

Figure courtesy of Prof. Rafael Moreno
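
A minimal OpenVPN configuration for this topology might look like the sketch below. It runs the server in bridged (TAP) mode so that remote clients join the cluster vLAN directly; all addresses, file names and the server hostname are hypothetical, not the settings of the actual deployment:

# server.conf on the cluster front-end (all values hypothetical)
dev tap0                                  # TAP device, bridged into the vLAN
server-bridge 10.0.0.1 255.255.255.0 10.0.0.200 10.0.0.220
ca ca.crt
cert server.crt
key server.key
dh dh1024.pem
keepalive 10 120

# client.conf on each EC2 worker node
client
dev tap0
remote cluster-frontend.example.org 1194  # public address of the front-end
ca ca.crt
cert client.crt
key client.key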


Deploying an SGE cluster with OpenNebula and Amazon EC2

The latest release of OpenNebula includes a driver to deploy VMs in the
EC2 cloud, integrating the Amazon infrastructure with your
local resources. OpenNebula manages EC2 just like any other local
resource, with a configurable maximum size that limits the cluster
capacity (i.e. the number of SGE worker nodes) that can be
allocated in the cloud. In this set-up, your resources would look
as follows:


>onehost list
HID NAME     RVM      TCPU   FCPU   ACPU    TMEM    FMEM STAT
   0 ursa01     0       800    798    800 8387584 7663616  off
   1 ursa02     0       800    798    800 8387584 7663616  off
   2 ursa03     0       800    798    800 8387584 7663616  on
   3 ursa04     2       800    798    600 8387584 6290432  on
   4 ursa05     1       800    799    700 8387584 7339008  on
   5 ec2        0       500    500    500 8912896 8912896  on

The last line corresponds to EC2, currently configured to host up to 5 m1.small instances.
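
EC2 appears in that listing because it was registered with OpenNebula as just another host. The command would be something along the following lines; the driver names here are an assumption, and the exact syntax may differ between OpenNebula releases:

# Hypothetical registration of EC2 as a host (driver names assumed)
>onehost create ec2 im_ec2 vmm_ec2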


The OpenNebula EC2 driver translates a general VM deployment file into
an EC2 instance description. The driver assumes that a suitable Amazon
machine image (AMI) has previously been bundled and registered in the S3
storage service, so when a given VM is to be deployed in EC2, its AMI
counterpart is instantiated. A typical SGE worker node VM template
would look like this:


NAME   = sge_workernode
CPU    = 1
MEMORY = 128                                                            

#Xen or KVM template, used when deploying on the local resources
OS   = [kernel="/vmlinuz",initrd="/initrd.img",root="sda1" ]
DISK = [source="/images/sge/workernode.img",target="sda",readonly="no"]
DISK = [source="/images/sge/workernode.swap",target="sdb",readonly="no"]
NIC  = [bridge="eth0"]

#EC2 template, used when submitting this VM to EC2
EC2 = [ AMI="ami-d5c226bc",
        KEYPAIR="gsg-keypair",
        AUTHORIZED_PORTS="22",
        INSTANCETYPE=m1.small]
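
Submitting this template is then a single command (the file name is hypothetical); the OpenNebula scheduler decides whether each VM is deployed on a local host or on EC2:

>onevm create sge_workernode.template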

Once deployed, the cluster would look like this (SGE master, two local worker nodes and two EC2 worker nodes):


>onevm list
  ID      NAME STAT CPU     MEM        HOSTNAME        TIME
  27  sgemast runn 100 1232896          ursa05 00 00:41:57
  28  sgework runn 100 1232896          ursa04 00 00:31:45
  29  sgework runn 100 1232896          ursa04 00 00:32:33
  30  sgework runn   0       0             ec2 00 00:23:12
  31  sgework runn   0       0             ec2 00 00:21:02

You can get additional information about your EC2 VMs, such as the IP address, using the onevm show command.
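
For example, to inspect one of the EC2 worker nodes listed above:

>onevm show 30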


So it is easy to manage your virtual cluster with OpenNebula and
EC2, but what about efficiency? Besides the inherent overhead induced
by virtualization (around 10% for processing), the average deployment
time of a remote EC2 worker node is 23.6s, while a local one takes only
3.3s. Moreover, when executing an HTC workload, the overhead induced by
using EC2 (the VPN and a slower network connection) is negligible.


Ruben S. Montero


This is joint work with Rafael Moreno and Ignacio M. Llorente


Reprinted from blog.dsa-research.org 

Monday, October 20, 2008

Auditing the Cloud

By Rich Wellner

I've written here about the importance of SLAs for useful cloud computing platforms on a few occasions in the past. The idea behind clouds, that you can get access to resources on demand, is an appealing one. However, it is only part of the total picture. Without an ability to state what you want and go to bed, there isn't much value in the cloud.



Think about that for a minute. With the cloud computing offerings currently available there are no meaningful SLAs written down anywhere. Yet people, every day, run their production applications on an implicit SLA that is internalized as something like "Amazon is going to give me N units of work at price M".



There are two problems with this.



  • Amazon doesn't scale your resources. Your demand may have spiked, and you are still running on the resources you signed up for.
  • There is no audit capability on EC2.

In the Cloud Computing Bill of Rights we wrote about three important attributes that need to be available to do an audit:

  • Events -- The state changes and other factors that affected your system availability.
  • Logs -- Comprehensive information about your application and its runtime environment.
  • Monitoring -- Should not be intrusive and must be limited to what the cloud provider reasonably needs in order to run their facility.

The idea here is that rather than just accepting what your cloud provider sends you at the end of the month as a bill, the world of cloud computing is complex enough that a reasonable set of runtime information must be made available to substantiate the provider's claim for compensation.

This is particularly true in the world of SLAs. If my infrastructure is regularly scaling up, out, down or in to meet demands, it is essential to be able to verify that the infrastructure is reacting the way that was contracted. Without that, it will be very hard to get people to trust the cloud.

Monday, October 13, 2008

Cloud and Grid are Complementary Technologies

By Ignacio Martin Llorente

There is a growing number of posts and articles trying to show how
cloud computing is a new paradigm that supersedes Grid computing by
extending its functionality and simplifying its exploitation, some even
announcing that Grid computing is dead.
It seems that new technologies and paradigms always have the
mission of replacing existing ones. Some of these contributions do
not fully understand what Grid computing is, focusing their comparative
analysis on simplicity of interfaces, implementation details or basic computing aspects. Other posts define Cloud in the same terms as Grid, or create a taxonomy that includes Grid and cluster computing technologies.





Grid is an interoperability technology, enabling
the integration and management of services and resources in a
distributed, heterogeneous environment. The technology provides support
for the deployment of different kinds of infrastructures joining
resources that belong to different administrative domains. In the
special case of a compute Grid infrastructure, such as EGEE or TeraGrid,
Grid technology is used to federate computing resources spanning
multiple sites for job execution and data processing. There are many
success stories demonstrating that Grid technology provides the support
required to fulfill the demands of several collaborative scientific and
business processes.



On the other hand, I do not think there is a single definition of cloud computing, as it denotes multiple meanings for different communities (SaaS, PaaS, IaaS...). From my view, the only new feature offered by cloud systems is the provision of virtualized resources as a service, with virtualization being the enabling technology. In other words, the relevant contribution of cloud computing is the Infrastructure as a Service (IaaS) model.
Virtualization, rather than less significant issues such as the
interfaces, is the key advance. At this point, I should remark that virtualization was used by the Grid community before the arrival of the "Cloud".



Having clearly stated my position on Cloud and Grid, let me
show how I see Cloud (with virtualization as its enabling technology) and
Grid as complementary technologies that will coexist and cooperate at
different levels of abstraction in future infrastructures.


There will be a Grid on top of the Cloud


Before explaining the role of cloud computing as a resource provider
for Grid sites, we should understand the benefits of virtualizing
the local infrastructure (an Enterprise or Local Cloud?). How can I access a cloud provider on demand if I have not previously virtualized my local infrastructure?


Existing virtualization technologies allow a full separation of resource provisioning from service management.
A new virtualization layer between the service and the infrastructure
layers decouples a server not only from the underlying physical
resource but also from its physical location, without requiring any modification of the service layers from either the service administrator's or the end-user's perspective. Such decoupling is the key to supporting
the scale-out of an infrastructure, supplementing local
resources with cloud resources to satisfy peak or fluctuating demands.



Getting back to the Grid computing case, the virtualization of a Grid site provides several benefits, which overcome many of the technical barriers to Grid adoption:


  • Easy support for VO-specific worker nodes
  • Reduced gridification cycles
  • Dynamic balancing of resources between VOs
  • Fault tolerance of key infrastructure components
  • Easier deployment and testing of new middleware distributions
  • Distribution of pre-configured components
  • Cheaper development nodes
  • Simplified deployment of training machines
  • Performance partitioning between local and Grid services
  • On-demand access to cloud providers

If you are interested in more details about how virtualization
and cloud computing can support compute Grid infrastructures, have
a look at my presentation "An Introduction to Virtualization and Cloud Technologies to Support Grid Computing" (EGEE08). I also recommend the report "An EGEE Comparative study: Clouds and grids - evolution or revolution?".


Technology already exists to support the above use case. The OpenNebula engine
enables the dynamic deployment and re-allocation of virtual machines on
a pool of physical resources, providing support for on-demand access to Amazon EC2 resources. On the other hand, Globus Nimbus
provides a free, open-source infrastructure for the remote deployment and
management of virtual machines, allowing you to create compute clouds.


There will be a Grid under the Cloud


There is a growing interest in the federation of cloud sites. Cloud providers are opening new infrastructure centers at different geographical locations (see IBM or the Amazon Availability Zones),
and it is clear that no single facility or provider can create a seemingly
infinite infrastructure capable of serving massive numbers of users at
all times, from all locations. David Wheeler once said, "Any problem in computer science can be solved with another layer of indirection... But that usually will create another problem."
In the same vein, the federation of cloud sites involves many technological
and research challenges, but the good news is that some of them are not
new, and have already been studied and solved by the Grid community.


As stated above, Grid is not only about computing; Grid is a technology for federation.
In recent years, there has been a huge investment in research and
development of technological components for sharing resources across
sites. Several middleware components for file transfer, SLA
negotiation, QoS, accounting, monitoring... are available, most of them
open source. As also predicted by Ian Foster in his post "There's Grid in them thar Clouds",
those will be the components that could enable the federation of cloud
sites. On the other hand, other components have to be defined and
developed from scratch, mainly those related to the efficient
management of virtual machines and services within and across
administrative domains. That is exactly the aim of the Reservoir project, the European initiative in Cloud Computing.


Conclusions


To conclude this post, let me venture some predictions about the coexistence of Grid and Cloud computing in future infrastructures:


  • Virtualization, cloud, grid and cluster are complementary
    technologies that will coexist and cooperate at different levels of
    abstraction
  • Although there are early adopters of virtualization in the
    Grid/cluster/HPC community, its full potential has not been exploited
    yet
  • In a few years, the separation of job management from resource
    management through a virtualized infrastructure will be common
    practice
  • Emerging open-source VM managers, such as OpenNebula, will help speed up this adoption
  • Grid/cluster/HPC infrastructures will maintain a resource base
    scaled to meet the average workload demand and will transparently
    access cloud providers to meet peak demands
  • Grid technology will be used for the federation of clouds

In summary, let's try to forget about the hype and concentrate on the
complementary functionality provided by both paradigms. My message to
the user community: the relevant issue is to evaluate which technology
meets your requirements; it is unlikely that a single technology will meet all
needs. My message to the Grid community: please do not see Cloud as a
threat; virtualization and Cloud are needed to solve many of the
technical barriers to wider Grid adoption. My message to the Cloud
community: please try to take advantage of the research and development
performed by the Grid community over the last decade.


Ignacio Martín Llorente



Reprinted from blog.dsa-research.org