Wednesday, February 27, 2008

Breaking Out of the Core

By Roderick Flores


I think that one of the most exciting consequences of the rise of multicore is the possibility of overcoming the limitations of the WAN by processing where you collect your data.    It is exceptionally difficult and/or expensive to move large amounts of data from one distant site to another regardless of the processing capability you might gain.  Paul Wallis has an excellent discussion about the economics and other key issues that the business community faces with computing on "The Cloud" in his blog Keystones and Rivets.





So how do cores help us get past the relatively high costs of the WAN?  The first signs of this trend will be wherever significant amounts of data are collected out in the field.  Currently, you have a number of options, none of them great, for retrieving your data for processing.  These include:



  • Provision the bandwidth required to move the data, typically at significant cost.
  • Significantly reduce the size or quality of the data and transmit it more affordably.
  • Write the data to media and collect it on a regular basis.


There never really was much consideration given to processing the data in situ because the computational power just was not there.  Multicore processors have allowed us to rethink this. 




For example, consider one of the most sought-after goals in a hot industry: near-real-time monitoring of a reservoir for oil production and/or CO2 sequestration. (see the Intelligent Oilfield, IPCC Special Report on Carbon Dioxide Capture and Storage)  The areas where this is most desired tend to be fairly remote, such as offshore or in the middle of inhospitable deserts.  There is no network connectivity to speak of in these areas, let alone enough to move data from a large multi-component ocean-bottom seismic array like those found in the North Sea.



Consequently, a colleague and I were tasked with working out how we might implement the company’s processing pipelines in the field.  Instead of processing the data using hundreds of processors and an equivalent number of terabytes of storage, everything needed to fit into *maybe* as much as a single computer rack.  Our proposal had to include power conditioning and backup, storage, processing nodes, management nodes (e.g. resource managers), as well as nodes for user interaction.  Electrical circuit capacity further constrained our choices.  Needless to say, 30-60 processors were simply not enough to seamlessly transition the algorithms from our primary data center.  The only way it could be done was by developing highly specialized processing techniques: a task that could take years.



Now that we are looking at 8 cores per processor, with 16 just around the corner, everything has changed.  Soon it will be possible to provision anywhere from 160 to 320 processor cores under the same constraints as before.  It is easy to imagine another doubling of this shortly thereafter.  Throw in some virtualization for a more nimble environment, and we will be able to do sophisticated processing of data in the field.  In fact, high-quality and timely results could alleviate much of the demand for more intensive processing after the fact.



Who needs the WAN and all of its inherent costs and risks? Why pay for expensive connectivity when you could have small clusters with hundreds of processors available in every LAN? If remote processing becomes commonplace because of multicore, we might see the business community gravitate towards the original vision of the Grid.

How Will Users Interact With the Cloud?

By Rich Wellner

This is a repost of a reply I wrote to a LinkedIn question.



Mark Mathson gave a great answer and blog link in his reply, but it's worth going down one additional level of detail.



A cloud is operated by something. That something is software, and people need to be able to interact with that software. So the question is twofold.



1) What does that software do?

2) What does the interaction model look like?



Part one is mostly undefined. The term cloud computing is only a few months old at this point and there is no definition that I've seen that describes in detail what the services are and how they work. Since cloud computing is a subset of grid computing we can make some educated guesses as to how this will turn out.



o There will have to be a security model. This model will be complex enough that I'm calling out additional specifics below. Currently there is no model specified in any definition of cloud computing.



o That model includes delegation. In the early development of the grid we had a security model without delegation and it was a non-starter. Anytime you need to request something of a service you need to delegate authority to that service.



o That model will have to be multi-institutional. By this I mean that the model must allow people from different communities to be able to access the resources within the cloud without having to join a common security domain. The owner of the resources will have to be able to make local decisions about who is allowed to use his resources.



o Monitoring will be complex, but must run on a common backplane. In the grid community we have hierarchical, distributed monitoring that allows canonical services and a variety of applications to push monitoring information upstream to consumers. No definition of cloud computing currently has any monitoring specification.



o Data handling will be a challenge. In the grid community we discovered early on that moving data between facilities was a bottleneck due to some decisions made in developing TCP decades ago. We worked around these to develop protocols that move data at near theoretical maximum rates even in WAN environments. We also found that people who want to move a lot of data find it cumbersome to manage the processes to do that themselves. We developed 'fire and forget' mechanisms for moving data. A user can make a request, walk away and check the results the next day. As a side note, this behavior requires delegation to work in a secure fashion.



All of the above have to be dealt with before one even begins to contemplate the VM issues that seem to dominate the cloud computing discussions.



The second part is about how the user will interact. That one is much easier to answer. Our users already interact in a variety of ways. Some examples include browsers, native applications, Java applications, remote desktops and display technologies like the X Window System.



All of those will continue to be in play in a cloud-based architecture because each has significant structural, administrative and performance advantages that have led to their survival for a long time.



The cloud won't be about what window a user interacts with, it will be about the plumbing that makes that window useful.

Tuesday, February 26, 2008

Why Should You Use Open Source?

By Rich Wellner

The open source justification is no longer the new path that few organizations have walked. I remember in the mid-'90s, when I switched from Solaris x86 to BSD and then to Linux, trying to explain what I was doing to co-workers. At that point I wasn't even trying to justify a decision to migrate some production machines; I was just exploring alternatives on my workstations. Still, I got far more confusion and skepticism than nods of understanding.



Today the world is different. People use open source for a wide variety of things. Most folks understand the landscape and regularly use total cost of ownership and risk mitigation as important parts of their final decision. What's still missing, in some cases, is the ability to take advantage of a unique opportunity that open source gives you at an infrastructure layer.



Grid software is fundamentally concerned with managing very complex business needs in a manner that allows humans to understand what is going on with their systems. As such, one of the most important aspects is the ability to integrate that infrastructure with applications in a manner that allows developers and system integrators to present simpler interfaces to their users.



With proprietary systems there are often APIs that allow this to be done. However, in no instance that I've seen are these APIs on the 'critical path' for the company making the software. They are always offered essentially as a patch that some powerful customer needed and that is now slowly leaking out to the rest of the customer base. These APIs also tend to be highly unstable, and each version carries changes. These changes are frequently radical and nearly always undocumented until a customer comes across something that has stopped working and raises a stink with the vendor.



Open source software tends to work differently, especially at an infrastructure layer. The components are built by folks who are 'eating their own home cooking' and understand the implications of a change in interface. As such, interface changes tend to be infrequent and, when they do occur, highly justifiable. The reduction in the number of changes is helpful, but because there is no vendor forcing an upgrade, the fact that you can adopt a new version when the timing is right for your organization is also a big plus.



The world has changed. And it's changed for the better for data center managers globally.

Friday, February 22, 2008

How to Rank High Throughput Computing Environments

By Ignacio Martin Llorente

TOP500 lists computers ranked by their performance on the LINPACK Benchmark. It is clear that no single number can reflect the performance of a computer. Linpack is, however, a representative benchmark for evaluating computing platforms as High Performance Computing (HPC) environments, that is, in the dedicated execution of a single tightly coupled parallel application. On the other hand, an HTC application comprises the execution of a set of independent tasks, each of which usually performs the same calculation over a subset of parameter values. Although the HTC model is widely used in Science, Engineering and Business, there is no representative benchmark or model to evaluate the performance of computing platforms as HTC environments. At first sight, it could be argued that there is no need for such a performance model. We agree on this for static and homogeneous systems. However, how can we evaluate a system consisting of heterogeneous and/or dynamic components?



Benchmarking of Grid infrastructures has always been a highly contentious area. The heterogeneity of the components and the high number of layers in the middleware stack make it difficult even to define the aim and scope of the benchmark. A couple of years ago we wrote a paper entitled "Benchmarking of High Throughput Computing Applications on Grids" (R. S. Montero, E. Huedo and I. M. Llorente) for the journal Parallel Computing, presenting a pragmatic approach to evaluate the performance of a Grid infrastructure when running High Throughput Computing (HTC) applications. We demonstrated that the complexity of a whole Grid infrastructure can be represented by only two performance parameters, which can be used to compare infrastructures. The proposed performance model is independent of the middleware stack and valid for any computing infrastructure, so it is also applicable to the evaluation of clusters and HPC servers.



The Performance Model



Our proposal is to follow an approach similar to that used by Hockney and Jesshope to characterize the performance of homogeneous array architectures on vector computations. A first-order description of a Grid can be made by using the following formula for the number of tasks completed as a function of time:



n(t)=R*t-N



Note that given the heterogeneous nature of a Grid, the execution time of each task can differ greatly. So the following analysis is valid for general HTC applications, where each task may require distinct instruction streams. The coefficients of the line are called:



  • Asymptotic performance (R): the maximum rate of performance, in tasks executed per second. In the case of a homogeneous array of P processors with an execution time per task T, we have R = P/T.
  • Half-performance length (N): the number of tasks required to obtain half of the asymptotic performance. This parameter is also a measure of the amount of parallelism in the system as seen by the application. In the homogeneous case, for an embarrassingly distributed application we obtain N = P/2.


The above linear relation  can be used to define the performance of the system (tasks completed per second) on actual applications with a finite number of tasks:



r(n)=R/(1+N/n)
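To make the model concrete, the two parameters can be estimated with a straight-line fit of the number of completed tasks against wall-clock time. Below is a minimal Python sketch (not from the paper); the completion times are synthetic and the processor count P is a made-up figure, used only to illustrate the degree-of-heterogeneity calculation discussed further down.

import numpy as np

# Synthetic completion timestamps (seconds since submission) for 40 tasks
# run on a hypothetical heterogeneous testbed.
rng = np.random.default_rng(42)
completion_times = np.sort(rng.uniform(30.0, 600.0, 40))
tasks_completed = np.arange(1, len(completion_times) + 1)

# Fit n(t) = R*t - N: the slope is the asymptotic performance R (tasks/s)
# and the negated intercept is the half-performance length N (tasks).
R, intercept = np.polyfit(completion_times, tasks_completed, 1)
N = -intercept

# Throughput on a finite application of n tasks: r(n) = R / (1 + N/n)
r_100 = R / (1.0 + N / 100.0)

# Degree of heterogeneity m = 2N/P, with P the total number of processors.
P = 64
m = 2.0 * N / P

print(f"R = {R:.4f} tasks/s, N = {N:.2f} tasks")
print(f"r(100) = {r_100:.4f} tasks/s, m = {m:.2f}")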






Interpretation of the Parameters of the Model



This linear model can be interpreted as an idealized representation of a heterogeneous Grid, equivalent to a homogeneous array of 2N processors with an execution time per task of 2N/R.






The half-performance length (N), on the other hand, provides a quantitative measure of the heterogeneity in a Grid. This result can be understood as follows: faster processors contribute to a greater degree to the performance obtained by the system. Therefore the apparent number of processors (2N), from the application's point of view, will in general be lower than the total number of processors in the Grid (P). We can define the degree of heterogeneity (m) as 2N/P. This parameter varies from m = 1 in the homogeneous case to m = 0 when the actual number of processors in the Grid is much greater than the apparent number of processors (highly heterogeneous).



N is a useful characterization parameter for Grid infrastructures in the execution of HTC applications. For example, let us consider two different Grids with a similar asymptotic performance. In this case, by analogy with the homogeneous array, a lower N parameter reflects a better performance (in terms of wall time) per Grid resource, since the same performance (in terms of throughput) is delivered by a smaller "number of processors".
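As a quick hypothetical illustration of this comparison: two Grids that both deliver R = 1 task per second, but with N = 10 and N = 50 respectively, behave like homogeneous arrays of 20 and 100 processors. The first Grid delivers the same throughput with far fewer apparent processors, and therefore better wall-time performance per resource.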








The Benchmark



We propose the OGF DRMAA implementation of the ED benchmark in the NAS Grid Benchmark suite, with an appropriate scaling to stress the computational capabilities of the infrastructure, as the benchmark for applying the performance model. The ED benchmark comprises the execution of several independent tasks. Each one consists of an execution of the SP flow solver with a different initialization parameter for the flow field. This kind of HTC application can be directly expressed with the DRMAA interface as bulk jobs.



DRMAA represents a suitable and portable API to express distributed communicating jobs, like the NGB. In this sense, the use of standard interfaces allows the comparison between different Grid implementations, since neither NGB nor DRMAA is tied to any specific Grid infrastructure, middleware or tool. DRMAA implementations are available for the following resource management systems: Condor, LSF, Globus GridWay, Grid Engine and PBS.
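To give a flavor of what such a bulk submission looks like, here is a minimal sketch using the Python DRMAA bindings (drmaa-python). The solver path, its argument convention and the task count are hypothetical placeholders for the scaled SP runs of the ED benchmark; only the DRMAA calls themselves are standard.

import drmaa

TASKS = 100  # hypothetical number of independent ED tasks

with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/opt/ngb/bin/sp"                           # hypothetical solver binary
    jt.args = ["--flow-init", drmaa.JobTemplate.PARAMETRIC_INDEX]  # per-task parameter
    jt.outputPath = ":/tmp/ed." + drmaa.JobTemplate.PARAMETRIC_INDEX + ".out"
    jt.joinFiles = True

    # A single bulk submission creates TASKS independent jobs (indices 1..TASKS),
    # which is exactly the embarrassingly distributed pattern of the ED benchmark.
    job_ids = s.runBulkJobs(jt, 1, TASKS, 1)

    # Wait for every task; the completion times are what the performance model fits.
    s.synchronize(job_ids, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
    s.deleteJobTemplate(jt)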






In the paper we present both an intrusive and a non-intrusive method for obtaining the performance parameters.  The light-weight non-intrusive probes provide continual information on the health of the Grid environment, and thus a way to measure the dynamic capacity of the Grid, which could eventually be used to generate global meta-scheduler strategies.



An Invitation to Action



We have demonstrated in several publications how the first-order model reflects the performance of complex infrastructures running HTC applications. So, why don't we create a TOP500-like ranking of infrastructures? The ranking could be dynamic, obtaining the parameters with the non-intrusive probes. We have all the ingredients:



  • A model representing the achieved performance by using only two parameters: asymptotic performance (R) and half-performance length (N)
  • A benchmark representative of HTC applications: embarrassingly distributed test  included in the NAS Grid Benchmark suite
  • A standard to express the benchmark: OGF DRMAA


Ignacio Martín Llorente
Reprinted from blog.dsa-research.org

Monday, February 18, 2008

How to Monitor Grid Engine

By Sinisa Veseli

You have built and installed your shiny new cluster, installed the Grid Engine software, configured the queues, and announced to the world that your new system is ready to be used. What next? Well, think about your monitoring options…



As users start submitting jobs and hammering the system in every possible way, things will inevitably break on occasion. When something goes wrong in the system, you will want to know about the problem before you start receiving help desk calls and user emails.



The first step in developing an effective strategy for monitoring Grid Engine is learning how to use the available command line tools and how to look for possible issues in the system. Some of the things that you should always pay attention to include:

• queues in the unknown state; a queue instance in an unknown state usually means that the execution daemon is down on that particular host

• queues and jobs in the error state

• configuration inconsistencies

• load alarms

All of the above information can be easily obtained using the qstat command (e.g., try something like “qstat -f -qs uaAcE -explain aAcE”). It is also not difficult to script basic GE monitoring tasks and come up with a simple infrastructure that is able to alert system administrators to any new or outstanding problems in the system.



As your user base grows, so will your monitoring needs, and you will likely want to extend your monitoring tools. You should consider looking into existing software packages like xml-qstat, which uses XSLT transformations to render Grid Engine command line XML output into different output formats. Alternatively, you can also develop a set of your own XSL stylesheets that are customized to your needs, and use widely available command line tools such as xsltproc to generate monitoring web pages from the “qstat -xml” output.
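If you would rather stay in a general-purpose scripting language than maintain stylesheets, the same XML output is easy to inspect directly. Below is a minimal Python sketch, assuming the usual layout of “qstat -f -xml” output (Queue-List elements carrying name and state); what you do with the alerts is left as a placeholder.

import subprocess
import xml.etree.ElementTree as ET

# Ask Grid Engine for the full queue listing in XML form.
xml_output = subprocess.run(
    ["qstat", "-f", "-xml"], capture_output=True, text=True, check=True
).stdout

root = ET.fromstring(xml_output)

problems = []
for queue in root.iter("Queue-List"):
    name = queue.findtext("name", default="?")
    state = queue.findtext("state", default="")
    # 'u' = unknown (execd likely down), 'E' = error, 'a'/'A' = load/suspend alarm
    if any(flag in state for flag in ("u", "E", "a", "A")):
        problems.append((name, state))

for name, state in problems:
    print("ALERT: queue instance %s is in state '%s'" % (name, state))
    # Hook in your notification mechanism (email, pager, etc.) here.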



Another interesting Grid Engine monitoring option is the Monitoring Console that comes with Cluster Express (CE). Its main advantage is that it integrates monitoring data from several different sources: Ganglia (system data), Grid Engine Qmaster and ARCo database (job data). However, even though Cluster Express by itself is easy to install, at the moment integrating the CE Monitoring Console with an existing Grid Engine installation requires a little bit of work. I am told that this will be much simplified in the upcoming CE release. In the meantime, if you are really anxious to try the CE Monitoring GUI on your Grid Engine cluster, do not hesitate to send me an email…

Thursday, February 14, 2008

My Head in the Clouds

By Rich Wellner

Up until I read Ian Foster’s Cloud Computing post, I had paid little attention to what the term meant to people.  Personally, I had already chalked the idea up as a rebranding of Grid computing.  So I asked a number of friends what they thought the differences between the two were.  Of course, many people not actively involved in the community are not familiar with either concept.   (I find it peculiar that computer professionals know the latest buzz around SaaS and SOA but do not seem to consider how and where they might land these systems.)  In any event, here is a summary of the answers I received:



  • Grid is characterized by more formalized computing arrangements between user and vendor whereas a Cloud is more for ad hoc resource utilization;
  • The types of computation on a Grid are typically parallel in nature whereas Clouds are for simpler calculations;
  • Grid was usurped by vendors to indicate that their services were distributed for better performance and reliability while Cloud had become the term for a generalized set of distributed resources;
  • Clouds are ethereal – anybody who watches them as they cross the sky knows that…




I found it rather interesting that, while there was some overlap of technical perspectives, I did not get any answers that were identical.  I believe that the last description explains this situation nicely while also offering the most interesting take on the topic.  A Cloud is a nuanced term that invokes the idea of something beautiful which also evolves rapidly, contains a lot of power, and then is gone.  Meanwhile the grid, like the utilities it was conceived from, is known for its reliability, its ability to tap into reserve power on a moment’s notice, and its accommodating levels of service.  Notice how the first two answers adhere to this concept?  I don’t think this is an accident, nor do I think that this escaped the attention of the marketing departments of industry leaders like Amazon and Google, both of whom operate in what they term Cloud space.  While it is distinctly possible that the term evolved organically, it is interesting that they chose to stick with it.



What is more, I found it particularly noteworthy that not one person I queried mentioned the amount of data to be processed. Foster and the Business Week article he references, as well as many others, suggest that we need to think in terms of a great deal more data than we have before.  For example, Google wants its people to think in terms of a thousand times more data than they are accustomed to.



Heck, I was thinking about writing about the so-called “Data Tsunami” myself – but not in terms of significantly larger datasets. The datasets we were working with a decade ago were suitably massive for what we were trying to accomplish.  Like today, it was not economically feasible to keep them all online at once.  The fact is that the incredible leaps in computational capacity have led us to build more complicated problems that demand still more data.  As such, a thousand times more data is probably still not enough.  If only the networking and storage companies had kept up with the leaps in processing capacity (I was on a gigabit network five years ago and I am using a gigabit network today >sigh<).



Consequently, we still have to use the tried-and-true standard operating procedure of:



  • First, carefully select a fraction of the data available for storage (i.e. triggering);
  • Next, prune this rough dataset further by pre-processing to find the most interesting records;
  • Finally, perform detailed analysis on what is computationally feasible.




For example, the Large Hadron Collider will be examining billions of collisions per second but will only store a few hundred per second for later processing (see the recent article on Dr. Heuer). We have always needed to think in terms of a thousand times more data than we can possibly process or become accustomed to.  Basically, the scope of what is economically feasible has changed dramatically over the last few decades, while we continue to be quite resource constrained.  This brings us back to the concept of capacity computing, whether in the form of a transitory Cloud, a steadfast Grid, or even the comfortable @home project.  The key here is that people are continuing to push past the boundaries of what is feasible.

Wednesday, February 13, 2008

The Hot Trend for 2008

By Rich Wellner

Fortune 500 businesses, and many smaller ones, run clusters in many different locations around the world.

As facility, management and other costs continue to become larger and larger shares of corporate IT budgets, networking costs continue to fall. The result is that data center consolidation becomes a more reasonable goal.



I've seen this with a few of my customers. Beginning last year, people started looking more toward grid technology to help them manage this consolidation. As the economy has tightened, more have considered it, particularly as part of a plan to reduce costs by moving to open source tools.



The general pattern is that the IT group decides they need to find ways to more effectively manage large and disconnected sets of resources. They turn to grid computing to help them manage that cloud and in the process realize that they have a lot of special purpose machines that are being quite underutilized and that they have enormous duplication of effort in the management of those data centers.



As we've entered a bear market, many companies are taking a second look at their IT costs and looking for ways to tighten their belts. The combination of open source and grid/cloud computing models offers the ability to do that, with open source offering a lower-cost software acquisition model and grid computing allowing a reduction in IT staff through centralization.



I've also been working with folks on the lost art of environment management. But more on that in a future blog...

Friday, February 8, 2008

How to Build Utility Computing Infrastructures with Globus

This is a guest post by Ignacio Martín Llorente, Professor of Distributed Systems Architectures at Universidad Complutense de Madrid.

While research institutions are interested in Partner Grids, which provide access to higher computing performance to satisfy peak demands and support collaborative projects, enterprises understand grid computing as a way to address the changing service needs of an organization. They are interested in in-house resource sharing, to achieve a better return on their information technology investment, supplemented by outsourced resources to satisfy peak or unusual demands. An Outsourced/Utility Grid would provide pay-per-use computational power when Enterprise Grid resources are overloaded. Such a hierarchical grid organization may be extended recursively to federate a larger number of Partner or Outsourced Grid infrastructures with consumer/provider relationships. This would allow resources to be supplied on demand, making resource provision more agile and adaptive. It would therefore offer access to a potentially unlimited computational capacity, transforming IT costs from fixed to variable.




In the context of the GridWay project, we have developed a Grid Gateway that exposes a WSRF interface to a metascheduling instance, thus enabling the creation of hierarchical grid structures. GridGateWay consists of a set of Globus services hosting a GridWay Metascheduler, providing a uniform, standard interface for the secure and reliable submission, monitoring and control of jobs. Most functionality is provided through GRAM (Grid Resource Allocation and Management), while scheduling information is provided through MDS (Monitoring and Discovery Service). The security requirement at the user level is addressed by GSI (Grid Security Infrastructure).

The new technology allows different layers of metaschedulers to be arranged in a hierarchical structure. In this arrangement, each target grid is handled as just another resource; that is, the underlying grid is characterized as a single resource in the source grid by means of grid gateways. This strategy encourages companies to federate their grids in order to get a better return on IT investment and to satisfy peak demands for computation. Furthermore, this solution allows for gradual deployment (from fully in-house to fully outsourced), in order to deal with the obstacles to grid technology adoption, such as enterprise scepticism and IT staff resistance.

This approach also provides the components required for interoperability between existing Grid infrastructures. It is clear that we can’t wait for a single global grid to arise or to become predominant. Instead, we should work to build a seamless integration of the existing grids, which may eventually constitute the ultimate, capital-letter Grid, Grid of grids, or InterGrid, in the same way that the Internet was born. Grid interoperability can be achieved by means of common, ideally standard, grid interfaces, whose existence is an important (if not essential) characteristic of grid systems. Unfortunately, common interfaces (let alone standard ones) are not always available for given services. In those cases, the use of grid adapters and gateways becomes necessary. In particular, an interoperability solution based on grid gateways provides the infrastructures with significant benefits in terms of autonomy, scalability, deployment and security.

Well, what are you waiting for? The components are open source, the license is Apache v2.0, and we are willing to collaborate with you.

Wednesday, February 6, 2008

Six, No, Eight! Reasons to Deploy Virtualization

By Rich Wellner

Dan Kusnetzky has a good article about why people use virtualization. He distills things down to the following six goals:



  • Higher Performance
  • Increased Scalability
  • Increased Reliability or Availability
  • Workload Consolidation
  • Application Agility
  • Unified Management Domain


I'd probably add a couple that have some overlap, but are worth calling out independently:



  • Reduction of Vendor Lock-in
  • Disaster Recovery

Tuesday, February 5, 2008

How To Write Your Own Load Sensors For Grid Engine

By Sinisa Veseli

As most Grid Engine (GE) administrators know, the GE execution daemon periodically reports values for a number of host load parameters. Those values are stored in the qmaster internal host object, and are used internally (e.g., for job scheduling) if a complex resource attribute with a corresponding name is defined.



Parameters that are reported by default (about 20 or so) are documented in the file $SGE_ROOT/doc/load_parameters.asc. For a large number of clusters the default set is sufficient to adequately describe their load situation. However, for those sites where this is not the case, the Grid Engine software provides administrators with the ability to introduce additional custom load parameters. Accomplishing this task is not difficult, and involves three steps:



1) Provide a custom load sensor. This can be a script or a binary that feeds the GE execution daemon with additional load information. It must comply with the following rules (a minimal Python sketch of such a sensor appears after step 3 below):



• It should be written as an infinite loop that waits for user input from the standard input stream.

• If the string “quit” is received, the sensor should exit.

• Otherwise, it should retrieve data necessary for computing the desired load figures, calculate those, and write them to the standard output stream.

• The individual host-related load figures should be reported one per line and in the form host:name:value (without any blanks). The load figures should be enclosed with a pair of lines containing only the “begin” and “end” strings. For example, a custom load sensor running on the machine tolkien.univaud.com and measuring parameters n_app_users and n_app_threads might show the following output:



begin

tolkien.univaud.com:n_app_users:12

tolkien.univaud.com:n_app_threads:23

end



Note that for global consumable resources not attached to each host (such as, for example, the number of used floating licenses), the load sensor needs to output the string “global” instead of the machine name.



2) For each custom load parameter, define a complex resource attribute using, for example, the “qconf -mc” command.

3) Enable the custom load sensor by executing the “qconf -mconf” command and providing the full path to your script or executable as the value of the “load_sensor” parameter. If all goes well, the execution daemon will start reporting the new load parameters within a minute or two, and you should be able to see them using the “qhost -F” command.
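To make the protocol in step 1 concrete, here is a minimal load sensor sketch in Python. The parameter names match the example output above, but the values are hard-coded placeholders; a real sensor would query your application or system instead.

#!/usr/bin/env python
import socket
import sys

def collect_load_values():
    """Placeholder: a real sensor would measure these values."""
    return {"n_app_users": 12, "n_app_threads": 23}

def main():
    # Adjust if your cluster knows this host by its fully qualified name.
    host = socket.gethostname()
    while True:
        line = sys.stdin.readline()          # execd writes a line each load interval
        if not line or line.strip() == "quit":
            break                            # "quit" (or EOF) means shut down
        print("begin")
        for name, value in collect_load_values().items():
            print("%s:%s:%s" % (host, name, value))   # host:name:value, no blanks
        print("end")
        sys.stdout.flush()                   # make sure execd sees the report now

if __name__ == "__main__":
    main()

Pointing the “load_sensor” parameter (step 3) at a script like this, and defining the matching complex attributes (step 2), is all that remains.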



Administrators with decent scripting skills (or those with a bit of luck ☺) can usually implement and enable new load sensors for their Grid Engine installations in a very short period of time. Note that some simple examples for custom load sensors can be found in the Grid Engine Admin Guide, as well as in the corresponding HowTo document.