Friday, January 25, 2008

Sledding and the Trough of Disillusionment

By Rich Wellner

Over the holiday break I did something I haven't done in ages.  My brothers and I went sledding.  When we were kids, I remember very well the graduated excitement of going from the little slope down the block, to being allowed to go to the big kids' hill with a plastic sled, and finally to going by myself on a lightning-fast sled with steel runners.  It was the best!



As we got older and our lives got more complicated we moved on to other things during the winter.  Sledding just didn't seem all that fun anymore, and it was a lot of work hauling that sled back up the hill every run.



Anyway, we were sitting around the kitchen one morning admiring the snow and decided to take the kids sledding.  Well, we told our wives that we were taking the kids sledding, but the kids gave up very quickly, leaving the hill to us.  The hill in question is on our family farm, in one of the pastures, with a dirt road running up it.  We started our slides on the hill above the road, crossed the road, launched as we hit the shoulder (as shown in the photo) and continued down the rest of the grade.



Conditions the days we sledded were absolutely perfect.  There had been about eight inches of snow a week prior that put down a nice base.  Then things warmed up for a few days before freezing solid again.  The entire farm was transformed into three inches of hard-packed snow with a thick icy crust on top.  Finally, we got another six inches of light, fluffy snow, providing enough cushion to land on without slowing things down too much.  Our record slides for the day ended up being about a hundred yards long.  It was a blast.



It was a blast, however, that we wouldn't have had if we hadn't pulled ourselves out of our "trough of disillusionment".  In the '90s, Gartner came up with a curve to describe the hype cycle associated with technology adoption.



[Image: Gartner hype curve]

As kids we had been fascinated with the speed and independence that sledding gave us.  We thought it was the ultimate, but as we got to know it better, it couldn't keep pace with our expectations.  At least until we grew up and were willing to accept it on its own terms.



This is the same point that grid computing is at today.  The grid community has years of history, and for a while everyone and his brother was jumping on the term 'grid' without regard for whether they were really doing the things necessary to qualify as such.  As a result we had a period of inflated expectations, ranging from "everything is already the grid" on one end to "we'll never achieve the grid because everyone everywhere isn't going to put all their resources online and let others use them" on the other.



Those are examples of expectations that were never advertised by those who kick-started the industry and coined the term grid.  They did, however, gain a lot of mind-share.  There has been some fallout from this.  Over the past few years, as we navigated through the trough of disillusionment, people have begun to adopt grid practices that they thought would add value to specific parts of their businesses.  In some cases they have even had to adopt new terms (e.g. 'cloud computing', 'utility computing', 'data center virtualization') because the 'G' word had become unsellable.  During this time, the concepts have begun to get widespread traction as people understand how to separate the wheat from the chaff.



So, put on your snow pants and check out some stuff you haven't looked at in a few years.  Globus Toolkit is over 11 years old now.  Grid Engine is now mature and open source.  Huge projects like caBIG are applying the original, far-reaching tenets of grid computing to try, quite literally, to cure cancer.



There are great opportunities out there to create more value for your customers and users, if you can just get past the hype.

Tuesday, January 22, 2008

Four Ways to Recession Proof Your Business with Grid

By Rich Wellner

Jimmy Carter, Ronald Reagan, George H.W. Bush and Bill Clinton all recognized that the economy was the part of American life that everyone depended on and noticed most acutely. They appointed Paul Volcker and Alan Greenspan to run the Fed, decisions critical to 25 years of a largely stable and growing economy.



Now we have Alan Greenspan saying that the current administration's failure to curb spending was "a major mistake", that Republican congressmen were "at a feeding trough" and that they "swapped principle for power [and] deserved to lose [the 2006 congressional election]". We also have current Fed chairman Ben Bernanke appearing before Congress asking for an "economic stimulus" package.



I'm no economics professor, but I got out of the market when Greenspan started stumping on "irrational exuberance", and he has my attention on the current economy again. Since I work in services, that got me thinking about how to help my customers mitigate their risks going into turbulence that looks like it might be enough to force some unscheduled stops.



Here are 4 things you can start doing today to prepare your business to ride out any storm that may hit this year.



1) Diversify. Right now most clusters are running as islands. There is a lot of chatter in the news lately about cloud computing. To a first approximation, cloud computing is the idea that users should be able to toss a request into the cloud and get results back. While cloud computing is a new term, the grid community has been doing cloud computing for years. From nearly the beginning of the movement, the security models necessary to do cloud computing have been fundamental parts of Globus Toolkit. Metaschedulers such as Gridway provide the next bit. A user submits their request to Gridway (their interface to the computing cloud) and gets their results back. The practical fallout in recessionary times is that you can expand your customer base by providing a single front end to all the resources in your data centers. This 'cloud' allows you to diversify your resource usage and greatly reduce the risk that some of your clusters will fall into disuse (i.e. disuse == loss of money!) as projects are cancelled, delayed or otherwise unable to meet their projections.
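
To make that workflow concrete, here is a minimal sketch of what a submission through a metascheduler like Gridway can look like. It assumes a working Gridway installation with its command-line tools (gwsubmit, gwps) on the PATH; the template fields follow Gridway's documented job-template style, but names and options should be checked against your installed version.

```python
import os
import subprocess
import tempfile

# A Gridway-style job template: the user describes *what* to run, and the
# metascheduler decides *where* in the "cloud" of clusters it runs.
# Field names follow Gridway's documented template format; verify them
# against your installation before relying on this sketch.
TEMPLATE = """\
EXECUTABLE  = /bin/uname
ARGUMENTS   = -a
STDOUT_FILE = stdout.${JOB_ID}
STDERR_FILE = stderr.${JOB_ID}
"""

def submit_job() -> str:
    """Write the template to disk, submit it, and return the raw gwsubmit output."""
    with tempfile.NamedTemporaryFile("w", suffix=".jt", delete=False) as jt:
        jt.write(TEMPLATE)
        path = jt.name
    try:
        # gwsubmit hands the job to the metascheduler, which picks a cluster.
        result = subprocess.run(["gwsubmit", "-t", path],
                                capture_output=True, text=True, check=True)
        return result.stdout
    finally:
        os.unlink(path)

if __name__ == "__main__":
    print(submit_job())
    # gwps lists job states; the user never needs to know which cluster ran the work.
    print(subprocess.run(["gwps"], capture_output=True, text=True).stdout)
```

The point of the pattern is that the user's interface stays the same no matter which of your clusters ends up doing the work.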



2) Offer Killer Customer Service. Customer service is about controlling the things you can. It's about training your staff to be great and making the experience of dealing with your services something your customers look forward to. I'm making a last-minute trip on Thursday to a customer site and realized yesterday how much I appreciate my travel agent, Sandie, over at Bursch Travel. She manages all those points programs that get my miles and hotel stays registered and remembers not to cram me into a window seat on an RJ. She's also a pleasure to talk with, even while she suffers through another ridiculously cold winter on the northern plains. She's one of the people I would fight for during a budget war.



You need to be that person for your customers also. There is too often conflict between internal support organizations and user communities. With open source tools you have integration opportunities that don't exist with other tools. As an example, we recently did a prototype project that allowed users to monitor their jobs via their Blackberry. We can perform application specific integrations and present a computational chemist with a custom dashboard (on their mobile or at their desktop) telling them the state of their Turbomole jobs. Now they can know how much progress their job has made using the metrics appropriate for that application, rather than simply whether the job is running somewhere and for how long.
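
As a rough illustration of that kind of integration (not the actual prototype), here is a sketch of the application-specific piece: a small function that scans a job's output file and turns it into a progress figure a dashboard could display. The log format, file path and cycle counts are invented for the example; a real integration would key off whatever markers your chemistry code actually writes.

```python
import re
from pathlib import Path

# Hypothetical progress markers -- a real integration would match whatever
# the application actually writes to its output file.
CYCLE_LINE = re.compile(r"^\s*cycle\s+(\d+)", re.IGNORECASE)

def job_progress(logfile: Path, expected_cycles: int) -> dict:
    """Summarize how far a running job has gotten, for a dashboard or mobile alert."""
    completed = 0
    for line in logfile.read_text(errors="ignore").splitlines():
        match = CYCLE_LINE.match(line)
        if match:
            completed = max(completed, int(match.group(1)))
    return {
        "cycles_done": completed,
        "cycles_expected": expected_cycles,
        "percent": round(100.0 * completed / expected_cycles, 1),
    }

# A monitoring service could poll this periodically and push the result to a
# web dashboard or a mobile e-mail alert.
if __name__ == "__main__":
    print(job_progress(Path("job_0042/output.log"), expected_cycles=50))
```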



3) Intensify Marketing. What if you threw a cloud and nobody came?



People don't like to admit it, but success for internal groups is measured the same as success with external users: making users aware of capabilities and getting them to use them. You've built your clusters. You've combined them into a grid or cloud. Now you have to get people to take advantage of those resources. Go to the user meetings. Understand their problems. Let them know you and your staff can help. Let them know that the reason you exist is to make them look great.



4) Seek Improvement. You've made a grid investment. One of the by-products of this is an immense amount of data about how your system is being used. Take the initiative to analyze this information before people even recognize what's there to be mined. Show your users the bottlenecks. Show them your failings. Understand what this means to their business. Then, work with them to build solutions. Be their hero!
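
As one small example of mining that data, here is a sketch that reads a hypothetical accounting export (the CSV columns are invented for the example) and reports the average queue wait per cluster, which is often the first bottleneck worth showing to users.

```python
import csv
from collections import defaultdict
from datetime import datetime

# Assumed CSV layout (adjust to whatever your scheduler's accounting export
# actually provides): cluster, submit_time, start_time, end_time
FMT = "%Y-%m-%d %H:%M:%S"

def average_wait_by_cluster(path: str) -> dict:
    """Average time (in minutes) jobs sat queued before starting, per cluster."""
    waits = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            submitted = datetime.strptime(row["submit_time"], FMT)
            started = datetime.strptime(row["start_time"], FMT)
            waits[row["cluster"]].append((started - submitted).total_seconds() / 60)
    return {cluster: sum(w) / len(w) for cluster, w in waits.items()}

if __name__ == "__main__":
    for cluster, wait in sorted(average_wait_by_cluster("accounting.csv").items()):
        print(f"{cluster}: average queue wait {wait:.1f} minutes")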

Tuesday, January 8, 2008

There's a Grid in them thar Clouds*

By Ian Foster

You’ve probably seen the recent flurry of news concerning “Cloud computing.” Business Week had a long article on it (with an amusing and pointed critique here). Nick Carr has even written a book about it. So what is it about, what is new, and what does it mean for information technology?



The basic idea seems to be that in the future, we won’t compute on local computers, we will compute in centralized facilities operated by third-party compute and storage utilities. To which I say, Hallelujah, assuming that it means no more shrink-wrapped software to unwrap and install.



Needless to say, this is not a new idea. In fact, back in 1960, computing pioneer John McCarthy predicted that “computation may someday be organized as a public utility”—and went on to speculate how this might occur.



In the mid 1990s, the term grid was coined to describe technologies that would allow consumers to obtain computing power on demand. I and others posited that by standardizing the protocols used to request computing power, we could spur the creation of a computing grid, analogous in form and utility to the electric power grid. Researchers subsequently developed these ideas in many exciting ways, producing for example large-scale federated systems (TeraGrid, Open Science Grid, caBIG, EGEE, Earth System Grid, …) that provide not just computing power, but also data and software, on demand. Standards organizations (e.g., OGF, OASIS) defined relevant standards. More prosaically, the term was also co-opted by industry as a marketing term for clusters. But no viable commercial grid computing providers emerged, at least not until recently.



So is “cloud computing” just a new name for grid? In information technology, where technology scales by an order of magnitude, and in the process reinvents itself, every five years, there is no straightforward answer to such questions.



Yes: the vision is the same—to reduce the cost of computing, increase reliability, and increase flexibility by transforming computers from something that we buy and operate ourselves to something that is operated by a third party.



But no: things are different now than they were 10 years ago. We have a new need to analyze massive data, thus motivating greatly increased demand for computing. Having realized the benefits of moving from mainframes to commodity clusters, we find that those clusters are darn expensive to operate. We have low-cost virtualization. And, above all, we have multiple billions of dollars being spent by the likes of Amazon, Google, and Microsoft to create real commercial grids containing hundreds of thousands of computers. The prospect of needing only a credit card to get on-demand access to 100,000+ computers in tens of data centers distributed throughout the world—resources that can be applied to problems with massive, potentially distributed data—is exciting! So we’re operating at a different scale, and operating at these new, more massive scales can demand fundamentally different approaches to tackling problems. It also enables—indeed is often only applicable to—entirely new problems.



Nevertheless, yes: the problems are mostly the same in cloud and grid. There is a common need to be able to manage large facilities; to define methods by which consumers discover, request, and use resources provided by the central facilities; and to implement the often highly parallel computations that execute on those resources. Details differ, but the two communities are struggling with many of the same issues.



Unfortunately, at least to date, the methods used to achieve these goals in today’s commercial clouds have not been open and general purpose, but have instead been mostly proprietary and specialized for the specific internal uses (e.g., large-scale data analysis) of the companies that developed them. The idea that we might want to enable interoperability between providers (as in the electric power grid) has not yet surfaced. Grid technologies and protocols speak precisely to these issues, and should be considered.



A final point of commonality: we seem to be seeing the same marketing. The first “cloud computing clusters”—remarkably similar to the “grid clusters” of a few years ago—are appearing. Perhaps Oracle 11c is on the horizon?



What does the future hold? I will hazard a few predictions, based on my belief that the economics of computing will look more and more like those of energy. Neither the energy nor the computing grids of tomorrow will look like yesterday’s electric power grid. Both will move towards a mix of microproduction and large utilities, with increasing numbers of small-scale producers (wind, solar, biomass, etc., for energy; for computing, local clusters and embedded processors—in shoes and walls?) co-existing with large-scale regional producers, and load being distributed among them dynamically. Yes, I know that computing isn’t really like electricity, but I do believe that we will nevertheless see parallel evolution, driven by similar forces.



In building this distributed “cloud” or “grid” (“groud”?), we will need to support on-demand provisioning and configuration of integrated “virtual systems” providing the precise capabilities needed by an end-user. We will need to define protocols that allow users and service providers to discover and hand off demands to other providers, to monitor and manage their reservations, and to arrange payment. We will need tools for managing both the underlying resources and the resulting distributed computations. We will need the centralized scale of today’s cloud utilities, and the distribution and interoperability of today’s grid facilities.



Some of the required protocols and tools will come from the smart people at Amazon and Google. Others will come from the smart people working on grid. Others will come from those creating whatever we call this stuff after grid and cloud. It will be interesting to see to what extent these different communities manage to find common cause, or instead proceed along parallel paths.



*An obscure cultural reference: the phrase “There’s gold in them thar hills” was first uttered, according to some, by an old prospector in the 1948 movie “Treasure of the Sierra Madre”, starring Humphrey Bogart.

Multicore: Not Just a Software Crisis

By Roderick Flores

I have been following the exchange over the existence of a multicore crisis with a significant amount of interest. The central question is whether software systems as we know them today will continue to show the performance gains that they have historically (following Moore’s Law). As you know, clock speeds have remained fairly constant over the past few years, and there are physical limits to the number of transistors that can be packed on a chip. Thus the leading manufacturers have begun placing more than one central processing core on a single chip.



So, can software developers continue to rely on computational power increases from multicore architectures to improve performance? Conversely, will they need some radical new toolsets in order to produce applications that can harness the power of a chip with a large number of cores on board? In other words, will the trend towards multicore lead to stagnation in application performance? Well, I couldn’t just stand by without diving into this fray.



I will not get into the details of the many sophisticated arguments for either side of this position. However, I think it is important to note a few. To begin with, there just are not that many embarrassingly parallel problems which would easily lend themselves to multicore processing. The other major problem sets (or dwarfs, as they have been termed in The Landscape of Parallel Computing Research: A View from Berkeley) are just not easy to program. As many of you know, parallel programming is typically hard and the tools to make it easy just do not exist. A number of people have also suggested that universities have not adequately trained people for the parallelism that multicore requires. They argue that these architectures are essentially the same as the SMP systems which people have been using for a very long time. Furthermore, threads have been around for a while and programmers should know how to use them.



Personally, I think this is all well and good but somewhat misleading. What if we had an embarrassingly parallel problem that could be multithreaded and written without breakthroughs in toolsets? Would that mean that we could just chug along realizing massive gains from multicore processors without hitting the figurative brick wall?



My most recent parallel programming experience involved a very embarrassingly parallel algorithm. Essentially the process was to load as much of the dataset into memory as you could, perform a large number of calculations on it, and then write the results. I will not go into too much detail because the algorithm is proprietary. In any event, to get speed up, all you had to do was multithread the software and let it loose. The more cores and memory I threw at the problem, the faster it became. I remember the excitement we had when we got near-linear speed-up on a system with a total of four cores.
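
For readers who have not written one of these, here is a minimal sketch of the pattern, not the proprietary algorithm: load a slab of data into memory, fan the independent per-element work out across however many cores you have, and collect the results. In Python the fan-out uses processes rather than threads to sidestep the interpreter lock; the "calculation" itself is a stand-in.

```python
from multiprocessing import Pool, cpu_count

def heavy_calculation(sample: float) -> float:
    """Stand-in for the real per-element work; each element is independent."""
    total = sample
    for i in range(1, 2000):
        total = (total * 1.0000001 + i) % 1_000_003
    return total

def run(dataset: list[float]) -> list[float]:
    # Embarrassingly parallel: no communication between workers, so adding
    # cores helps -- right up until memory and I/O bandwidth become the limit.
    with Pool(processes=cpu_count()) as pool:
        return pool.map(heavy_calculation, dataset, chunksize=1024)

if __name__ == "__main__":
    data = [float(i) for i in range(50_000)]
    results = run(data)
    print(f"processed {len(results)} samples on {cpu_count()} cores")
```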



So what would happen if I could have the quad-processor system with 64 cores each that I hear is just around the corner? You know, exactly the kind of system that most applications will not be able to utilize. I could get by with 16GB of memory and several terabytes of disk – those resources are already available. Shouldn’t I expect my embarrassingly parallel problem to run 256 cores / 4 cores, or 64 times, faster? In other words, a standard model that would take 40 hours on a single four-core computational node might finish in well under 40 minutes. If that were the case, we could have saved an enormous amount of money in hardware costs instead of buying our cluster.



What I can tell you is that this is a fantasy. You can only throw so many cores at a problem because of limitations in memory access speeds and latency, bus speeds, network speeds and latency, and SAN throughput. First of all, to store the nearly one terabyte that an example of the aforementioned problem would produce in under 40 minutes would require some pretty amazing throughput. Now imagine a doubling of the number of cores to 128 per processor: can we push that much data onto disk in under 20 minutes? In other words, will storage networks that sustain the better part of ten gigabits per second to disk be commonplace by the time 128-core chips are in production? Secondly, imagine that you have 256 threads accessing the data stored in memory. Each thread is going to be in a different part of the calculation and therefore memory access is going to be essentially random. I do not believe that today’s memory handles that kind of load without major contention. In short, you will not see anything like the speed-up for which you are hoping.
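
The back-of-envelope numbers behind that claim are easy to reproduce. Here is the arithmetic as a small script, using the one-terabyte output and the 40-hour, four-core baseline from the example above.

```python
# Back-of-envelope: how fast would storage have to be to keep up?
BASELINE_HOURS = 40        # run time on a single 4-core node
BASELINE_CORES = 4
OUTPUT_TERABYTES = 1.0     # data produced by one run of the example problem

for cores in (256, 512):   # 4 processors x 64 cores, then x 128 cores
    speedup = cores / BASELINE_CORES
    runtime_min = BASELINE_HOURS * 60 / speedup
    gbits_per_sec = OUTPUT_TERABYTES * 8_000 / (runtime_min * 60)
    print(f"{cores} cores: ~{runtime_min:.0f} min per run, "
          f"needs ~{gbits_per_sec:.1f} Gb/s sustained to disk")
```

That works out to roughly 3.6 Gb/s sustained at 256 cores and about 7 Gb/s at 512, before counting any contention from other jobs sharing the same storage.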



Similarly, I believe that you can see this problem on a grid, where you do not need a massively parallel program to get full use of your hardware. For example, imagine 200 four-way 64-core computational nodes connected to high-end storage through a low-latency network: you have a grid with over 50,000 cores! A good resource manager is more than willing to schedule single-threaded or parallel jobs on each of those cores. If your grid has high utilization rates, as is typical (you build it and they will use it), then you can easily see the same problem I described earlier: hundreds of processes contending for severely over-taxed resources.



So I see the crisis as more than just dealing with complicated parallel algorithms or a lack of good programming tools. Rather, the crisis seems to be rooted in the entire computational community. In order to make use of systems with numerous cores, there will need to be significant improvements in internal communication, networking, memory access, and probably a number of other details that I have not even considered, let alone imagined. That is not to say that the arguments surrounding the multicore crisis are not well founded. Rather, I am saying that there is much more to be concerned about than programming.

Wednesday, January 2, 2008

How to Use Grids to Cure Cancer

By Ivo Janssen

Part 11 in the pharmaceutical sector, SOX in the financial world. These days, businesses have to adhere to a slew of regulations, and implementing a production grid has not become any easier. This article will go into some of the factors that have to be weighed when installing grids in these audited environments, with specific attention to Part 11 regulation of desktop grids.




First, a little background. Title 21 CFR Part 11, or simply "Part 11", defines the criteria under which electronic records and electronic signatures are considered to be trustworthy, reliable and equivalent to paper records. Practically speaking, Part 11 requires drug makers, medical device manufacturers, biotech companies, and others to implement controls, including audits, validation systems, audit trails, electronic signatures, and documentation for software and systems. [1]

Unfortunately, there is no such thing as a handbook or a turnkey "Part 11 compliant system". Part 11 requires both procedural controls (i.e. notification, training, SOPs, administration) and administrative controls to be put in place by the customer, in addition to the technical controls that the vendor can offer. At best, the vendor can offer an application that provides the technical controls required of a compliant system [2]. As such, a grid vendor must work with each individual customer in deploying a custom system. At my company, Univa UD, I have been responsible for implementing Part 11 compliance with our product, Grid MP, at a number of pharmaceutical companies around the world.

With grids, and especially desktop grids, the following issues must be addressed:
  • User authentication
  • Application lifecycle
  • Job tracking
  • Auditing
  • External influences on other systems

User authentication

A grid job will typically run on a desktop as a different user than the submitting user, so an SOP for user management is very important here. This SOP should include items such as:
  • Creation of users: Who can create users? Is this a decentralized process? Linked to a unified logon such as Windows AD or LDAP?
  • Password management: How often do passwords expire? Will plaintext passwords in a configuration file for job submission be allowed (a small scan for this is sketched after this list)?
  • User access control: What application binaries and input data can a user access?
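
As a small example of turning one of those questions into something enforceable, here is a sketch that scans submission configuration files for embedded plaintext passwords. The directory layout, file naming and line format are hypothetical; adapt the pattern to however your grid stores its submission configs.

```python
import re
from pathlib import Path

# Hypothetical config convention; adjust the glob and pattern to your site.
PASSWORD_LINE = re.compile(r"^\s*password\s*=\s*\S+", re.IGNORECASE)

def find_plaintext_passwords(config_dir: Path) -> list[Path]:
    """Flag submission configs that embed a plaintext password."""
    offenders = []
    for cfg in config_dir.rglob("*.conf"):
        lines = cfg.read_text(errors="ignore").splitlines()
        if any(PASSWORD_LINE.match(line) for line in lines):
            offenders.append(cfg)
    return offenders

if __name__ == "__main__":
    for path in find_plaintext_passwords(Path("/etc/grid/submit.d")):
        print(f"plaintext credential found in {path}")
```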


Application lifecycle
It is of the utmost importance in an audit trail that one can say with certainty that a job was run against a certain specific version of a binary. For instance, Grid MP abstracts binaries into a "Program" object. The following processes must be well-defined:
  • Application version tracking: Are old versions of an application stored and saved in the system?
  • Jobs: Do jobs have a reference to the exact binary or version that was used to compute a certain job? (A minimal sketch of recording this appears after this list.)
  • Verification: Have applications gone through a proper test/stage/production cycle before being promoted to production? See also this entry in this blog.
  • Paper trail: Do all verification steps and application promotions have the proper sign-off and paper trail as required by Part 11?
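
As an illustration of the record-keeping this implies (a sketch, not a description of Grid MP's internals), the snippet below registers each application version with a cryptographic digest and stamps every job record with the exact version it ran against. The registry file and field names are invented for the example.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical store; a real system would use a validated database.
REGISTRY = Path("program_registry.json")

def register_program(name: str, version: str, binary: Path) -> dict:
    """Record a program version together with a SHA-256 digest of its binary."""
    entry = {
        "name": name,
        "version": version,
        "sha256": hashlib.sha256(binary.read_bytes()).hexdigest(),
        "registered": datetime.now(timezone.utc).isoformat(),
    }
    records = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else []
    records.append(entry)
    REGISTRY.write_text(json.dumps(records, indent=2))
    return entry

def job_record(job_id: str, program_entry: dict) -> dict:
    """Every job references the exact registered binary it was computed with."""
    return {"job_id": job_id,
            "program": program_entry["name"],
            "version": program_entry["version"],
            "binary_sha256": program_entry["sha256"]}
```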

Job tracking
Keeping track of jobs and their data is possibly one of the most important and straightforward items in a Part 11 compliant system. These are the aspects that need to be considered:
  • Tracking: Are all job attributes tracked? Is it enough to just list the job submitter? Does one allow modification of an existing job by another user, and if so, are all modifications tracked?
  • Saving: Is job metadata (submitter, date, etc.) tracked? Are results saved? Does one know on which devices jobs ran, and what the state of those devices was at that time (other software, load, etc.)?

Auditing
Auditing and a paper trail are really the core of the Part 11 regulation, from which all of the previous issues arise.
  • Job audit: Is there a log of each and every job in your system? Is there a log of any modifications to a job? Does one track which user created, edited or deleted a job? For instance, Grid MP has a reporting subsystem called MP Insight that will track each job, data and result in the system.
  • Log files: Not everything is typically saved in such reporting systems. Are there other logfiles that need to be retained? Is there a way to authenticate these logfiles and make sure they are not tampered with? A digital checksum (see the sketch after this list)? A hardcopy with a signature?
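
A digital checksum is straightforward to add. As a minimal sketch, the snippet below writes a manifest of SHA-256 digests for a directory of log files and can later report any file whose digest no longer matches; whether a bare checksum satisfies your auditors, versus a signed or countersigned manifest, is a procedural decision. The log location shown is hypothetical.

```python
import hashlib
from pathlib import Path

def write_manifest(log_dir: Path, manifest: Path) -> None:
    """Record a SHA-256 digest for every log file, as simple tamper evidence."""
    lines = []
    for logfile in sorted(log_dir.glob("*.log")):
        digest = hashlib.sha256(logfile.read_bytes()).hexdigest()
        lines.append(f"{digest}  {logfile.name}")
    manifest.write_text("\n".join(lines) + "\n")

def verify_manifest(log_dir: Path, manifest: Path) -> list[str]:
    """Return the names of any files whose current digest no longer matches."""
    tampered = []
    for line in manifest.read_text().splitlines():
        digest, name = line.split(maxsplit=1)
        current = hashlib.sha256((log_dir / name).read_bytes()).hexdigest()
        if current != digest:
            tampered.append(name)
    return tampered

if __name__ == "__main__":
    logs = Path("/var/log/grid")   # hypothetical log location
    write_manifest(logs, logs / "MANIFEST.sha256")
    print(verify_manifest(logs, logs / "MANIFEST.sha256") or "all log files intact")
```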

External influences
Since desktop grids by definition run on hardware that is already used for other tasks as well, the following question arises: even if your grid is not qualified, does it affect other qualified software running on these machines? What if your unqualified job runs on a qualified lab machine doing clinical tests?
  • Application certification: Is there a plan or methodology to certify new binaries to make sure that they do not affect (i.e. crash) other applications outside the sandbox? Is there a sandbox or virtualization technique to shield the grid application from the rest of the machine? Is this certification signed off with the proper paper trail? Which departments should be involved in the sign-off?
  • Tracking: Is there an audit trail of what job with what data ran on a machine at any given time? Is there information on the state of that machine (other installed software base, load) at the time of the grid jobs?


By no means is the above an exhaustive list, but it does give you an idea of the scope of questions and problems that arise when trying to incorporate a grid in a qualified environment. I have worked with various Top-10 pharmas to implement grids in their qualified environments, and as I stated in the beginning, there is no Right Way to do so; each company will have different interpretations of the required level of log gathering, retention and scope. If this all seems daunting to you, then rest assured that Univa UD offers various consulting and service offerings to help you with these decisions. ;-)