Friday, September 28, 2007

The Grid and Hosting

By Rich Wellner

Jeremy Sherwood from opus:interactive has a good write-up of HostingCon 2007.



My experience with grid computing goes back to the late 1990s with distributed.net, helping make encryption that much more secure. The technology was originally designed to harness unused CPU cycles to solve complex problems; now it is being used to host a virtually unlimited number of hosting environments. The level of reliability and scalability the system offers is amazing. The ability to grow resources on the fly, at an essentially unlimited rate, with little to no exposure to change is outstanding. The other great aspect of this technology is its contribution to a sustainable mindset. Done properly, old servers and hardware that would normally be recycled at the end of their life cycle can instead be reprovisioned back into a production environment with little concern about the impact of hardware failure. This rejuvenation of hardware opens up a great opportunity to get that much more out of your initial investment and to pass those savings on to the customer.

Three Reasons Why High Utilization Rates Are the Wrong Thing to Measure

By Rich Wellner

Attending the working sessions at various conferences, I hear the same theme over and over again: "How can grid computing help us meet our goal of 80% utilization?" People post graphs showing how they went from 20% utilization to 50% and finally 80%. Hitting that number is celebrated as if it were an axiom: the 80% utilized cluster is the well-managed cluster. This is the wrong goal.



The way to illustrate this is to ask: how does 80% utilization bring a new drug to market more quickly? How does 80% create a new chip? How does 80% get financial results or insurance calculations done more quickly?



Of course, it does none of those things. 80% isn't even a measure of IT efficiency, though most people use it as such. It's only a statistic that deals with a cluster itself. It is, however, measurable, so it's easy to stand up as an objective that the organization can meet. The question to ask is, does an 80% target actually hurt the business of the company?



That target has three problems:

  • It takes the focus off the business problem the clusters are solving
  • Most people choose the wrong target (80%, rather than 50%)
  • We would fire a CFO who only measured costs, so why are we willing to measure only costs here?



If your clusters are running at 80%, that means you have a lot of periods when work is queued up and waiting. Think about the utilization pattern of your cluster. Almost every cluster out there is in one of two patterns. Either it is busy starting at nine in the morning when people start running work and the queue empties overnight, or it is busy starting at three in the afternoon when people have finished thinking about what they need to run overnight and the queue empties the next morning.



During the times when the queues are backed up, you are losing time. Those waiting jobs represent people who are waiting: scientists who aren't making progress, portfolio analysts who are trailing the competition and semiconductor designers who are spending time managing workflow instead of designing new hardware.



For most businesses, it's queue time and latency that matter more than utilization rates. Latency is the time that your most expensive resources, your scientists, designers, engineers, economists and other researchers, spend waiting for results from the system. Data centers are expensive. Don't get me wrong, I'm not arguing that it's time to start throwing money at clusters without consideration. It's just that understanding the way the business operates is critical to determining what the budget should be. Is the incremental cost of having another 100 or 1000 nodes really more than the cost of delaying the results that your business needs to remain viable?
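To make that concrete, here is a minimal sketch (mine, not from the original argument) of computing both numbers from a scheduler's accounting records. The field names and the example figures are assumptions for illustration; the point is that a window can look 80% utilized while every job, and every person behind it, waited hours to start.

    from dataclasses import dataclass

    @dataclass
    class Job:
        submit: float  # epoch seconds when the user submitted the job
        start: float   # epoch seconds when the scheduler started it
        end: float     # epoch seconds when it finished
        cpus: int      # slots the job occupied

    def utilization(jobs, cluster_cpus, window_start, window_end):
        """Fraction of available CPU-seconds actually consumed in the window."""
        busy = sum((min(j.end, window_end) - max(j.start, window_start)) * j.cpus
                   for j in jobs if j.end > window_start and j.start < window_end)
        return busy / (cluster_cpus * (window_end - window_start))

    def mean_queue_latency(jobs):
        """Average time jobs (and the people behind them) sat waiting to start."""
        return sum(j.start - j.submit for j in jobs) / len(jobs)

    # Ten jobs submitted at 9am, run back to back on a 100-CPU cluster.
    jobs = [Job(submit=0, start=3600 * i, end=3600 * (i + 1), cpus=80) for i in range(10)]
    print(utilization(jobs, cluster_cpus=100, window_start=0, window_end=36000))  # 0.8
    print(mean_queue_latency(jobs) / 3600)  # 4.5 hours of average wait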



Don't be the manager who measures what is convenient rather than what is valuable to the future of your business. Be 'savvy' in your approach. Find ways to understand the behavior of your drug discovery processes on your clusters, even if you are an IT guy instead of a computational chemist. Find ways to demonstrate how your approach of reducing cluster latency is turning up the heat on the next chip design. Find ways to measure what keeps your business around so that you can be part of the process of creating value instead of being viewed by that CFO as nothing more than a cost center to be optimized away.



The message is that cost is only one part of the equation. Likely, it's even a minor part of the equation. Don't get yourself lost measuring the price of your stationery when it's the invoices you're putting in the envelopes that matter.

Thursday, September 27, 2007

Warning: Don't Patronize Your Users

By Rich Wellner

One of my favorite quotes is from E.B. White:



No one can write decently who is distrustful of the reader's intelligence, or whose attitude is patronizing.



Pawel Plaszczak and I certainly took this sort of goal seriously when we wrote our Savvy Manager's Guide. You should take it seriously when you design your grid.



The single biggest mistake people make is not trusting their users to provide reasonable requirements. Designers and architects go out and talk to users, then write off the feedback they get as general guidance rather than hard requirements.



Google, as an example, took their users seriously from day one. They could have created yet another site so littered with ads that it was unreadable, but instead created a user experience that is now the subject of design classes. You can do the same. Talk to your users. Spend a day understanding how they interact with their system. Get a bit deeper into the business issues that justify the IT expenses that feed your children and pay your mortgage.



Take your users seriously, feel their pain and be their hero.

Wednesday, September 26, 2007

Stop Wasting Time

By Rich Wellner

Seth wrote:

The traffic engineers in New York think nothing of wasting two minutes of each person's time as they approach a gated toll booth. Multiply that two minutes times 12,000 people and it's a lot of hours every day, isn't it?



The truth in this is obvious, and it applies to the grid also. I've talked with folks who have hundreds of engineers, each spending a third of their time managing jobs and data on their clusters. That's a lot of wasted time that could be spent advancing their business. Even on an expense basis alone, that's $3M in labor costs. And that's on the low end.
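A quick back-of-the-envelope check on that figure; the headcount and the fully loaded cost per engineer here are my assumptions, not numbers from the post:

    engineers = 100                     # "hundreds" -- take the low end
    fraction_on_job_wrangling = 1 / 3   # a third of their time managing jobs and data
    cost_per_engineer = 90_000          # assumed fully loaded annual cost in USD

    wasted = engineers * fraction_on_job_wrangling * cost_per_engineer
    print(f"${wasted:,.0f} per year")   # $3,000,000 per year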



There were significant wins in moving from SMP boxes to clusters. There were significant wins in moving from clusters to grids. Now it's time to realize the next win by managing your grid effectively.

Tuesday, September 25, 2007

Why Are 92% Of Users Waiting for I/O?

By Rich Wellner

IDC recently published a finding that 92% of users have applications that are I/O constrained. That's a shockingly large number given the options that exist for reducing this pain. Let's break this down into three major categories:



  • Working storage
  • Near-line storage
  • Long-term storage


The nature of each of these domains is very different and the options available to reduce the problems are similarly different.



Working Storage



Administrators unfamiliar with the demands of certain classes of applications will sometimes mount their scratch disk as (oh, the horror!) NFS. Over the past five years the average knowledge level has crept up as the community has grown and gained experience, but I have seen in the last couple of months that there are still clusters out there that are doing significant I/O to slow network scratch.



Near-line Storage



Near-line storage typically consists of disk, NAS or SAN devices. People have a tendency not to buy enough bandwidth (either in the form of network I/O or aggregate disk I/O), or they view the purchase of these facilities as a one-time expense, failing to keep up with their users' expanding demand as time progresses.



Long-term Storage



Migrating data to tape is the only game in town for long-term, high-capacity storage (so far the decade-old promise of using disk for long-term storage still seems to be a decade away). The problem is that with drives and automated libraries costing enormous amounts of money, coupled with the latencies inherent in this type of storage, applications are left sitting and tapping their feet waiting for data to stream in.



The Solution



The services in the grid must be programmed to be aware of the data they require. An early example of this is the DDM system in incubation in dev.globus. This system knows what data is located in different resources on the grid and can thus be integrated with workflow and scheduling systems to pre-stage data to working storage before the application is started. This completely eliminates two of the three I/O constraints, and the last one is the easiest and cheapest. Just stop mounting your scratch on NFS...
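Here is an illustrative sketch of that pre-staging idea. It is not the DDM interface itself; it simply copies a job's inputs from near-line storage to fast local scratch (using globus-url-copy, the GridFTP client) before the application runs, so the job never does heavy I/O over NFS. The function names, URLs and paths are made-up examples.

    import subprocess
    from pathlib import Path

    SCRATCH = Path("/scratch/jobs")

    def prestage(job_id: str, inputs: list[str]) -> Path:
        """Copy each input URL to local scratch and return the working directory."""
        workdir = SCRATCH / job_id
        workdir.mkdir(parents=True, exist_ok=True)
        for url in inputs:
            dest = workdir / Path(url).name
            # GridFTP transfer; swap in scp/rsync if that's what your site runs.
            subprocess.run(["globus-url-copy", url, f"file://{dest}"], check=True)
        return workdir

    def run_job(job_id: str, inputs: list[str], command: list[str]) -> None:
        workdir = prestage(job_id, inputs)
        # Launch the application only after its data is already on fast storage.
        subprocess.run(command, cwd=workdir, check=True)

    # Hypothetical usage:
    # run_job("chem-0042",
    #         ["gsiftp://storage.example.org/data/ligands.tar"],
    #         ["./dock", "ligands.tar"])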

Saturday, September 22, 2007

Why You Should Never Rip and Replace

By Rich Wellner


At Univa UD we deal with a variety of different customers. These folks are trying to solve business problems in semiconductor, life science, financial services, big science and lots of other sectors. What they have in common is that they have existing infrastructure and are hyper-concerned about business disruption while moving in a new direction. They should be.



There are a lot of ways to approach grid computing that require you to replace what you have with something new. This is particularly the case for vendors of proprietary tools. These tools are built on proprietary protocols that make it difficult to integrate other services or applications. Combine these two issues and it can be tough to get anything bigger than a cluster up and running. If you already have a cluster, or more, up and running, this disruption will have a real impact on your ability to accomplish your goals.



To borrow an old saying, you want your approach to be evolutionary rather than revolutionary. This means moving in a new direction using a phased delivery that allows existing work or research to continue without interruption.



With Globus, this is achieved by creating an additional layer atop existing resources. A common security platform is built on local security layers. A common job submission mechanism replaces product-specific ones. A monitoring system that can aggregate information from multiple sources replaces those that only report data from their specific resource.
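As a rough illustration of what that common submission layer can look like, here is a minimal sketch of one submit call that fans out to whatever each cluster already runs. The resource names, contact strings and commands are assumptions, not a real site configuration.

    import subprocess

    # Existing clusters keep their native schedulers; the layer just knows
    # how to talk to each of them.
    RESOURCES = {
        "chem-cluster": {"type": "pbs"},                                  # legacy PBS cluster
        "eda-cluster":  {"type": "gram", "contact": "grid.example.org:8443"},
    }

    def submit(resource: str, script: str) -> None:
        kind = RESOURCES[resource]["type"]
        if kind == "pbs":
            # Native submission, unchanged for existing users and workflows.
            subprocess.run(["qsub", script], check=True)
        elif kind == "gram":
            # The same job routed through the common Globus layer instead.
            contact = RESOURCES[resource]["contact"]
            subprocess.run(["globusrun-ws", "-submit", "-F", contact,
                            "-c", "/bin/sh", script], check=True)
        else:
            raise ValueError(f"unknown resource type: {kind}")

    # submit("chem-cluster", "dock.pbs")   # runs exactly as it always has
    # submit("eda-cluster",  "route.sh")   # goes through the new common layer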



With these steps in place, new users, applications and clusters can be provisioned in ways that allow flexible cluster usage, better aggregate throughput and higher cluster utilization rates. Then, as time permits, existing applications -- and particularly scripts and workflows -- can be ported from their existing platform to interfaces that will allow them to utilize all the bandwidth available in the organization. Dig?