By Rich Wellner
Attending the working sessions at various conferences, I hear a theme over and over again: "How can grid computing help us meet our goal of 80% utilization?" People post graphs showing how they went from 20% utilization to 50% and finally 80%. People celebrate hitting this number as if it were self-evidently the right one: the 80% utilized cluster is the well managed cluster. This is the wrong goal.
The way to illustrate this is to ask: how does 80% utilization bring a new drug to market more quickly? How does 80% create a new chip? How does 80% get financial results or insurance calculations done more quickly?
Of course, it does none of those things. 80% isn't even a measure of IT efficiency, though most people use it as one. It's only a statistic about the cluster itself. It is, however, measurable, so it's easy to stand up as an objective the organization can meet. The question to ask is: does an 80% target actually hurt the business?
That target has three problems:
- It takes the focus off the business problem the clusters are solving
- Most people choose the wrong target (80%, rather than 50%)
- We would fire a CFO who measured only costs; why are we willing to measure only costs here?
If your clusters are running at 80%, that means there are long stretches when work is queued up and waiting. Think about the utilization pattern of your cluster. Almost every cluster out there follows one of two patterns. Either it gets busy at nine in the morning, when people start running work, and the queue empties overnight; or it gets busy at three in the afternoon, when people have finished deciding what they need to run overnight, and the queue empties the next morning.
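To see why queue time climbs so quickly with utilization, here is a minimal sketch using the textbook M/M/1 queueing formula. The single-queue model and the 30-minute average job are simplifying assumptions chosen purely for illustration, not a description of any real cluster, but the shape of the curve is the point: average wait roughly quadruples between 50% and 80% utilization and explodes beyond that.

```python
# Minimal sketch: how average queue wait grows with utilization in a
# single-queue M/M/1 model. All numbers are illustrative placeholders,
# not measurements from any real system.

def mm1_wait_time(utilization: float, mean_service_minutes: float) -> float:
    """Expected queue wait in an M/M/1 system: Wq = rho / (mu * (1 - rho))."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    service_rate = 1.0 / mean_service_minutes  # mu, in jobs per minute
    return utilization / (service_rate * (1.0 - utilization))


if __name__ == "__main__":
    # Hypothetical 30-minute average job; swap in your own numbers.
    for rho in (0.20, 0.50, 0.80, 0.95):
        wait = mm1_wait_time(rho, mean_service_minutes=30)
        print(f"utilization {rho:.0%}: average queue wait ~{wait:.0f} minutes")
```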
During the times when the queues are backed up, you are losing time. Those waiting jobs represent people who are waiting: scientists who aren't making progress, portfolio analysts who are trailing the competition, and semiconductor designers who are spending time managing workflow instead of designing new hardware.
For most businesses it's queue time and latency that matter more than utilization rates. Latency is the time your most expensive resources (your scientists, designers, engineers, economists and other researchers) spend waiting for results from the system. Data centers are expensive. Don't get me wrong, I'm not arguing that it's time to start throwing money at clusters without consideration. It's just that understanding how the business operates is critical to determining what the budget should be. Is the incremental cost of another 100 or 1,000 nodes really more than the cost of delaying the results your business needs to remain viable?
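As a rough way to frame that question, here is a back-of-envelope sketch comparing the two costs. Every figure in it is a placeholder I've made up for illustration; the point is only that once you measure queue time, the comparison is trivial to run with your own numbers.

```python
# Back-of-envelope sketch: incremental cost of extra nodes versus the cost
# of researchers waiting on results. All figures below are hypothetical
# placeholders; substitute your own.

EXTRA_NODES = 100
NODE_COST_PER_YEAR = 5_000           # hypothetical fully loaded cost per node
RESEARCHERS = 50
RESEARCHER_COST_PER_HOUR = 100       # hypothetical fully loaded rate
HOURS_SAVED_PER_RESEARCHER_WEEK = 4  # hypothetical reduction in waiting
WEEKS_PER_YEAR = 48

node_cost = EXTRA_NODES * NODE_COST_PER_YEAR
time_recovered = (RESEARCHERS * RESEARCHER_COST_PER_HOUR
                  * HOURS_SAVED_PER_RESEARCHER_WEEK * WEEKS_PER_YEAR)

print(f"Incremental hardware cost: ${node_cost:,} / year")
print(f"Researcher time recovered: ${time_recovered:,} / year")
print("Expansion pays for itself" if time_recovered > node_cost
      else "Expansion does not pay for itself")
```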
Don't be the manager who measures what is convenient rather than what is valuable to the future of your business. Be savvy in your approach. Find ways to understand the behavior of your drug discovery processes on your clusters, even if you are an IT guy instead of a computational chemist. Find ways to demonstrate how reducing cluster latency is turning up the heat on the next chip design. Find ways to measure what keeps your business around, so that you can be part of creating value instead of being viewed by that CFO as nothing more than a cost center to be optimized away.
The message is that cost is only one part of the equation. Likely, it's even a minor part. Don't get lost measuring the price of your stationery when it's the invoices you're putting in the envelopes that matter.