Friday, December 21, 2007

Grid Truths

By Rich Wellner

Thanks to Brand Autopsy, I'm reminded of Google's Ten Things statement. Some of the things they consider to be core to their success are also core to making grid computing work.



Focus on the user and all else will follow



For Google this means a clean interface, fast response and honest results. Globus Toolkit and Cluster Express follow this principle really well.



In the case of Globus Toolkit, its focus has long been on allowing users to gain access to a wide variety of resources in a user-centric model. The philosophy is that the user gets a single sign-on and can use it to move data, get monitoring information, submit jobs and delegate authority to other entities on the grid.



For Cluster Express the idea is to take Globus Toolkit and make it dead simple to install, even while combining it with some of the other best open source tools around, like Grid Engine and Ganglia. These are the tools users and administrators need to operate a cluster. The result has been a thousand downloads in the first few weeks this user-focused package has been available.



It's best to do one thing really, really well



Grid computing is about managing resources effectively. That's it. We're not about making hardware, finding oil reserves, curing cancer or projecting financial markets. Because of that application agnosticism, grid computing works in all those domains and hundreds more.



Democracy on the web works



Grid computing, especially in the open source world, works via standards and many people working together on many different pieces of a complete solution. Scientific users like those at Argonne National Lab and Fermilab collaborate on data storage and movement standards. The best ideas really do win, and the results are solutions that scale to billions of files and transfer rates at theoretical maximums even on multi-gigabit links.



You don't need to be at your desk to need an answer.



With previous generations of cluster management tools, an admin or user had to be at their own computer to understand in any detail what state their jobs were in. The grid not only lets them access resources across multiple computers, it also lets them manage their workload from web portals, freeing them from any particular desk.



The need for information crosses all borders.



One of the largest stumbling blocks users used to face in HPC was getting access to the data they needed. Tools like Reliable File Transfer Service, Replica Location Service and GridFTP allow information to be scheduled and moved on a global basis.



You can be serious without a suit.



Amen.



Great just isn't good enough.



Grid computing is 11 years old now and has been helping to do great science for most of those years. But the developers and users keep pushing the envelope and finding new and better ways to get more done.



Expect that to continue next year.

Monday, December 17, 2007

Top Four Things Cisco Learned Working on Open MPI

By Jeff Squyres

This entry was written by guest blogger Jeff Squyres from Cisco Systems. I met him at SC07 when I attended his Open MPI presentation in the Mellanox booth. He did a great job, much better than most of the presentations at tech conferences, and agreed to share with our readers some of his thoughts on how big companies can work effectively in an open source project.



The general idea of my talk is to help answer the question "Why is Cisco contributing to open source in HPC?" Indeed, much of Cisco's code is closed source. Remember that our crown jewels are the various flavors of IOS (the operating system that powers Cisco Ethernet routers); many people are initially puzzled as to why Cisco is involved in open source projects in HPC.



The short/obvious answer is: it helps us.



Cisco is a company that needs to make money and has a responsibility to its stockholders. We sell products in the HPC space and therefore need a rock-solid, high-performance MPI that works well on our networks. Many customers demand an open source solution, so it is in our best interests to help provide one rather than partially or wholly rely on someone else to provide one. In particular, some of these interests include (but are not limited to):



  • Having engineers at Cisco who can provide direct support to our customers who use open source products
  • Being able to participate in the process and direction of open source projects that are important to us (vs. being an outsider)
  • Leveraging the development and QA resources of both our partners and competitors -- effectively having our efforts magnified by the open source community (and vice versa)
  • Shortening the time between research and productization; working directly with our academic partners to turn today's whacky ideas into tomorrow's common technology


Think of it this way: only certain parties can mass-produce high quality hardware for HPC (i.e., vendors). But *many* people can help produce high quality software -- not just vendors. In the context of this talk, customers (including research and academic customers) have the expertise and capability to *directly* contribute to the software that runs on our hardware. HPC history has proven this point. We'd therefore be foolish to *not* engage HPC-smart customers, researchers, academics, partners, competitors, ...anyone who has HPC expertise to help make our products better. I certainly cannot speak for others, but I suspect that this rationale is similar to why other vendors participate in HPC open source as well.



Let's not forget that participation in HPC open source helps everyone, including by growing the overall size of the HPC market. Here's one example: inter-vendor collaboration, standardization, and better interoperability mean happy customers. And happy customers lead to more [happy] customers.



We have learned many things while participating in large open source projects. Below are a few of the nuggets of wisdom that we have earned (and learned). In hindsight, some are obvious, but some are not:

  • Open source is not "free" -- someone has to pay. By spreading the costs among many organizations, we can all get a much better rate of return on our investment.
  • Consensus is good among the members of an open source community (e.g., some members are only participating out of good will), but not always possible. Conflict resolution measures are necessary (and sometimes critical).
  • Just because a project is open source does not guarantee that it is high quality. Those who are interested in a particular part of a project (especially large, complex projects where no single member knows or cares about every aspect of the code base) need to look after it and ensure its quality over time.
  • Differences are good. The entire first year of the Open MPI project was a struggle because the members came from different backgrounds, carried different biases, and held different core fundamentals to be true. It took a long time to realize that exactly these differences are what make a good open source project strong. Heterogeneity is good; differences of opinion are good. They lead to discussion and resolution down to hard, technical facts (vs. religion or "it's true because I've always thought it was true"), which results in better code.



True open source collaboration is like a marriage: it takes work. A lot of hard, hard work. Disagreements occur, mistakes happen, and misunderstandings are inevitable (particularly when not everyone speaks the same native language). But when it all clicks together, the overall result is really, really great. It makes all the hard work well worth it.

Friday, December 14, 2007

Ten Years of Distributed Computing with distributed.net

By Ivo Janssen

Grids come in many shapes and forms, and one of them is the Global Public Grid. Often presented in a philanthropic wrapper, these grids harness the power of thousands, if not hundreds of thousands, of computers, often residential personal PCs. Among the most well-known to the general public are Seti@Home, Folding@Home, IBM's World Community Grid and United Devices' now-retired Cancer Research project. Apart from running the Cancer Research project as an employee of United Devices and being involved with the World Community Grid as a vendor to IBM, I have also been involved in a somewhat lesser-known but much longer-running public grid project.



One of the longest-running public grid projects is distributed.net, a project that I have been part of since its inception in early 1997. Earlier this year, we celebrated 10 years of "crunching", contributing to various projects in such fields as cryptology and mathematics. It might be a lesser-known project in the eyes of the larger public, but it has still generated a lot of participation amongst computer enthusiasts, and it has even won a few awards, most notably CIO Magazine's recognition of Jeff Lawson, my coworker and one of the founders of distributed.net, as the most notable person in IT for 1997.



Back in 1997, distributed computing was a very novel concept, but it got jumpstarted by RSA's "RC5-32 Secret Key Challenge", which set out to prove that 56-bit RC5 was no longer a secure algorithm due to the increasing speed of computers. In early 1999, distributed.net also proved that DES, another 56-bit algorithm, was getting weak by brute-forcing a secret message in 22 hours, 15 minutes and 4 seconds.



Over the years, distributed.net has undertaken 3 RC5 projects, 2 OGR projects and 3 DES projects, utilizing over 300,000 participants running on 23 different hardware and software platforms, and it's still going strong. In October 2007, various staff members of our global team came to Austin, Texas, for a "Code-a-thon", working on the statistics back end to provide our "members" with better individual stats, as well as on a new project that we're planning to roll out in the next couple of months.



So if you have any spare cycles on your home computers, why not give distributed.net a try?

Wednesday, December 12, 2007

Proper Testing Environments

By Roderick Flores

It continues to amaze me how many businesses do not have tiered development environments. Moreover, many of these same companies maintain a very sophisticated production environment with strict change management procedures. Yet somehow they feel it is acceptable to keep inadequate staging environments.



However, we know better: a truly supportive development environment, in the parlance of the agile-development community, must contain a series of independent infrastructures each serving a specific risk-reduction purpose. The idea is that by the time a release reaches the production environment, all of its desired functionality should have been proven as operational.  A typical set of tiers might include:



  • Development – an uncontrolled environment
  • Integration – a loosely controlled environment where the individual pieces of a product are brought together.
  • Quality Assurance – a tightly controlled environment that mirrors production as closely as possible.
  • Production – where your customers access the final product.



Check out Scott Ambler’s diagram of a supportive environment on Dr. Dobb's to see a logical organization of this concept.



So what happens if you cut corners along the way?   I am sure you all know what I am talking about.  Here are a couple of my past favorites (in non-grid environments):




Situation:  You combine quality assurance with the integration and/or the development environments.


Result: New releases for your key products inexplicably fail in production (despite having passed QA testing) because your developers made changes to the operating environment for their latest product.




Situation: You test a load balanced n-tier product on a small (two to three machines) QA environment.


Result: The application exhibits infrequent but unacceptable data loss because updates from one system are overwritten by those from another. This is particularly onerous to uncover because the application does not fail in the QA environment.



Presumably, we grid managers do all that we can to provide test frameworks adequate to avoid problems such as these. There are many texts that discuss the best practices for supportive development environments. Unfortunately, I have found that many of us forget one of our core lessons: everything becomes much more complicated at grid scales. Consequently, we are perfectly willing to use a QA environment scoped for a small cluster.





In particular many of us prefer to limit our QA environments to a few computation nodes.  Thus we choose to run our load tests on our production infrastructure.  Conceptually, this makes sense: we cannot realistically maintain the same number of nodes in QA as are in production, so why keep a significant number when we will end up running some of our tests out there anyway?  Sadly, this approach severely complicates performance measures.



For example, assume that my test plans dictate that I run tests ranging from one to sixty-four nodes for a particular application.  If I run this in production, I am essentially getting random loads on the SAN, network, and even the individual servers to which I am assigned.  Consequently I have to run each individual test from the plan repeatedly until I am certain that I have a statistically significant sample of grid states.  Yet I have only defined my capacity on the grid for the average of the utilization rates during my testing.  Any changes to capacity on the grid such as a change in usage patterns or the addition of resources will invalidate my results.



Clearly, I need to run the application on a segregated infrastructure to get proper theoretical performance estimates. The segregated infrastructure, like any QA environment, should match production as closely as possible. However, in order to eliminate external factors that seriously affect performance, it is imperative that you use isolated network equipment as well as storage. Another advantage of this approach is that we reduce the risk of impacting production capacity with a runaway job. Recall too that testing in production takes a large number of runs to produce numbers that average out current load factors and thus approach theory, and that obviously impacts the grid users’ productivity.



As we noted earlier, we cannot justify a QA environment that is anything more than a fraction of production.  However I am certain that eight nodes is not enough. Certainly QA should contain enough nodes to adequately model the speed-up that your business proponents are looking for in their typical application sets.  It would not hurt to do some capacity planning at this point.  In absence of that, thirty-two computation nodes is the minimum size I would use for a grid which is expected to contain several hundred nodes.   



Finally, once we have a reasonable understanding of the theoretical capabilities of the application, then we should re-run the performance tests under production loads.  This will help us understand the lost productivity of our applications under load.  In turn this could help justify the expense of additional resources even if utilization rates cannot. 



I know you are asking, “how do I justify the expense of a large QA environment?”  Well, just think about the time you will save during your next major change to your operating systems and how you have to test ALL of the production applications affected before you migrate that change into production.  Would you prefer to do this on a few nodes, take several out of production, or just get it done on your properly sized test environment?

Tuesday, December 4, 2007

What You Need to Know About Cluster Express 3.0

By Ivo Janssen

At Supercomputing 2007, Univa UD launched Cluster Express 3.0 beta. If you were at SC'07, you might have attended one of my demos on Cluster Express, but if you missed it, then this blog post is for you. I will cut through the marketing speak for you and, as an engineer who worked on the CE3.0 release, tell you what Cluster Express can mean to you. 



Cluster Express is designed to be your one-stop-shop for a full cluster software stack. This means we bundle the scheduler, the security framework, the cluster monitoring and an easy installer that will configure everything out of the box. On top of that, the whole solution is open sourced, including all the code that Univa UD contributed to the stack. You can go to our new community at www.grid.org and download the CE3.0 beta and its sources right now.



So let's go through all the components in more detail.



Installer
Our installer is a very simple utility that will ask you fewer than five questions, after which it will go off and install the main nodes, the execution nodes, and any remote login nodes. It then will tie all these nodes together through a bootstrap service that is installed on the main node. This lets all the other nodes retrieve configuration information from the main node. The end result is that a fully configured cluster emerges, with a sensible default configuration for the Grid Engine scheduler and the Ganglia monitoring, and all the certificates for security and authentication set up properly.



Scheduler
We bundle Grid Engine, and the installer configures all the nodes in such a way that after you run it on an execution node, that node becomes part of the cluster automatically, with sensible defaults for queues, communication, and scheduling settings.
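As a quick sanity check after installation (this is just an illustration, not a procedure from the CE3.0 documentation), you can submit a trivial job and watch it pass through the default queue:

qsub -b y -N hello /bin/hostname   # -b y submits /bin/hostname as a binary rather than a script
qstat                              # the job should appear briefly and then leave the queue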



Monitoring
We bundle, install, configure, and use various cluster monitoring tools such as Ganglia and ARCo and tie everything together in a custom Monitoring UI that we wrote and delivered as part of the CE3.0 release. The Monitoring UI is not a third-party bundled tool but really a new add-on to our solution. It brings together the system-level statistics that Ganglia offers with the job-level statistics that ARCo logs from Grid Engine. By presenting them together in one UI, you can cross-reference jobs with the nodes that they ran on and the loads on those hosts. This will allow you, for instance, to instantly see the impact of running a job or task on a certain node, in real time and through an easy-to-use graphical UI.



Security
We bundle and pre-configure many Globus Toolkit components such as MyProxy, Auto-CA, RFT, WS-GRAM, GridFTP and GSI-OpenSSH. Auto-CA and MyProxy are completely configured out of the box, so that the only thing you need to do is a simple myproxy-logon to acquire a token that is valid for use with all the other Globus commands such as globus-url-copy or globusrun-ws. The level of integration that we accomplished for all the GT components will definitely impress you, especially if you've been a Globus user before.
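Here is a hedged sketch of that out-of-the-box workflow; the hostname and paths are illustrative, not taken from the CE3.0 documentation:

myproxy-logon -s headnode.example.com -l alice                                      # obtain a short-lived proxy credential
globus-url-copy file:///tmp/input.dat gsiftp://headnode.example.com/data/input.dat  # stage data with GridFTP
globusrun-ws -submit -F headnode.example.com -c /bin/hostname                       # run a job through WS-GRAM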



Putting it all together
As mentioned, bundling all of the above components in a tarball with an easy-to-use installer makes setting up a fully featured cluster as simple as downloading one file and running one command. This is really as easy as we could make it! And on top of that, everything is open-sourced, including our own add-ons such as the installer and configuration scripts, and the Monitoring UI.



I hope that I can welcome you soon to our new community website around Cluster Express at www.grid.org. You can download the CE3.0 tarball there, and participate in forums, add to our wiki, or get support through our mailing lists.



I'm user "Leto" on grid.org, please don't hesitate to send me a private message there if you need any help at all.

Monday, December 3, 2007

How to Enable Rescheduling of Grid Engine Jobs after Machine Failures

By Sinisa Veseli

Checkpointing is one of the most useful features that Grid Engine (GE) offers. Because the status of checkpointed jobs is periodically saved to disk, those jobs can be restarted from the last checkpoint if they do not finish for some reason (e.g., due to a system crash). In this way, the possible loss of processing for long-running jobs is limited to a few minutes, as opposed to hours or even days.



When learning about Grid Engine checkpointing I found the corresponding HowTo to be extremely useful. However, this document does not contain all the details necessary to enable checkpointed job rescheduling after machine failure. If you'd like to enable that feature, you should do the following:



1) Configure your checkpointing environment using the “qconf -mckpt” command (use “qconf -ackpt” to add a new environment), and make sure that the environment’s “when” parameter includes the letter ‘r’ (for “reschedule”). Alternatively, if you are using the “qmon” GUI, make sure that the “Reschedule Job” box is checked in the checkpoint object dialog box.



2) Use the “qconf -mconf” command (or the “qmon” GUI) to edit the global cluster configuration and set the “reschedule_unknown” parameter to a non-zero time. This parameter determines whether jobs on hosts in an unknown state are rescheduled and thus sent to other hosts. The special (default) value of 00:00:00 means that jobs will not be rescheduled from the host on which they were originally running.



3) Rescheduling is only initiated for jobs that have the rerun flag activated. Therefore, you must make sure that checkpointed jobs are submitted with the “-r y” option of the “qsub” command, in addition to the “-ckpt <ckpt_env_name>” option.
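As a hedged illustration of steps 2 and 3 (the checkpoint environment name and job script below are made up for this example):

qconf -sconf | grep reschedule_unknown        # verify the global setting, e.g. 00:15:00
qsub -ckpt my_ckpt -r y long_running_job.sh   # submit with checkpointing and the rerun flag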



Note that jobs that are not using checkpointing will be rescheduled only if they are running in queues that have the “rerun” option set to true, in addition to being submitted with the “-r y” option. Parallel jobs are only rescheduled if the host on which their master task executes gets into an unknown state.

Saturday, December 1, 2007

Grid Engine 6.1u3 Release

By Sinisa Veseli

A few days ago the Grid Engine project released version 6.1 Update 3 of its software. This is a maintenance release (see the original announcement), so the software does not yet contain the advance reservation features. However, it has quite a few interesting bug fixes. In particular, a few issues with qstat output have been fixed (mostly differences between plain text and XML output), and a couple of ARCo problems have been resolved. The fix for a bug affecting users with a very large primary group entry has made it into this release as well.



The new version of the software is available for download here.

Reservation Features Come to Grid Engine

By Sinisa Veseli

The next major update release of the Grid Engine software will contain advance reservation (AR) features (see the original announcement). This functionality will allow users or administrators to manipulate reservations of specific resources for future use. More specifically, users will be able to request a new AR, delete an existing AR, and show granted ARs. As of the reservation start time, the reserved resources will only be available to jobs submitted against that reservation.

In order to support the AR features, a new set of command line interfaces is being introduced (qrsub, qrdel and qrstat). Additionally, existing commands like qsub will get new switches, and the qmon GUI will get a new panel for submitting, deleting, and listing AR requests. It is also worth noting that the default qstat output might change.
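To give a flavor of what this might look like, here is a purely illustrative sequence; the actual switches and formats in the snapshot and the final release may well differ:

qrsub -a 01151200 -d 02:00:00 -pe mpi 16   # request a reservation: start time, duration, resources
qrstat                                     # list granted reservations
qsub -ar 42 job.sh                         # submit a job against reservation id 42
qrdel 42                                   # delete the reservation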

If you are anxious to try it out, the latest Grid Engine 6.1 snapshot binaries containing the new AR features are available for download here. Note, however, that this snapshot (based on the 6.1u2 release) is not compatible with prior snapshots or versions, and that an upgrade procedure is currently not available.

Tuesday, November 27, 2007

Unleash the Monster: Distributed Virtual Resource Management

By Roderick Flores

Recently we explored the concept of a virtualized grid: a system where the computation environment resides on virtualized operating environments. This approach simplifies the support of the grid-user community’s specialized needs. Further, we discussed the networking difficulties that arise from instantiating systems on the fly, including routing and the increased network load that distributing images to hypervisors would create. However, we have not yet discussed how these virtualized computational environments would come to exist for the users at the right time.



The dominant distributed resource management (DRM) products do not interact with hypervisors to create virtual machines (VMs). Two notable exceptions are Moab from Cluster Resources and Grid MP from Univa UD.  Moab supports virtualization on specific nodes using the node control command (mnodectl); however, the VMs are not created on the available nodes as needed.



Consequently grid users who wish to execute their jobs on a custom execution environment will have to follow this procedure:



  • Determine which nodes were provided by the DRM's scheduler.  If any of these nodes are running default VMs for other processes, these may need to be modified or suspended in order to free up resources;
  • Create a set of virtual machines on the provided nodes;
  • Distribute their computation jobs to each of those machines once they are sure they have entered a usable state;
  • Monitor computation jobs for completion; and
  • Finally, once you are certain the jobs are complete, tear down the VMs.  You may be required to restore any VMs that existed before you started.


Sadly, the onus is on the user to guarantee that there are sufficient images for the number of requested nodes.  They are also required to notify the DRM of which resources the job will consume during the computational process.  If this is not done, additional processes could be started on the same node and resource contention could result.
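Here is a purely illustrative sketch of that manual procedure; drm_submit, vm_start, and vm_stop are hypothetical commands standing in for whatever DRM and hypervisor tooling is actually in use:

NODES=$(drm_submit --request-nodes 8)           # 1. obtain nodes from the scheduler
for n in $NODES; do
  vm_start --host "$n" --image custom_env.img   # 2. boot the custom VM on each node
done
for n in $NODES; do
  ssh "vm-$n" /opt/app/run_job.sh &              # 3. distribute the computation jobs
done
wait                                             # 4. watch for completion
for n in $NODES; do
  vm_stop --host "$n"                            # 5. tear the VMs back down; restore any prior VMs
done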



In addition to the extra responsibilities put upon the grid user, they will also lose many of the advantages that resource managers typically offer.  There is no efficiency associated with managing the resources of the VMs beyond their single use.  If a particular environment could be used repeatedly, that operation must be managed by the user.  Also, the DRM can only see and preempt the wrapper job that started the virtual machines, not the computational jobs running inside them.  If that wrapper process is preempted, neither the computational jobs nor the VMs will actually be affected.  If other jobs are typically run on a default VM, there could be issues.  Finally, the user may lose some of the more sophisticated capabilities built into the resource manager (such as control over parallel environments).



All of these issues could be solved by tightly integrating the DRM with the dominant VM hypervisors (managers).  The DRM should be able to start, shutdown, suspend, and modify virtual environments on any of the nodes under its control.  It should also be able to query the state of the physical machine and all of its operating VMs.  Ideally either the industry and/or our community would come to consensus on an interface that all hypervisors should expose to the DRM.  If we put our minds to it, we could describe any number of useful features that a DRM could provide when integrated with virtual machine managers; these concepts simply need to be realized to make this architecture feasible. 



Here are my thoughts about what a resources manager in a virtualized environment might provide:



  • It could be able to rollback an image to its start state after a secure process was executed on it.
  • It could be aware of the resources each VM is limited to so that it could most efficiently schedule multiple machines per physical node.
  • It should distinguish between access-controlled VMs versus public instances to which it may schedule any jobs.
  • It should stage the booting of VMs so that we do not flood the network by transferring operating system images.  A sophisticated DRM might even transport images to local storage before the node's primary resources are free.  Readers of the previous posts will recall that the hypervisor interactions should be on a segregated network so as not to interfere with the computational traffic. 
  • It could suspend VMs as an alternative to preempting jobs.  Similarly, it could suspend a VM, transport its image to another physical node, and restart it.  If the DRM managed output files as resources, it could prohibit other processes from writing to the files still open from the suspended systems.
  • It could run specialized servers for two-tier applications and modify the resource allocation for the VM should it become resource constrained.


I am sure that other grid managers could improve on this list as well as append other excellent ideas to it.



In summary, we have examined the flexibility that a grid with virtualized nodes provides.  As clusters evolve from dedicated systems for a homogeneous user community into grids serving a diverse set of user requirements, I believe that grid managers will require the virtualized environment that we have been exploring.  Clearly the key to creating this capability is to integrate hypervisors into our resource managers; without that integration, VM management is simply too complicated for the scale we are targeting.



Thus far, nothing that we have explored helps us manage and describe the dynamic system that this framework requires (as I am sure you have noticed).  Is this architecture a Frankenstein's monster that will turn on its creators?  Next time we will explore how we might monitor and create reports for a system that changes from one moment to the next.

Monday, November 19, 2007

Five Ways to Improve Your Hiring Tactics

By Rich Wellner

The company I work for, Univa UD, is hiring and I was sitting down with one of the managers to talk about approaches. Since long before I joined Univa UD, I've been very interested in recruiting as I ran a few small companies and hired on an international basis. Recruiting is the single most important thing that we do. Everything else -- serving our customers, building insanely great products, profiting or creating a fun workplace -- is the result of hiring well.



Talking with that hiring manager, we put together the essential five tactics for managing the candidate acquisition process:



  • When using an online system, buy a multi-month plan. Even if you are feeling a cost crunch, it's unrealistic to believe that the right candidate will walk through the door during the first couple of weeks that you're engaged in a search. If you are staffing more than one position, this is even more true. We will be hiring for more than a month, so we're going to buy access for more than a month.
  • Spread the work across multiple hiring managers. Recruiting is work. This is important, so I'll say it again. Recruiting is work. Treat it like the important work that it is and make sure the right people are involved in the process. If there are folks who are wordsmithing geniuses, get them involved in the production of the posts. If you have people who are brilliant at interviewing, make sure they are talking with candidates even if those candidates will report to another manager. Conversely, your company's success depends on everyone performing well. Be generous with your time and add value to the hiring processes of the other teams in your company.
  • Spend a couple hours reading sites like Copy Blogger. It has great tips on making your writing better and, let's face it, a help wanted ad is marketing. We need to stand out among the thousands of other companies if we want to attract the best people.
  • Update your ads at least once each week. This shouldn't be a rewrite, but each time you update your ad it pops back to the top of the search stack. This may seem like gaming the system, but it works. When I had a break in hiring this summer and stopped doing this, I immediately noticed the tail-off in responses as each week ticked by.
  • Don't post a requirements list, tell a story. The posting may well be the only contact you have with people. The posting needs to draw them in and compel them to make the next step and respond to your ad.



And, in the spirit of giving out a free lunch for Thanksgiving, a bonus tip:



  • Find more ways to get the word out. Use your blog, LinkedIn account or Facebook to let people in your community know that you are hiring. Reach out to people in as many ways as you can think of; the IT market is competitive again and you can't stand still waiting for people to come to you and expect to make great hires. Get out there and be great at recruiting, it's the most important thing you can do!




Friday, November 16, 2007

How to Improve qconf Productivity

By Ivo Janssen

qconf is without a doubt already a very powerful tool when it comes to administering your Grid Engine installation, but with a little shell foo, it can become even more powerful.



How often have you started to type 'qconf -mq ' before realizing you don't know the exact name of your queue, necessitating a quick 'qconf -sql' first. After Dan Templeton shared a very useful Grid Engine Cheat Sheet with me a few weeks ago (also see this announcement on gridengine.info), I realized that many commands share the same drawback.


Well, bash's autocompletion framework can be put to good use here. See, many people don't know that bash's autocompletion can complete just about anything, not just filenames.



You can download my qconf_completion.sh script from my website; simply dump it in e.g. $SGE_ROOT/util and add the following line to the end of your $SGE_ROOT/$SGE_CELL/common/settings.sh:


. $SGE_ROOT/util/qconf_completion.sh



Now you go from the unwieldy:

qconf -mq

qconf -mq^H^H

qconf -sql

qconf -mq thisodd.q



To:

qconf -mq t[TAB] and you're done.
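For the curious, here is a minimal sketch of how such a completion function can be wired up. This is not my actual qconf_completion.sh, just an illustration of the bash completion mechanism, and it only handles a few queue-related switches:

_qconf_queue_complete() {
  local cur prev
  cur="${COMP_WORDS[COMP_CWORD]}"
  prev="${COMP_WORDS[COMP_CWORD-1]}"
  case "$prev" in
    -mq|-sq|-dq)
      # Offer the current queue names, straight from 'qconf -sql'.
      COMPREPLY=( $(compgen -W "$(qconf -sql 2>/dev/null)" -- "$cur") )
      ;;
    *)
      COMPREPLY=()
      ;;
  esac
}
complete -F _qconf_queue_complete qconf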



So far I've only implemented most of the interactive options, and have not worked on autocompleting options such as -aattr or -mattr yet, although I see huge possibilities for productivity improvement there. By the way, this exercise was a great way to re-appreciate the intricate regularity that underlies the option set.



Enjoy, and leave a comment if you like the script or have suggestions for improvement.

Wednesday, November 7, 2007

Hookin' Up is Hard to Do

By Roderick Flores

Previously we discussed the tension that grid managers face when supporting various stakeholders on an enterprise grid.  In particular we concluded that providing isolated virtual operating environments to each of the business units operating in your environment would be the easiest way to meet their competing and divergent needs.  In this post we will explore the networking challenges that a grid of virtualized systems poses.



The primary challenge you face in this architecture is how to connect it all together.  At first glance it seems simple enough: take your current grid, install a hypervisor on each of its nodes, and then start implementing your users’ specific environments.  Sadly, this will probably not work.



In a typical grid you already have to consider the challenges of connecting several hundred compute nodes to one another and a storage network while keeping network latency low. 



In order to illustrate the networking problems you would have in a virtualized grid, consider a system with a significant number of nodes used by several operational units.  For example, imagine a large financial services company that provides banking, brokerage services, insurance, mortgage, and financing.  Each of these business lines, while related, has its own distinct set of business application workflows.  While there may be some overlap of the specific applications used by each of the units, there is little guarantee that each group will use those applications in the same way, let alone use the same versions. Worse yet, a business unit may have multiple operational workflows which do not operate in similar environments (e.g., Windows-specific versus Linux-specific application suites).  Finally, we grid managers would like to have development, test, and production instances segregated but running on the same hardware.



It is easy to project having to support at least ten times more virtual than physical operating environments.  The actual number should be proportional to the number of unique operating environments required by the users. In a standard grid you have a fixed set of computational resources that are reasonably static; in other words systems do not appear and disappear on a regular basis.  However in the virtualized grid, operating environments are going to appear and disappear as a function of the business workflows scheduled by your users.  You can imagine how quickly this can become complicated.



What is the best way to deliver these operating environments to the physical hardware?  If we keep all of the images on local disk then we need to guarantee that there is sufficient disk space on each node; a practice which not only can be costly but does not scale well.  If we instead keep, for each operating environment, no more images than the maximum number of nodes any of its applications will use, we can reduce the number of virtual machine images we require.  Of course this implies that these images are either stored on a SAN or are transported to the individual physical nodes before booting the virtualized environment.  Sadly, both of these approaches significantly increase network loads.  We will discuss scheduling and managing individual virtual machines in subsequent posts.



How do we connect these virtual environments? If these systems were on segregated physical hardware (think Microsoft Windows versus Linux) we would likely keep them on their own network and/or VLANs.  After all, these environments generally should not interact with one another.  Consequently, shouldn’t we also do this for the virtualized grid?  If we chose not to and instead used DHCP based upon physical topology to provide addresses to the virtualized environments, we could quickly run into trouble.  Specifically, a single job executed on n nodes could conceivably land on n distinct networks and/or VLANs.  This would significantly increase the size of the broadcast domain as well as require more work from your network switches.  Therefore it would add significant latency to all communications between the nodes. Clearly this is a poor choice unless you are always using most of your nodes for each job.



Thus my preferred solution is to segregate operational environments so that every physical node bridges traffic for several distinct networks over the same interface.  Addresses would be assigned based on virtual MAC addresses rather than physical location; as in the counter-example, this is because we will not be able to guarantee where on the physical network topology a particular job is scheduled.  In fact, we probably want to use VLAN tags on our packets so that our switches can operate more efficiently.  Additionally, if your grid nodes have secondary interfaces, all communication with the hypervisor should be segregated onto its own management network.
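As a hedged sketch of that per-node setup using the classic Linux VLAN and bridging tools (the interface names and VLAN IDs are illustrative):

vconfig add eth0 100        # tag VLAN 100 on the compute interface (creates eth0.100)
brctl addbr br100           # bridge carrying one business unit's virtual environment
brctl addif br100 eth0.100  # guests on this node attach to br100
vconfig add eth0 200        # a second, isolated environment over the same physical interface
brctl addbr br200
brctl addif br200 eth0.200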



If this has not scared you away from the concept of the virtualized grid (I hope it hasn’t), we will continue to explore other hurdles inherent in this architecture in future posts.

Monday, November 5, 2007

How to Decipher Grid Engine Statuses – Part II

By Sinisa Veseli

In Part I of this article I discussed the meanings of various queue states that one might see after invoking the Grid Engine qstat command. The list of possible job states is just as long as the list of queue states:



• d (deletion) — Indicates that a job has been deleted using qdel.



• r (running) — Indicates that a job is about to be executed or is already executing.



• R (restarted) — Indicates that the job was restarted. This state can be caused by a job migration or because of one of the reasons described in the -r section of the qsub man page.



• s (suspended) — Shows that an already running job has been suspended using qmod.



• S (suspended) — Shows that an already running job has been suspended because the queue that it belongs to has been suspended.



• t (transferring) — Indicates that a job is being transferred to the execution host and is about to start executing.



• T (threshold) — Shows that an already running job has been suspended because at least one suspend threshold of the corresponding queue was exceeded.



• w (waiting) — Indicates that the job is suspended pending the availability of a critical resource or specified condition.



• q (queued) — Indicates that the job has been queued.



• E (error) — Indicates that the job is in the error state. You can find the reason for this state using the qstat command with “-explain E” option.



• h (hold) — Indicates that the job is not eligible for execution due to a hold state assigned to it via the qhold command, the qalter command, or qsub -h. 



Just like with queue states, one also frequently encounters various combinations of the above job states.
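As a hedged illustration (the job IDs, names, and queue below are made up), combined states show up in the state column of qstat output like this:

job-ID  prior    name    user   state  submit/start at       queue          slots
    101 0.55500  sim.sh  alice  r      12/03/2007 10:15:02   all.q@node01   1
    102 0.55500  sim.sh  alice  hqw    12/03/2007 10:15:05                  1

Here job 101 is running, while job 102 is queued and waiting with a hold on it.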

Thursday, November 1, 2007

Grids, grids, grids: Which side of the pond wins?

By Scott Koranda

Dan Ciruli at West Coast Grid writes



Europe is years ahead of the US in terms of large grids...



Is Europe years ahead of the US?



Open questions that come to mind include:



  • What is a "large" grid?
  • What makes one region "ahead" of another?
  • What makes one region "years" ahead?
  • If one region is years ahead, what are the reasons for it?
  • What of other regions outside of Europe and the US?


Certainly the US and Europe both have some very large grids, so the question is: what was Dan taking into account when making his claim?

Tuesday, October 30, 2007

How to Decipher Grid Engine Statuses – Part I

By Sinisa Veseli

In all likelihood, most Grid Engine (GE) end users and administrators have at some point invoked the qstat command and found themselves wondering what some of the resulting queue and job status letters mean. While some of those letters are pretty intuitive (e.g., ‘E’ stands for error), some are not entirely trivial to decipher. Unfortunately, it does not seem to be very easy to find an explanation for these statuses. One usually has to resort to digging through the qstat man pages or through the various GE software manuals that one can find on the web. So, I’ve compiled below information about the possible queue statuses:



• a (alarm) – At least one of the load thresholds defined in the load_thresholds list of the queue configuration is currently exceeded. This state prevents GE from scheduling further jobs to that queue. You can find the reason for the alarm state using the qstat command with “-explain a” option.



• A (Alarm) – At least one of the suspend thresholds of the queue is currently exceeded. This state causes jobs running in that queue to be successively suspended until no threshold is violated. You can see the reason for this state using the qstat command with “-explain A” option.



• c (configuration ambiguous) – The queue instance configuration (specified in GE configuration files) is ambiguous. The state resolves when the configuration becomes unambiguous again. This state prevents you from scheduling further jobs to that queue instance. You can find detailed reasons why a queue instance entered this state in the sge_qmaster messages file, or by using the qstat command with “-explain c” option. For queue instances in this state, the cluster queue's default settings are used for the ambiguous attribute.



• C (Calendar suspended) – The queue has been suspended automatically using the GE calendar facility.



• d (disabled) – Queues are disabled and released using the qmod command. Disabling a queue prevents new jobs from being scheduled for execution in that queue, but it does not affect jobs that are already running there.



• D (Disabled) – The queue has been disabled automatically using the GE calendar facility.



• E (Error) – The queue is in the error state. You can find the reason for this state using the qstat command with the “-explain E” option.  Check the corresponding execution daemon's (sge_execd) error log for information on how to resolve the problem, and clear the queue error state afterwards using the qmod command with the -cq option.



• o (orphaned) – The current cluster queue configuration and host group configuration no longer need this queue instance. The queue instance is kept because unfinished jobs are still associated with it. The orphaned state prevents you from scheduling further jobs to that queue instance. It disappears from qstat output when these jobs finish. To clear the jobs still associated with an orphaned queue instance, use the qdel command. You can revive an orphaned queue instance by changing the cluster queue configuration so that the configuration covers that queue instance.



• s (suspended) – Queues are suspended and un-suspended using the qmod command. Suspending a queue suspends all jobs executing in that queue.



• S (Subordinate) – The queue has been suspended due to subordination to another queue. When a queue is suspended, regardless of the cause, all jobs executing in that queue are suspended too.



• u (unknown) – The corresponding GE execution daemon (sge_execd) cannot be contacted.
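As a hedged illustration of the diagnostic commands mentioned above (the queue instance name is made up):

qstat -f -explain a -q all.q@node01   # show why the queue instance is in the alarm state
qstat -f -explain E -q all.q@node01   # show why it is in the error state
qmod -cq all.q@node01                 # clear the error state once the cause is fixed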



I hope that those who are new to Grid Engine find the above descriptions useful. In Part II of this article I will cover possible job statuses.

Friday, October 26, 2007

Dream Big, Dream Grid

By Ivo Janssen

Last time we talked about two similar yet different benefits of using grids. Today we will expand on that list with other benefits you might not have yet thought about. Just to be clear, we’re purely talking about technical benefits here; the business benefits are left for a whole other column.



Let’s first review what we found last time. The obvious benefits revolve around speedup of your parallel applications and higher throughput of your batch jobs. A typical example of the former is a crash simulation with PAM-CRASH and MPI; a typical example of the latter is doing virtual high-throughput screening with applications such as LigandFit from Accelrys, where many potential drug targets are screened against a single protein target. But there are other less obvious use-cases for grid that can benefit you.



Imagine running a simulation that has many tweakable parameters that you’ve always set to a pre-set value. When you now move your computations to a grid, you might not need to get your results back any faster, so you could opt to increase the accuracy of your computation by running the same simulation with different parameter sweeps on different nodes. Further expansion of your grid will suddenly increase the validity and accuracy of your results, rather than decrease runtime. An example of such a computation can be found in the Oil and Gas industry, where a more refined and accurate computational model of an oil field can prevent costly dry holes.



One could assert that Monte Carlo simulations are in fact also "accuracy-increasing" applications of grid, but there are two subtle differences. First, Monte Carlo simulations usually run on a much more massive scale, with thousands of very short simulations, whereas parameter sweep modeling typically utilizes larger models on a limited (fewer than a hundred) number of iterations. Second, typical Monte Carlo simulations only end once a pre-set resolution has been achieved, regardless of the number of grid nodes at your disposal. As such, it is better to categorize Monte Carlo simulations in the "throughput" category.



Once you understand these three basic benefits (speed-up, throughput and accuracy), there’s really no limit to what your imagination can come up with in terms of new applications of grid. Take the LigandFit example that I mentioned earlier. United Devices' recently retired grid.org looked at the throughput use-case and took it to the extreme by simply taking a protein crucial to the internal workings of cancer cells and running every single possible potential drug target in the library against that protein. It took a leap of imagination to dream up six years of running billions of drug targets against multiple proteins.



The most rewarding moment during a consulting engagement is when I see that users "get" the basic use-cases and start dreaming big. Can you dream big?  What can the grid do for you?

Monday, October 22, 2007

No CPU Left Behind

By Borja Sotomayor


For some time now, I've been really interested in the potential applications of grid computing in higher education and, possibly, in secondary education. So, I was really intrigued when I read about Google and IBM's computing cloud for students. Just looking at the headline, my first impression was that students anywhere would be able to have their own computing cloud to use as a playground for learning and experimentation. As it turns out, Google and IBM's computing cloud will be initially used by only five universities, with the goal of giving students a platform in which to learn about parallel programming and Internet-scale applications. Although still a very cool project, I thought this would be a good opportunity to share some ideas of how grid computing could end up benefiting education. Like fellow gridguru Tim Freeman, I'm a part of the Globus Virtual Workspaces project, so my ideas are biased towards how grid computing and workspaces could benefit education.



I have talked with many Computer Science and Engineering lecturers and professors at small colleges and universities who cannot teach certain courses for lack of computing resources. For example, while teaching an introductory programming course requires minimal computing resources (such as a computer lab), teaching a course on parallel programming or distributed systems may require more expensive resources. To get students to practice parallel programming in a somewhat realistic setting, you would like them to have access to a properly configured and maintained cluster. If, furthermore, you wanted to teach students how to set up a cluster, you would need a couple of clusters (ideally, one cluster per student) that the students could have unfettered access to.



There are two main issues with the above scenario. First of all, clusters aren't generally cheap, and some institutions can't afford one. Of course, you can easily build a cluster out of commodity hardware, but you also need someone to actually set it up and jiggle the handle whenever something goes awry. In one specific case, a department built a cluster with off-the-shelf PCs, and used it successfully... until the grad student charged with keeping the cluster running graduated. Apparently, that cluster has been sitting idly in a room for years now. Second, even if the institution can afford a cluster and a sysadmin, no sysadmin in his right mind is going to give root access to that cluster to undergrads, especially if that cluster is also used by researchers.



Enter virtual workspaces. In a nutshell, a virtual workspace is an execution environment that you can dynamically and securely deploy on the grid with exactly the hardware and software you need. You need a 32-node dual CPU Linux cluster for a couple of hours to teach a parallel programming lab, with a very specific version of libfoobar installed on it? Just request a workspace for it, and that hardware will be allocated somewhere on the grid for you, and the software will be set up thanks to software contextualization, which Tim will discuss in his posts. There's no need for the institution to keep a cluster running 24/7, or even spend any time configuring a cluster (requiring a sysadmin, or burdening the lecturer or a grad student with this task). From a repository of ready-made workspaces, simply choose the one you want (or pay a one-time fee to have someone configure a workspace exactly the way you want it), deploy it on the grid every Monday from 2pm to 4pm, and start teaching.



Unfortunately, we're not quite there yet, but virtual workspaces are being actively researched (yes, right now, even as you read this blog post!). Currently, virtual machines are the most promising vehicle to automagically stand up these custom execution environments on a grid. The Globus Virtual Workspaces Service, which uses the Xen VMM to instantiate workspaces, is still in a Technology Preview phase, so although you can do a number of very cool things with it, you can't deploy arbitrary workspaces on arbitrary grids... yet. However, we're getting much closer, and in future blog posts I'll explain what progress we're making towards that goal.



When we do get there, I believe that workspaces stand to make really exciting contributions to Computer Science and Engineering education. Not only can they facilitate access to computational resources by underprivileged institutions, they can also enhance existing curriculums by enabling students to gain more practical experience than before (e.g., by giving each student their own cluster). In fact, workspaces will enable the creation of more complex "playgrounds", from virtual clusters to virtual grids, that students can use to learn and experiment.

Tuesday, October 16, 2007

Does your grid make Fords or Volvos?

By Ivo Janssen

Ask a user why they use a grid, a cluster, or any other type of distributed system and you’ll hear, “Why, to get my work done faster, of course.” But that’s an ambiguous statement at best, since it can mean two things: faster runtimes or higher throughput. And although they might seem similar, they’re really not.



Runtime is defined as the wallclock time it takes to complete one task. If you parallelize a task, for instance with MPI, or by taking advantage of the data splitting capabilities of Grid MP, you can get your job back in less time. If you can parallelize your job into 10 parallel sub-jobs and run it on 10 nodes, you can expect that job to complete on average in 1/10th of the time. Plus a bit of overhead of course, but let’s keep it simple for now. In Volvo’s innovative Uddevalla plant, groups of workers assemble entire automobiles in less time than it takes for one worker to complete a whole car. So with 10 workers in a group, you could potentially make a car in 1/10th of the time.



However, sometimes your task cannot be parallelized any further, but you might have lots of them pending. Grids can still help since they can increase the throughput of your jobs. Queuing theory states that with 10 nodes and 10 jobs, a job will roll off the grid on average every 1/10th of the runtime of a single job, without using any parallelism. In a traditional American automotive plant, the car advances on the assembly line and at no point is more than one operator working on one car, so there’s no parallelism involved. It might take up to a day before one car is completed from start to finish, but a new car rolls off the end of the line every few minutes.



So the next time a user brags about his fancy new cluster, ask him whether he’s producing Fords or Volvos.

Thursday, October 11, 2007

Virtual Grid Nodes: The Tension

By Roderick Flores

Lately I have been putting a lot of thought into the challenges that grid managers face in building an enterprise grid.  Primarily they must support the various stakeholders throughout the enterprise, each of whom has their own sets of application workflows used to meet their business needs. 



The software packages that each interested group uses may have a significant overlap with one another, but the similarity stops there.  Because each group ostensibly has a different goal, the usage patterns are almost guaranteed to be unique.  This implies that the community as a whole will demand any of the following:



  • A wide range of operating systems including Linux, Microsoft Windows, or any of the varied flavors of Unix;
  • Support for multiple versions of the same software package; and
  • A wide range of operating environments particularly with respect to memory, CPU performance, network usage, and storage.


When you consider users’ needs in more detail, you will recognize that a number of implications further complicate things:



  • The set of applications that users wish to run will likely run under two or more different major OS revisions (e.g. Linux kernel 2.4 versus 2.6 or Windows XP versus Vista);
  • Similarly, there are applications that steadfastly refuse to run under a specific patch level.  For example, a minor revision of the Linux kernel that is lacking a specific security patch might be required.  You might be able to force the software to install but then the software is likely to no longer be supported;
  • Off-the-shelf installations which seek to upgrade rather than coexist with a previous version;
  • Custom software that expects a very specific behavior from a package that has changed in its most recent update;
  • Software which requires particular kernel tuning which is not appropriate for general operation; and
  • Software packages which have 32/64-bit library compatibility issues.


Meanwhile, grid managers will most likely be focused on providing a stable, secure, and easy-to-maintain infrastructure that is both cost-effective and capable of meeting the users' core requirements.  Clearly the priorities of the individual groups and the support team will be at odds much of the time.



The most elegant solution to these issues is to build a grid whose execution environments are all virtualized.  In this situation, each usage pattern would have its own environment tailored to its own unique needs while the core OS would be under the complete control of the infrastructure staff.  Clearly there would be a stakeholder driven set of virtual servers available for use on each node in the grid. 
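

To make that concrete, here is a minimal sketch of what one stakeholder-specific execution environment might look like as a Xen guest configuration; every name and number below is an invented example, not a recommendation:

# Hypothetical Xen domU config for one group's tailored execution environment.
name   = "risk-analytics-node01"          # one virtual node per usage pattern
memory = 4096                             # the memory profile this group's jobs need
vcpus  = 2
kernel = "/boot/vmlinuz-2.6.9-risk"       # the specific kernel revision their code requires
disk   = ['file:/srv/xen/images/risk-analytics.img,xvda,w']
vif    = ['bridge=xenbr0']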



It seems simple enough: rather than creating a complicated infrastructure that will not accommodate all of the situations your users will require, you simply give them their own isolated operating environments.  As you might expect, nothing is that straightforward.  The standard tools that you use for grid and virtualization management do not work well in this architecture.



In future posts, we will explore the challenges and possible solutions in detail. In particular we will focus on:



  • Networking
  • Virtual Server Management
  • Job Scheduling
  • Performance Monitoring
  • Security
  • Data Lifecycle

Friday, October 5, 2007

Scripting Grid Engine Administrative Tasks Made Simple

By Sinisa Veseli

Grid Engine (GE) is becoming an increasingly popular choice for distributed resource management. Although it comes with a GUI that can be used for various administrative and configuration tasks, the fact that all of those tasks can be scripted is very appealing. The GE Scripting HOWTO document already contains a few examples to get one started, but I wanted to further illustrate the usefulness of this GE feature with a simple example of a utility that modifies the shell start mode for all queues in the system:


#!/bin/sh

# Utility to modify the shell start mode for all GE queues.
# Usage: modify_shell_start_mode.sh <mode>
#   <mode> can be one of unix_behavior, posix_compliant or script_from_stdin

# Temporary config file.
tmpFile=/tmp/sge_q.$$

# Get new mode.
newMode=$1

# Modify all known queues.
for q in `qconf -sql`; do
    # Prepare the modified queue configuration.
    echo "Modifying queue: $q"
    cmd="qconf -sq $q | sed 's?shell_start_mode.*?shell_start_mode $newMode?' > $tmpFile"
    eval $cmd

    # Reconfigure the queue from the prepared file.
    qconf -Mq $tmpFile

    # Cleanup.
    rm -f $tmpFile
done



Using the above script one can quickly change the shell start mode for all queues without having to go through the manual configuration steps.
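

For example, switching every queue to unix_behavior is a one-liner, and the change can be spot-checked afterwards (all.q here is simply the default queue name on a stock installation):

./modify_shell_start_mode.sh unix_behavior
qconf -sq all.q | grep shell_start_mode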


The basic approach of 1) preparing a new configuration file by modifying the current object configuration, and 2) reconfiguring GE using the prepared file, works for a wide variety of tasks. There are cases, however, in which the desired object does not exist and has to be added. Those cases can be handled by modifying the EDITOR environment variable and invoking the appropriate qconf command. For example, here is a simple script that creates a set of new queues from the command line:


#!/bin/sh

# Utility to add new queues automatically.
# Usage: add_queue.sh <queue1> [<queue2> ...]

# Force non-interactive mode: qconf hands its template configuration to
# $EDITOR, and /bin/cat simply accepts it unchanged.
EDITOR=/bin/cat; export EDITOR

# Get new queue names.
newQueues=$@

# Add new queues.
for q in $newQueues; do
    echo "Adding queue: $q"
    qconf -aq $q
done



Utilities like the ones shown here get written once and usually quickly become indispensable tools for experienced GE administrators.

Friday, September 28, 2007

The Grid and Hosting

By Rich Wellner

Jeremy Sherwood from opus:interactive has a good write up of HostingCon 2007.



My experience with grid computing goes back to the late 1990s with distributed.net, helping make encryption that much more secure. The technology was originally designed to harness unused CPU cycles to solve complex problems; now it is being used to host an effectively unlimited number of hosting environments. The level of reliability and the scalability options available with the system are amazing. The ability to grow resources on the fly, with little to no exposure to change, is outstanding. The other great aspect of this technology is the ability to contribute to a sustainable mindset. Done properly, old servers and hardware that would normally be recycled at the end of their life cycle can instead be reprovisioned back into a production environment with little concern about the impact of hardware failure. This rejuvenation of hardware opens up a great opportunity to get that much more out of your initial investment, as well as to pass those savings on to the customer.

Three Reasons Why High Utilization Rates Are the Wrong Thing to Measure

By Rich Wellner

Attending the working sessions at various conferences, I hear a theme over and over again: "How can grid computing help us meet our goal of 80% utilization?" People post graphs showing how they went from 20% utilization to 50% and finally 80%. People celebrate hitting this number as if it were an axiom that the 80% utilized cluster is the well-managed cluster. This is the wrong goal.



The way to illustrate this is to ask: how does 80% utilization bring a new drug to market more quickly? How does 80% create a new chip? How does 80% get financial results or insurance calculations done more quickly?



Of course, it does none of those things. 80% isn't even a measure of IT efficiency, though most people use it as such. It's only a statistic that deals with a cluster itself. It is, however, measurable, so it's easy to stand up as an objective that the organization can meet. The question to ask is, does an 80% target actually hurt the business of the company?



That target has three problems:

  • It takes the focus off the business problem the clusters are solving
  • Most people choose the wrong target (80%, rather than 50%)
  • We would fire a CFO who only measured costs, so why are we willing to measure only costs here?



If your clusters are running at 80%, that means you have a lot of periods when work is queued up and waiting. Think about the utilization pattern of your cluster. Almost every cluster out there is in one of two patterns. They are busy starting at nine in the morning when people start running work, and the queue empties overnight. Or they are busy starting at three in the afternoon, when people have finished thinking about what they need to run overnight, and the queue empties the next morning.
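

You don't need anything fancy to see your own pattern. A tiny script run from cron every few minutes will chart the backlog over the course of a day (the script name and log path here are my own placeholders):

#!/bin/sh
# log_backlog.sh: append a timestamped count of pending Grid Engine jobs.
# Run it from cron every 10 minutes and plot the result to see when the
# queue backs up and when it drains.
pending=`qstat -s p | grep -c '^ *[0-9]'`
echo "`date '+%Y-%m-%d %H:%M'` $pending" >> /var/log/sge_backlog.log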



During the times when the queues are backed up, you are losing time. Those waiting jobs represent people who are waiting: scientists who aren't making progress, portfolio analysts who are trailing the competition and semiconductor designers who are spending time managing workflow instead of designing new hardware.



For most businesses it's queue time and latency that matter more than utilization rates. Latency is the time that your most expensive resources, your scientists, designers, engineers, economists and other researchers, spend waiting for results from the system. Data centers are expensive. Don't get me wrong, I'm not arguing that it's time to start throwing money at clusters without consideration. It's just that understanding the way the business operates is critical to determining what the budget should be. Is the incremental cost of another 100 or 1000 nodes really more than the cost of delaying the results your business needs to remain viable?



Don't be willing to be the manager that measures what is convenient rather than what is valuable to the future of your business. Be 'savvy' in your approach. Find ways to understand the behavior of your drug discovery processes on your clusters, even if you are an IT guy instead of a computational chemist. Find ways to demonstrate how your approach of reducing cluster latency is turning up the heat on the next chip design. Find ways to measure what keeps your business around so that you can be part of the process of creating value instead of viewed by that CFO as nothing more than a cost center to be optimized away.



The message is that cost is only one part of the equation. Likely, it's even a minor part of the equation. Don't get yourself lost measuring the price of your stationery when it's the invoices you're putting in the envelopes that matter.

Thursday, September 27, 2007

Warning: Don't Patronize Your Users

By Rich Wellner

One of my favorite quotes is from E.B. White:



No one can write decently who is distrustful of the reader's intelligence, or whose attitude is patronizing



Pawel Plaszczak and I certainly took this sort of goal seriously when we wrote our Savvy Manager's Guide. You should take it seriously when you design your grid.



The single biggest mistake people make is not trusting their users to provide reasonable requirements. Designers and architects go out and talk to users, then write off the feedback they get as general guidance rather than hard requirements.



Google, as an example, took their users seriously from day one. They could have created yet another site so littered with ads that it was unreadable, but instead created a user experience that is now the subject of design classes. You can do the same. Talk to your users. Spend a day understanding how they interact with their system. Get a bit deeper into the business issues that justify the IT expenses that feed your children and pay your mortgage.



Take your users seriously, feel their pain and be their hero.

Wednesday, September 26, 2007

Stop Wasting Time

By Rich Wellner

Seth wrote:

The traffic engineers in New York think nothing of wasting two minutes of each person's time as they approach a gated toll booth. Multiply that two minutes times 12,000 people and it's a lot of hours every day, isn't it?



The truth in this is obvious, and it applies to the grid also. I've talked with folks that have hundreds of engineers each spending a third of their time managing jobs and data on their clusters. That's a lot of time wasted that could have been spent advancing their business. Even on an expense basis alone, that's $3M in labor costs. And that's on the low end.
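

To spell out the arithmetic with illustrative numbers of my own choosing: 100 engineers, each burning a third of a roughly $90k fully loaded year on shepherding jobs and data, works out to about 100 x 1/3 x $90k, or $3M a year, and that's before counting the delayed results those engineers were hired to produce.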



There were significant wins in moving from SMP boxes to clusters. There were significant wins in moving from clusters to grids. Now it's time to realize the next win by managing your grid effectively.

Tuesday, September 25, 2007

Why Are 92% Of Users Waiting for I/O?

By Rich Wellner

IDC recently published a finding that 92% of users have applications that are I/O constrained. That's a shockingly large number given the options that exist for reducing this pain. Let's break this down into three major categories:



  • Working storage
  • Near-line storage
  • Long-term storage


The nature of each of these domains is very different and the options available to reduce the problems are similarly different.



Working Storage



Administrators unfamiliar with the demands of certain classes of applications will sometimes mount their scratch disk as (oh, the horror!) NFS. Over the past five years the average knowledge level has crept up as the community has grown and gained experience, but I have seen in the last couple of months that there are still clusters out there that are doing significant I/O to slow network scratch.
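

A thirty-second check is worth running on any cluster you inherit; the mount points below are just common conventions, so substitute your own:

# If this prints anything, your scratch space is living on NFS.
mount | grep -E '/scratch|/tmp' | grep -i nfs

# The usual fix is a fast local disk in /etc/fstab, for example:
# /dev/sdb1   /scratch   ext3   defaults,noatime   0  2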



Near-line Storage



Near-line storage typically consists of disk, NAS or SAN devices. People have a tendency not to buy enough bandwidth (either in the form of network I/O or aggregate disk I/O), or they view the purchase of these facilities as a one-time expense, failing to keep up with their users' expanding demand as time progresses.



Long-term Storage



Migrating data to tape is the only game in town for long-term, high-capacity storage (so far the decade-old promise of using disk for long-term storage still seems to be a decade away). The problem is that with drives and automated libraries costing enormous amounts of money, coupled with the latencies inherent in this type of storage, applications are left tapping their feet waiting for data to stream in.



The Solution



The services in the grid must be programmed to be aware of the data they require. An early example of this is the DDM system in incubation in dev.globus. This system knows what data is located in different resources on the grid and can thus be integrated with workflow and scheduling systems to pre-stage data to working storage before the application is started. This completely eliminates two of the three I/O constraints, and the last one is the easiest and cheapest. Just stop mounting your scratch on NFS...
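

Even without a full data management service, the pre-staging idea is easy to prototype with GridFTP; the host names and paths below are invented for illustration:

# Pull the input data set onto fast local working storage before the
# compute job is ever submitted, using parallel GridFTP streams.
globus-url-copy -p 4 \
    gsiftp://archive.example.org/datasets/run42.tar \
    file:///scratch/run42/run42.tar

# Only then hand the job to the scheduler; it reads from local scratch.
qsub run42.sh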

Saturday, September 22, 2007

Why You Should Never Rip and Replace

By Rich Wellner


At Univa UD we deal with a variety of different customers. These folks are trying to solve business problems in the semiconductor, life science, financial services, big science and lots of other sectors. What they have in common is that they have existing infrastructure and are hyper-concerned about business disruption while moving in a new direction. They should be.



There are a lot of ways to approach grid computing that require you to replace what you have with something new. This is particularly the case for vendors of proprietary tools. These tools are built on proprietary protocols that make it difficult to integrate other services or applications. Combine these two issues and it can be tough to get anything bigger than a cluster up and running. If you already have a cluster, or more, up and running, this disruption will have a real impact on your ability to accomplish your goals.



To borrow an old saying, you want your approach to be evolutionary rather than revolutionary. This means moving in a new direction using a phased delivery that allows existing work or research to continue without interruption.



With Globus, this is achieved by creating an additional layer atop existing resources. A common security platform is built on local security layers. A common job submission mechanism replaces product specific ones. A monitoring system that can aggregate information from multiple sources replaces those that only report data from their specific resource.
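

As a small sketch of what that common submission layer buys you, the same command can target two clusters running two different local schedulers, assuming the corresponding GRAM job managers are configured (the host names are placeholders):

# The Globus layer hides whether the back end is Grid Engine or PBS.
globus-job-run cluster-a.example.com/jobmanager-sge /bin/hostname
globus-job-run cluster-b.example.com/jobmanager-pbs /bin/hostname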



With these steps in place, new users, applications and clusters can be provisioned in ways that allow flexible cluster usage, better aggregate throughput and higher cluster utilization rates. Then, as time permits, existing applications -- and particularly scripts and workflows -- can be ported from their existing platform to interfaces that will allow them to utilize all the bandwidth available in the organization. Dig?

Sunday, July 29, 2007

Why the grid is still important

By Rich Wellner

Grid computing is celebrating 11 years next month, and is poised to
become increasingly mainstream in the coming years.  There are a number
of reasons that this is true, and most of them are the time tested
ideas that have been proving themselves in your research institutions
and businesses for years.  The grid is about allowing your organization
to run more efficiently and more effectively than can be done with more
conventional technology solutions.  It's about bringing many machines
together in coordination around a task.  It's about bringing data
storage and movement to bear in a coordinated fashion with your
application.  It's about allowing people from different parts of your
organization to work together more easily.