Friday, December 21, 2007

Grid Truths

By Rich Wellner

Thanks to Brand Autopsy, I'm reminded of Google's Ten Things statement. Some of the things they consider to be core to their success are also core to making grid computing work.



Focus on the user and all else will follow



For Google this means a clean interface, fast response and honest results. Globus Toolkit and Cluster Express follow this principle really well.



In the case of the Globus Toolkit, the focus has long been on allowing users to gain access to a wide variety of resources in a user-centric model. The philosophy is that a user should get a single sign-on and be able to use it to move data, get monitoring information, submit jobs and delegate authority to other entities on the grid.



For Cluster Express the idea is to take the Globus Toolkit and make it dead simple to install, while combining it with some of the best open source tools around, like Grid Engine and Ganglia. These are the tools users and administrators need to operate a cluster. The result has been a thousand downloads in the first few weeks this user-focused package has been available.



It's best to do one thing really, really well



Grid computing is about managing resources effectively. That's it. We're not about making hardware, finding oil reserves, curing cancer or projecting financial markets. Because of that application agnosticism, grid computing works in all those domains and hundreds more.



Democracy on the web works



Grid computing, especially in the open source world, works via standards and many people working together on many different pieces of a complete solution. Scientific users like those at Argonne National Lab and Fermilab collaborate on data storage and movement standards. The best ideas really do win, and the results are solutions that scale to billions of files and transfer rates at theoretical maximums even on multi-gigabit links.



You don't need to be at your desk to need an answer.



With previous generations of cluster management tools, an admin or user had to be at their own computer to understand in any detail what state their jobs were in. The grid not only lets them access resources across multiple computers, it also lets them manage their workload from web portals, freeing them from any particular desk.



The need for information crosses all borders.



One of the largest stumbling blocks users used to face in HPC was getting access to the data they needed. Tools like Reliable File Transfer Service, Replica Location Service and GridFTP allow information to be scheduled and moved on a global basis.



You can be serious without a suit.



Amen.



Great just isn't good enough.



Grid computing is 11 years old now and has been helping to do great science for most of those years. But the developers and users keep pushing the envelope and finding new and better ways to get more done.



Expect that to continue next year.

Monday, December 17, 2007

Top Four Things Cisco Learned Working on Open MPI

By Jeff Squyres

This entry was written by guest blogger Jeff Squyres from Cisco Systems. I met him at SC07 when I attended his Open MPI presentation in the Mellanox booth. He did a great job, much better than most of the presentations at tech conferences, and agreed to share with our readers some of his thoughts on how big companies can work effectively in an open source project.



The general idea of my talk is to help answer the question "Why is Cisco contributing to open source in HPC?" Indeed, much of Cisco's code is closed source. Remember that our crown jewels are the various flavors of IOS (the operating system that powers Cisco Ethernet routers); many people are initially puzzled as to why Cisco is involved in open source projects in HPC.



The short/obvious answer is: it helps us.



Cisco is a company that needs to make money and has a responsibility to its stockholders. We sell products in the HPC space and therefore need a rock-solid, high-performance MPI that works well on our networks. Many customers demand an open source solution, so it is in our best interests to help provide one rather than partially or wholly rely on someone else to provide one. In particular, some of these interests include (but are not limited to):



  • Having engineers at Cisco who can provide direct support to our customers who use open source products
  • Being able to participate in the process and direction of open source projects that are important to us (vs. being an outsider)
  • Leveraging the development and QA resources of both our partners and competitors -- effectively having our efforts magnified by the open source community (and vice versa)
  • Shortening the time between research and productization; working directly with our academic partners to turn today's whacky ideas into tomorrow's common technology


Think of it this way: only certain parties can mass-produce high quality hardware for HPC (i.e., vendors). But *many* people can help produce high quality software -- not just vendors. In the context of this talk, customers (including research and academic customers) have the expertise and capability to *directly* contribute to the software that runs on our hardware. HPC history has proven this point. We'd therefore be foolish to *not* engage HPC-smart customers, researchers, academics, partners, competitors, ...anyone who has HPC expertise to help make our products better. I certainly cannot speak for others, but I suspect that this rationale is similar to why other vendors participate in HPC open source as well.



Let's not forget that participation in HPC open source helps everyone -- including growing the overall size of the HPC market. Here's one example: inter-vendor collaboration, standardization, and better interoperability mean happy customers. And happy customers lead to more [happy] customers.



We have learned many things while participating in large open source projects. Below are a few of the nuggets of wisdom that we have earned (and learned). In hindsight, some are obvious, but some are not:

  • Open source is not "free" -- someone has to pay. By spreading the costs among many organizations, we can all get a much better rate of return on our investment.
  • Consensus is good among the members of an open source community (e.g., some members are only participating out of good will), but not always possible. Conflict resolution measures are necessary (and sometimes critical).
  • Just because a project is open source does not guarantee that it is high quality. Those who are interested in a particular part of a project (especially large, complex projects where no single member knows or cares about every aspect of the code base) need to look after it and ensure its quality over time.
  • Differences are good. The entire first year of the Open MPI project was a struggle because the members came from different backgrounds, biases, and held different core fundamentals to be true. It took a long time to realize that exactly these differences are what make a good open source project strong. Heterogeneity is good; differences of opinion are good. They lead to discussion and resolution down to hard, technical facts (vs. religion or "it's true because I've always thought it was true"), which results in better code.



True open source collaboration is like a marriage: it takes work. A lot of hard, hard work. Disagreements occur, mistakes happen, and misunderstandings are inevitable (particularly when not everyone speaks the same native language). But when it all clicks together, the overall result is really, really great. It makes all the hard work well worth it.

Friday, December 14, 2007

Ten Years of Distributed Computing with distributed.net

By Ivo Janssen

Grids come in many shapes and forms, and one of them is the Global Public Grid. Often presented in a philanthropic wrapper, these grids harness the power of thousands, if not hundreds of thousands, of computers, often residential PCs. Among the most well-known to the general public are Seti@Home, Folding@Home, IBM's World Community Grid and United Devices' now-retired Cancer Research project. Apart from running the Cancer Research project as an employee of United Devices and my involvement with the World Community Grid as a vendor to IBM, I have been involved in a somewhat lesser-known but much longer-running public grid project.



One of the longest-running public grid projects is distributed.net, a project that I have been part of since its inception in early 1997. Earlier this year, we celebrated 10 years of "crunching", contributing to various projects in such fields as cryptology and mathematics. It may be a lesser-known project in the eyes of the larger public, but it has still generated a lot of participation amongst computer enthusiasts, and it has even won a few awards, most notably CIO Magazine's recognition of Jeff Lawson, my coworker and one of the founders of distributed.net, as the most notable person in IT for 1997.



Back in 1997, distributed computing was a very novel concept, but it got jumpstarted by RSA's "RC5-32 Secret Key Challenge", which set out to prove that 56-bit RC5 was no longer a secure algorithm due to the increasing speed of computers. In early 1999, distributed.net also proved that DES, another 56-bit algorithm, was getting weak by brute-forcing a secret message in 22 hours, 15 minutes and 4 seconds.



Over the years, distributed.net has undertaken 3 RC5 projects, 2 OGR projects and 3 DES projects, utilizing over 300,000 participants running on 23 different hardware and software platforms, and it's still going strong. In October 2007, various staff members of our global team came to Austin, Texas for a "Code-a-thon", working on the statistics back end to provide our "members" with better individual stats, and on a new project that we're planning to roll out in the next couple of months.



So if you have any spare cycles on your home computers, why not give distributed.net a try?

Wednesday, December 12, 2007

Proper Testing Environments

By Roderick Flores

It continues to amaze me how many businesses do not have tiered development environments. Moreover, many of these same companies maintain a very sophisticated production environment with strict change management procedures. Yet somehow they feel it is acceptable to keep inadequate staging environments.



However, we know better: a truly supportive development environment, in the parlance of the agile-development community, must contain a series of independent infrastructures, each serving a specific risk-reduction purpose. The idea is that by the time a release reaches the production environment, all of its desired functionality should have been proven operational. A typical set of tiers might include:



  • Development – an uncontrolled environment.
  • Integration – a loosely controlled environment where the individual pieces of a product are brought together.
  • Quality Assurance – a tightly controlled environment that mirrors production as closely as possible.
  • Production – where your customers access the final product.



Check out Scott Ambler's diagram of a supportive environment on Dr. Dobb's to see a logical organization of this concept.



So what happens if you cut corners along the way?   I am sure you all know what I am talking about.  Here are a couple of my past favorites (in non-grid environments):




Situation:  You combine quality assurance with the integration and/or the development environments.


Result: New releases for your key products inexplicably fail in production (despite having passed QA testing) because your developers made changes to the operating environment for their latest product.




Situation: You test a load balanced n-tier product on a small (two to three machines) QA environment.


Result: The application exhibits infrequent but unacceptable data loss because updates from one system are overwritten by those from another. This is particularly onerous to uncover because the application does not fail in the QA environment.



Presumably, we grid managers do all that we can to provide test frameworks adequate to avoid problems such as these. There are many texts that discuss best practices for supportive development environments. Unfortunately, I have found that many of us forget one of our core lessons: everything becomes much more complicated at grid scale. Consequently, we are perfectly willing to use a QA environment scoped for a small cluster.





In particular, many of us prefer to limit our QA environments to a few computation nodes. Thus we choose to run our load tests on our production infrastructure. Conceptually, this makes sense: we cannot realistically maintain the same number of nodes in QA as are in production, so why keep a significant number when we will end up running some of our tests out there anyway? Sadly, this approach severely complicates performance measurement.



For example, assume that my test plans dictate that I run tests ranging from one to sixty-four nodes for a particular application.  If I run this in production, I am essentially getting random loads on the SAN, network, and even the individual servers to which I am assigned.  Consequently I have to run each individual test from the plan repeatedly until I am certain that I have a statistically significant sample of grid states.  Yet I have only defined my capacity on the grid for the average of the utilization rates during my testing.  Any changes to capacity on the grid such as a change in usage patterns or the addition of resources will invalidate my results.
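
To make "statistically significant sample" concrete, here is a minimal sketch of the kind of harness I mean; the application name, node count, and number of repetitions are all placeholders, and the MPI launcher is just an example:

  # Repeat the same 64-node run enough times to smooth out background load
  # (application, node count, and repetition count are placeholders).
  for i in $(seq 1 20); do
      /usr/bin/time -f "%e" -a -o run_times.txt \
          mpirun -np 64 ./my_app input.dat > /dev/null
  done

  # Report the mean and standard deviation of the wall-clock times.
  awk '{ s += $1; ss += $1 * $1; n++ }
       END { m = s / n;
             printf "mean=%.1fs stddev=%.1fs over %d runs\n",
                    m, sqrt(ss / n - m * m), n }' run_times.txt

If the spread stays large relative to the mean no matter how many times you repeat the run, that is the background load talking, which is exactly the argument for a segregated environment.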



Clearly, I need to run the application on a segregated infrastructure to get proper theoretical performance estimates. The segregated infrastructure, like any QA environment, should match production as closely as possible. However, in order to eliminate external factors that seriously affect performance, it is imperative that you use isolated network equipment as well as storage. Another advantage of this approach is that we reduce the risk of impacting production capacity with a runaway job. Similarly, testing in production takes a large number of runs to produce numbers that average out current load factors and thus approach theory, which obviously impacts the grid users' productivity.



As we noted earlier, we cannot justify a QA environment that is anything more than a fraction of production. However, I am certain that eight nodes are not enough. Certainly QA should contain enough nodes to adequately model the speed-up that your business proponents are looking for in their typical application sets. It would not hurt to do some capacity planning at this point. In the absence of that, thirty-two computation nodes is the minimum size I would use for a grid that is expected to contain several hundred nodes.



Finally, once we have a reasonable understanding of the theoretical capabilities of the application, then we should re-run the performance tests under production loads.  This will help us understand the lost productivity of our applications under load.  In turn this could help justify the expense of additional resources even if utilization rates cannot. 



I know you are asking, "How do I justify the expense of a large QA environment?" Well, just think about the time you will save during your next major change to your operating systems and how you have to test ALL of the production applications affected before you migrate that change into production. Would you prefer to do this on a few nodes, take several out of production, or just get it done on your properly sized test environment?

Tuesday, December 4, 2007

What You Need to Know About Cluster Express 3.0

By Ivo Janssen

At Supercomputing 2007, Univa UD launched Cluster Express 3.0 beta. If you were at SC'07, you might have attended one of my demos on Cluster Express, but if you missed it, then this blog post is for you. I will cut through the marketing speak for you and, as an engineer who worked on the CE3.0 release, tell you what Cluster Express can mean to you. 



Cluster Express is designed to be your one-stop-shop for a full cluster software stack. This means we bundle the scheduler, the security framework, the cluster monitoring and an easy installer that will configure everything out of the box. On top of that, the whole solution is open sourced, including all the code that Univa UD contributed to the stack. You can go to our new community at www.grid.org and download the CE3.0 beta and its sources right now.



So let's go through all the components in more detail.



Installer
Our installer is a very simple utility that will ask you fewer than five questions, after which it will go off and install the main nodes, the execution nodes, and any remote login nodes. It will then tie all these nodes together through a bootstrap service that is installed on the main node. This lets all the other nodes retrieve configuration information from the main node. The end result is that a fully configured cluster emerges, with sensible default configuration for the Grid Engine scheduler and the Ganglia monitoring, and all the certificates for security and authentication set up properly.



Scheduler
We bundle Grid Engine, and the installer configures all the nodes in such a way that after running the installer on an execution node, that node becomes part of the cluster automatically, including sensible defaults for queues, communication, and scheduling settings.
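
As a quick illustration of what users see once the scheduler is up (this is plain Grid Engine usage, not anything specific to Cluster Express, and the script contents are placeholders), a minimal job script looks like this:

  #!/bin/sh
  # hello.sh -- minimal Grid Engine job script (illustrative only)
  #$ -N hello_test     # job name
  #$ -cwd              # run from the directory the job was submitted from
  #$ -j y              # merge stderr into stdout
  echo "Running on $(hostname)"
  sleep 60

Submitting it is then just "qsub hello.sh", and "qstat" shows the job move from qw (queued) to r (running) on whichever execution node Grid Engine picks.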



Monitoring
We bundle, install, configure, and use various cluster monitoring tools such as Ganglia and ARCo and tie everything together in a custom Monitoring UI that we wrote and delivered as part of the CE3.0 release. The Monitoring UI is not a third-party bundled tool but really a new add-on to our solution. It brings together the system-level statistics that Ganglia offers with the job-level statistics that ARCo logs from Grid Engine. By presenting them together in one UI, you can cross-reference jobs with the nodes they ran on and the load on those hosts. This will allow you, for instance, to instantly see the impact of running a job or task on a certain node, in real time and through an easy-to-use graphical UI.



Security
We bundle and pre-configure many Globus Toolkit components such as MyProxy, Auto-CA, RFT, WS-GRAM, GridFTP and GSI-OpenSSH. Auto-CA and MyProxy are completely configured out of the box, so that the only thing you need to do is a simple myproxy-logon to acquire a token that is valid for use with all the other Globus commands such as globus-url-copy or globusrun-ws. The level of integration that we accomplished for all the GT components will definitely impress you, especially if you've been a Globus user before.
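
If you have not used these tools before, a typical session looks roughly like the sketch below; the host name, user name, and paths are placeholders, and the exact defaults depend on how the installer configured MyProxy and Auto-CA:

  # Obtain a short-lived proxy credential from the MyProxy server
  # (host and user name are placeholders).
  $ myproxy-logon -s headnode.example.com -l alice

  # Use the proxy for a GridFTP transfer...
  $ globus-url-copy file:///home/alice/input.dat \
        gsiftp://headnode.example.com/data/input.dat

  # ...or to run a command through WS-GRAM.
  $ globusrun-ws -submit -F headnode.example.com -c /bin/hostname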



Putting it all together
As mentioned, bundling all of the above components in a tarball with an easy-to-use installer now makes setting up a fully featured cluster as simple as downloading one file and running one command. This is really as easy as we could make it! And on top of that, everything is open source, including our own add-ons such as the installer and configuration scripts, and the Monitoring UI.



I hope that I can welcome you soon to our new community website for Cluster Express at www.grid.org. You can download the CE3.0 tarball there, and participate in forums, add to our wiki, or get support through our mailing lists.



I'm user "Leto" on grid.org; please don't hesitate to send me a private message there if you need any help at all.

Monday, December 3, 2007

How to Enable Rescheduling of Grid Engine Jobs after Machine Failures

By Sinisa Veseli

Checkpointing is one of the most useful features that Grid Engine (GE) offers. Because the status of checkpointed jobs is periodically saved to disk, those jobs can be restarted from the last checkpoint if they do not finish for some reason (e.g., due to a system crash). In this way, any possible loss of processing for long-running jobs is limited to a few minutes, as opposed to hours or even days.



When learning about Grid Engine checkpointing I found the corresponding HowTo to be extremely useful. However, this document does not contain all the details necessary to enable checkpointed job rescheduling after machine failure. If you'd like to enable that feature, you should do the following:



1) Configure your checkpointing environment using the “qconf -mckpt” command (use “qconf -ackpt” for adding a new environment), and make sure that the environment’s “when” parameter includes the letter ‘r’ (for “reschedule”). Alternatively, if you are using the “qmon” GUI, make sure that the “Reschedule Job” box is checked in the checkpoint object dialog box.
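
For reference, a checkpoint environment (as printed by “qconf -sckpt”) looks roughly like the sketch below; the name and command paths are made up, and the detail that matters here is the ‘r’ in the “when” line:

  $ qconf -sckpt my_ckpt
  ckpt_name          my_ckpt
  interface          userdefined
  ckpt_command       /opt/ckpt/checkpoint.sh
  migr_command       /opt/ckpt/migrate.sh
  restart_command    none
  clean_command      /opt/ckpt/clean.sh
  ckpt_dir           /var/spool/sge/ckpt
  signal             none
  when               sxr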



2) Use “qconf -mconf” command (or the “qmon” GUI) to edit the global cluster configuration and set the “reschedule_unknown” parameter to a non-zero time. This parameter determines whether jobs on hosts in unknown state are rescheduled and thus sent to other hosts. The special (default) value of 00:00:00 means that jobs will not be rescheduled from the host on which they were originally running.
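
For example, to require that a host be unreachable for ten minutes (the value is only illustrative) before its jobs are rescheduled, the relevant line in the global configuration would read:

  $ qconf -mconf
  ...
  reschedule_unknown   00:10:00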



3) Rescheduling is only initiated for jobs that have activated the rerun flag. Therefore, you must make sure that checkpointed jobs are submitted with the “-r y” option of the “qsub” command, in addition to the “-ckpt <ckpt_env_name>” option.
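
Putting the submission side together, a checkpointed, reschedulable job submission looks like this (the checkpoint environment name and job script are placeholders):

  $ qsub -r y -ckpt my_ckpt long_job.sh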



Note that jobs that are not using checkpointing will be rescheduled only if they are running in queues that have the “rerun” option set to true, in addition to being submitted with the “-r y” option. Parallel jobs are only rescheduled if the host on which their master task executes gets into an unknown state.
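
For completeness, the queue-level “rerun” default can be changed by editing the queue configuration; “all.q” is just the usual default queue name:

  $ qconf -mq all.q
  ...
  rerun                 TRUE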

Saturday, December 1, 2007

Grid Engine 6.1u3 Release

By Sinisa Veseli

A few days ago the Grid Engine project released version 6.1 Update 3 of its software. This is a maintenance release (see the original announcement), so it does not yet contain the advance reservation features. However, it has quite a few interesting bugfixes. In particular, a few issues with qstat output have been fixed (mostly differences between plain text and XML output), and a couple of ARCo problems have been resolved. The fix for a bug affecting users with a very large primary group entry has made it into this release as well.



The new version of the software is available for download here.

Reservation Features Come to Grid Engine

By Sinisa Veseli

The next major update release of the Grid Engine software will contain advance reservation (AR) features (see the original announcement). This functionality will allow users or administrators to manipulate reservations of specific resources for future use. More specifically, users will be able to request new ARs, delete existing ARs, and show granted ARs. The reserved resources will only be available to jobs that request the reservation as of the reservation start time.

In order to support the AR features, a new set of command line interfaces is being introduced (qrsub, qrdel and qrstat). Additionally, existing commands like qsub will be getting new switches, and the qmon GUI will be getting a new panel that will allow submitting, deleting, and listing AR requests. It is also worth noting that the default qstat output might change.
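
Since the release is not yet out, the exact syntax may still change, but based on the announcement a session could look roughly like this (all options and values below are illustrative):

  $ qrsub -d 02:00:00 -l arch=lx24-amd64   # request a two-hour reservation
  $ qrstat                                 # list granted advance reservations
  $ qrdel 42                               # delete the reservation with id 42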

If you are anxious to try it out, the latest Grid Engine 6.1 snapshot binaries containing the new AR features are available for download here. Note, however, that this snapshot (based on the 6.1u2 release) is not compatible with prior snapshots or versions, and that an upgrade procedure is currently not available.