Monday, March 17, 2008

All Jobs Are Not Created Equal

By Sinisa Veseli

Choosing distributed resource management (DRM) software is not a simple task. There are a number of open source and commercial packages available, and companies usually go through a product evaluation phase in which they consider factors like license and support costs, maintenance issues, their own use cases, and existing or planned infrastructure. After following this (possibly lengthy) procedure, making the decision, and purchasing and installing the product, you should also make sure that the DRM software configuration fits your cluster usage and needs. In particular, designing an appropriate queue structure and configuring resources, resource management, and scheduling policies are among the most important aspects of your cluster configuration.

At first glance, devoting your company's resources to something like queue design might seem unnecessary. After all, how can one go wrong with the usual "short", "medium", and "long" queues? However, the bigger your organization is and the more diverse your users' computing needs are, the more likely it is that you would benefit from investing some time in designing and implementing your queues more carefully.

My favorite example involves high-priority jobs that must be completed in a relatively short period of time, regardless of how busy the cluster is. Such jobs must be allowed to preempt computing resources from lower-priority jobs that are already running. Better DRMs usually allow for such a use case (e.g., by configuring "preemptive scheduling" in LSF, or by using "subordinate queues" in Grid Engine), but this is clearly something that has to be thought through well before it can be implemented.
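
To make this concrete, here is a rough sketch of the Grid Engine flavor of such a setup. The queue names high.q and low.q are invented for illustration, and the exact attribute values should be checked against your Grid Engine release:

  # Hypothetical sketch (queue names are made up): make low.q subordinate
  # to high.q, so that low.q jobs on a host are suspended as soon as high.q
  # occupies at least one slot there.
  #
  # In the high.q configuration (qconf -mq high.q), set:
  #
  #   subordinate_list    low.q=1

In LSF the analogous setting lives in the queue definition (the PREEMPTION parameter in lsb.queues). In either system, deciding which jobs may be suspended, and at what threshold, deserves more thought than the configuration syntax itself.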

In any case, when configuring DRM software, it is important to keep in mind that not all jobs (or not all users, for that matter) are created equal...

Tuesday, March 11, 2008

All of Your Data in One Basket

By Roderick Flores

I once worked with someone who wrote programs that sent all of their output to a single file. Once such a program was put into the grid environment, it would routinely create files that were hundreds of gigabytes in size. Nobody considered this a problem because the space was available, and the SAN not only supported files of that size but also performed amazingly well considering the expectations. While this simplifies the code and data management, there are a number of reasons why it is not a good practice.



  • You don't always need all of the output data at once. With one monolithic file, moving just the piece you need from the grid to your desktop for testing is not even an option.
  • The amount of computation time needed to recreate a huge file is significant.
  • There is no easy way to use multiple threads or processes to write and read the data in parallel.
  • Moving huge files across the network takes a lot more time.
  • A file can only be opened in read-write mode by one process at a time. One large file is going to block a lot more modification operations than several smaller files.
  • Backing the file up is remarkably more difficult. You cannot just burn it to a DVD, so it has to be sent to disk or to tape. If you need to restore it, that too can take a significant amount of time.
  • The file is going to be severely fragmented on the physical drives, which increases seek times.
  • You can no longer use memory-mapped files.
  • Performing a checksum on a huge file takes forever; with many smaller files you can verify each piece individually (see the sketch after this list).
  • Finally, if you had properly distributed the job across the Grid, you should not have such large files in the first place!!!
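
As a concrete illustration of the alternative, here is a minimal Python sketch of a job writing its output as many modest part files, each with its own checksum sidecar. The function names, the part size, and the assumption that records arrive as byte strings are invented for this example rather than taken from the program described above:

import hashlib
from pathlib import Path

# Hypothetical part size; tune it to what your SAN, network, and backup
# media handle comfortably.
RECORDS_PER_PART = 1_000_000

def write_parts(records, out_dir="output"):
    """Write job output as many modest part files instead of one huge file.

    Each part gets its own SHA-256 sidecar, so individual pieces can be
    verified, copied, or restored without touching the rest of the output.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    part, buffer = 0, []
    for count, record in enumerate(records, start=1):
        buffer.append(record)
        if count % RECORDS_PER_PART == 0:
            _flush(out_dir / f"part-{part:05d}.dat", buffer)
            part, buffer = part + 1, []
    if buffer:  # whatever is left over at the end of the job
        _flush(out_dir / f"part-{part:05d}.dat", buffer)

def _flush(path, records):
    data = b"".join(records)  # records are assumed to be byte strings
    path.write_bytes(data)
    digest = hashlib.sha256(data).hexdigest()
    path.with_suffix(".sha256").write_text(f"{digest}  {path.name}\n")

Each part can then be moved to a desktop for testing, verified, or restored on its own, and a failed task only has to regenerate its own small piece of the output.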


Why would anybody do such a thing?  All your data are belong to us?

Wednesday, March 5, 2008

Four Reasons to Attend the Open Source Grid and Cluster Conference

By Rich Wellner

We're combining the best of GlobusWorld, the Grid Engine Workshop, and Rocks-a-Palooza into one killer event in Oakland this May. Here's why you should come to the Open Source Grid and Cluster Conference:



  • Great Speakers: We're going to have the rock stars of the grid world speaking and teaching.
  • Great Topics: Dedicated tracks for each of the communities being hosted.
  • Community Interaction: The grid community is spread all over the world; this will be a meeting place where you can get face time with the people you know by name only.
  • You Can Speak: We're currently accepting agenda submissions for 90-minute panels and sessions.

This should be a fantastic conference. I look forward to meeting you there.

Monday, March 3, 2008

Grid vs Clouds? Who can tell the difference?

By Sinisa Veseli

The term "cloud computing" seems to be attracting a lot of attention these days. If you google it, you'll find more than half a million results, starting with Wikipedia definitions and news involving companies like Google, IBM, and Amazon. There is definitely no shortage of blogs and articles on the subject. While reading some of them, I stumbled upon an excellent post by John Willis, in which he shares what he learned while researching the "clouds".



One interesting point in John's article that caught my eye was his view of virtualization as the main distinguishing feature of "clouds" with respect to the "old Grid Computing" paradigm ("Virtualization is the secret sauce of a cloud."). While I do not disagree that virtualization software like Xen or VMware is an important part of today's commercial "cloud" offerings, I also cannot help noticing that various aspects of virtualization have been part of grid projects from their beginnings. For example, SAMGrid, one of the first data grid projects, has served (and still serves!) several of Fermilab's High Energy Physics experiments since the late 1990s. It allows users to process data stored at multiple sites around the world without requiring them to know where the data will come from or how it will be delivered to their jobs. In a sense, from a physicist's perspective, the experiment data comes out of a "data cloud". As another example, the "Virtual Workspaces Service" has been part of the Globus Toolkit (as an incubator project) for some time now. It allows an authorized grid client to deploy an environment described by workspace metadata onto a specified resource, and the environments that can be deployed this way range from an atomic workspace to an entire cluster.



Although I disagree with John's view of the differences between the "old grid" and the "new cloud", I still highly recommend the article mentioned above, as well as his other posts on the same subject.