Tuesday, November 27, 2007

Unleash the Monster: Distributed Virtual Resource Management

By Roderick Flores

Recently we explored the concept of a virtualized grid: a system where the computation environment resides on virtualized operating environments. This approach simplifies the support of the grid-user community's specialized needs. Further, we discussed the networking difficulties that arise from instantiating systems on the fly, including routing and the increased network load that distributing images to hypervisors would create. However, we have not yet discussed how these virtualized computational environments would come to exist for the users at the right time.



The dominant distributed resource management (DRM) products do not interact with hypervisors to create virtual machines (VMs). Two notable exceptions are Moab from Cluster Resources and GridMP from Univa UD. Moab supports virtualization on specific nodes using the node control command (mnodectl); however, VMs are not created on the available nodes as needed.



Consequently, grid users who wish to execute their jobs in a custom execution environment will have to follow this procedure (a sketch of the resulting wrapper script appears after the list):



  • Determine which nodes were provided by the DRM's scheduler.  If any of these nodes are running default VMs for other processes, those VMs may need to be modified or suspended in order to free up resources;
  • Create a set of virtual machines on the provided nodes;
  • Distribute the computation jobs to each of those machines once you are sure they have entered a usable state;
  • Monitor the computation jobs for completion; and
  • Finally, once you are certain the jobs are complete, tear down the VMs.  You may also be required to restore any VMs that existed before you started.
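
To make the amount of plumbing concrete, here is a rough sketch of the kind of wrapper script a user has to submit today. It is purely illustrative: it assumes a Grid Engine-style parallel environment (so the allocated nodes are listed in $PE_HOSTFILE), passwordless ssh to those nodes, VM hostnames of the form vm-<node>, and hypothetical boot_vm and teardown_vm helpers installed on each node.

#!/bin/bash
# Hypothetical wrapper job: boot VMs on the allocated nodes, run the real
# work inside them, then tear everything down again.
NODES=$(awk '{print $1}' "$PE_HOSTFILE")

# 1. Boot a VM on every node the scheduler handed us.
for node in $NODES; do
  ssh "$node" "boot_vm --image /images/custom-env.img" &
done
wait

# 2. (Poll here until every VM reports a usable state; omitted.)

# 3. Distribute the computation jobs to the VMs.
for node in $NODES; do
  ssh "vm-$node" "/opt/app/run_job --input /data/run1" &
done
wait

# 4. Tear down the VMs and restore any default VMs that were displaced.
for node in $NODES; do
  ssh "$node" "teardown_vm --restore-default" &
done
wait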


Sadly, the onus is on the user to guarantee that there are sufficient images for the number of requested nodes. The user is also required to notify the DRM of the resources that will be consumed during the computational process. If this is not done, additional processes could be started on the same nodes and resource contention could result.



In addition to these extra responsibilities, grid users also lose many of the advantages that resource managers typically offer. There is no efficient management of the VMs' resources beyond their single use: if a particular environment could be reused, that reuse must be managed by the user. Also, the DRM can only preempt the wrapper job that started the virtual machines (and, through them, the computational jobs); if that wrapper is preempted, neither the computational jobs nor the VMs are actually affected. If other jobs normally run on a default VM (which may have been suspended to free resources), further issues could arise. Finally, the user may lose some of the more sophisticated capabilities built into the resource manager, such as control over parallel environments.



All of these issues could be solved by tightly integrating the DRM with the dominant VM hypervisors (managers). The DRM should be able to start, shut down, suspend, and modify virtual environments on any of the nodes under its control. It should also be able to query the state of the physical machine and all of its operating VMs. Ideally, the industry and/or our community would come to a consensus on an interface that all hypervisors should expose to the DRM. If we put our minds to it, we could describe any number of useful features that a DRM could provide when integrated with virtual machine managers; these concepts simply need to be realized to make this architecture feasible.
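
As a rough illustration of the primitives such an interface needs to cover, here is what they look like today through libvirt's virsh command line, one existing hypervisor-management toolkit. This is a sketch of the operations rather than a proposal for the interface itself, and the domain name compute-vm-01 is just a placeholder:

# Query the physical machine and all of its operating VMs.
virsh nodeinfo                  # CPU and memory of the physical node
virsh list --all                # every defined domain and its current state

# Lifecycle operations the DRM would need to drive.
virsh start compute-vm-01       # boot a defined VM
virsh suspend compute-vm-01     # pause it in place
virsh resume compute-vm-01      # continue execution
virsh shutdown compute-vm-01    # graceful shutdown

# Modify a running VM's resource allocation.
virsh setvcpus compute-vm-01 4
virsh setmem compute-vm-01 2097152   # new memory size in kilobytes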



Here are my thoughts about what a resource manager in a virtualized environment might provide:



  • It could roll back an image to its initial state after a secure process has executed on it.
  • It could be aware of the resource limits of each VM so that it could most efficiently schedule multiple virtual machines per physical node.
  • It should distinguish between access-controlled VMs and public instances to which it may schedule any jobs.
  • It should stage the booting of VMs so that transferring operating system images does not flood the network.  A sophisticated DRM might even transport images to local storage before the node's primary resources are free.  Readers of the previous posts will recall that hypervisor interactions should be on a segregated network so as not to interfere with the computational traffic.
  • It could suspend VMs as an alternative to preempting jobs.  Similarly, it could suspend a VM, transport its image to another physical node, and restart it (a minimal sketch of this follows the list).  If the DRM managed output files as resources, it could prohibit other processes from writing to files still held open by the suspended systems.
  • It could run specialized servers for two-tier applications and modify a VM's resource allocation should it become resource constrained.
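
For example, the suspend-and-migrate idea above can be assembled from operations that hypervisor toolkits already expose. Here is a minimal sketch assuming libvirt's virsh on both nodes, ssh access between them, and placeholder names (node-a, node-b, compute-vm-01); note that the VM's disk image must also be reachable from the target node, for instance via shared storage:

# Save the VM's memory and execution state to a file; the VM stops running.
ssh node-a "virsh save compute-vm-01 /scratch/compute-vm-01.state"

# Move the saved state to the new physical node (copy the disk image too
# if it is not on shared storage).
scp node-a:/scratch/compute-vm-01.state node-b:/scratch/

# Resume the VM on the new node from the saved state.
ssh node-b "virsh restore /scratch/compute-vm-01.state"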


I am sure that other grid managers could improve on this list as well as extend it with other excellent ideas.



In summary, we have examined the flexibility that a grid with virtualized nodes provides. As clusters evolve from dedicated systems for a homogeneous user community into grids serving a diverse set of user requirements, I believe that grid managers will require the virtualized environments we have been exploring. Clearly the key to creating this capability is to integrate hypervisors into our resource managers; without that integration, VM management is simply too complicated for the scale we are targeting.



Thus far, nothing that we have explored helps us manage and describe the dynamic system that this framework requires (as I am sure you have noticed). Is this architecture a Frankenstein's monster that will turn on its creators? Next time we will explore how we might monitor and create reports for a system that changes from one moment to the next.

Monday, November 19, 2007

Five Ways to Improve Your Hiring Tactics

By Rich Wellner

The company I work for, Univa UD, is hiring, and I was sitting down with one of the managers to talk about approaches. Since long before I joined Univa UD, I've been very interested in recruiting, as I ran a few small companies and hired on an international basis. Recruiting is the single most important thing that we do. Everything else -- serving our customers, building insanely great products, profiting, or creating a fun workplace -- is the result of hiring well.



Talking with that hiring manager, we put together five essential tactics for managing the candidate acquisition process:



  • When using an online system, buy a multi-month plan. Even if you are feeling a cost crunch, it's unrealistic to believe that the right candidate will walk through the door during the first couple of weeks that you're engaged in a search. If you are staffing more than one position, this is even more true. We will be hiring for more than a month, so we're going to buy access for more than a month.
  • Spread the work across multiple hiring managers. Recruiting is work. This is important, so I'll say it again: recruiting is work. Treat it like the important work that it is and make sure the right people are involved in the process. If there are folks who are wordsmithing geniuses, get them involved in the production of the posts. If you have people who are brilliant at interviewing, make sure they are talking with candidates even if those candidates will report to another manager. Conversely, your company's success depends on everyone performing well. Be generous with your time and add value to the hiring processes of the other teams in your company.
  • Spend a couple hours reading sites like Copy Blogger. It has great tips on making your writing better and, let's face it, a help wanted ad is marketing. We need to stand out among the thousands of other companies if we want to attract the best people.
  • Update your ads at least once each week. This shouldn't be a rewrite, but each time you update your ad it pops back to the top of the search stack. This may seem like gaming the system, but it works. When I took a break from hiring this summer and stopped doing this, I immediately noticed responses tail off as each week ticked by.
  • Don't post a requirements list, tell a story. The posting may well be the only contact you have with people. The posting needs to draw them in and compel them to make the next step and respond to your ad.



And, in the spirit of giving out a free lunch for Thanksgiving, a bonus tip:



  • Find more ways to get the word out. Use your blog, LinkedIn account, or Facebook to let people in your community know that you are hiring. Reach out to people in as many ways as you can think of; the IT market is competitive again, and you can't stand still waiting for people to come to you and expect to make great hires. Get out there and be great at recruiting: it's the most important thing you can do!




Friday, November 16, 2007

How to Improve qconf Productivity

By Ivo Janssen

qconf is without a doubt already a very powerful tool when it comes to administering your Grid Engine installation, but with a little shell foo, it can become even more powerful.



How often have you started to type 'qconf -mq ' before realizing you don't know the exact name of your queue, necessitating a quick 'qconf -sql' first? After Dan Templeton shared a very useful Grid Engine Cheat Sheet with me a few weeks ago (also see this announcement on gridengine.info), I realized that many commands share the same drawback.



Well, bash's autocompletion framework can be put to good use here. See, many people don't know that bash's autocompletion can complete just about anything, not just filenames.
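
To give a flavor of how this works, here is a stripped-down sketch of a completion function covering just a couple of qconf options. Treat it as illustrative only; the real qconf_completion.sh is more thorough:

# Complete queue names after -mq/-sq/-dq and host names after -me/-se/-de.
_qconf_complete() {
  local cur prev
  cur="${COMP_WORDS[COMP_CWORD]}"
  prev="${COMP_WORDS[COMP_CWORD-1]}"
  case "$prev" in
    -mq|-sq|-dq)
      # Candidate queue names come straight from 'qconf -sql'.
      COMPREPLY=( $(compgen -W "$(qconf -sql 2>/dev/null)" -- "$cur") )
      ;;
    -me|-se|-de)
      # Candidate execution host names come from 'qconf -sel'.
      COMPREPLY=( $(compgen -W "$(qconf -sel 2>/dev/null)" -- "$cur") )
      ;;
    *)
      COMPREPLY=()
      ;;
  esac
}
complete -F _qconf_complete qconf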



You can download my qconf_completion.sh script from my website; simply dump it into, e.g., $SGE_ROOT/util and add the following line to the end of your $SGE_ROOT/$SGE_CELL/common/settings.sh:


. $SGE_ROOT/util/qconf_completion.sh



Now you go from the unwieldy:

qconf -mq

qconf -mq^H^H

qconf -sql

qconf -mq thisodd.q



To:

qconf -mq t[TAB] and you're done.



So far I've only implemented most of the interactive options, and have not worked on autocompleting options such as -aattr or -mattr yet, although I see huge possibilities for productivity improvement there. By the way, this exercise was a great way to re-appreciate the intricate regularity that underlies the option set.



Enjoy, and leave a comment if you like the script or have suggestions for improvement.

Wednesday, November 7, 2007

Hookin' Up is Hard to Do

By Roderick Flores

Previously we discussed the tension that grid managers face when supporting various stakeholders on an enterprise grid. In particular, we concluded that providing isolated virtual operating environments to each of the business units on the grid would be the easiest way to meet their competing and divergent needs. In this post we will explore the networking challenges that a grid of virtualized systems poses.



The primary challenge you face in this architecture is how to connect it all together. At first glance it seems simple enough: take your current grid, install a hypervisor on each of its nodes, and then start implementing your users' specific environments. Sadly, this will probably not work.



In a typical grid you already have to consider the challenges of connecting several hundred compute nodes to one another and to a storage network, all while keeping network latency low.



In order to illustrate the networking problems you would have in a virtualized grid, consider a system with a significant number of nodes used by several operational units. For example, imagine a large financial services company that provides banking, brokerage, insurance, mortgage, and financing services. Each of these business lines, while related, has its own distinct set of business application workflows. While there may be some overlap in the specific applications used by each of the units, there is little guarantee that each group will use those applications in the same way, let alone use the same versions. Worse yet, a business unit may have multiple operational workflows that do not operate in similar environments (e.g., Windows-specific versus Linux-specific application suites). Finally, we grid managers would like to have development, test, and production instances segregated but running on the same hardware.



It is easy to project having to support at least ten times more virtual than physical operating environments. The actual number should be proportional to the number of unique operating environments required by the users. In a standard grid you have a fixed set of computational resources that is reasonably static; in other words, systems do not appear and disappear on a regular basis. In the virtualized grid, however, operating environments are going to appear and disappear as a function of the business workflows scheduled by your users. You can imagine how quickly this can become complicated.



What is the best way to deliver these operating environments to the physical hardware? If we keep all of the images on local disk, then we need to guarantee that there is sufficient disk space on each node, a practice that is not only costly but also scales poorly. If we instead keep no more images of each operating environment than the maximum number of nodes supported by any of its applications, we can reduce the number of virtual machine images we require. Of course, this implies that the images are either stored on a SAN or transported to the individual physical nodes before the virtualized environment boots. Sadly, both of these approaches significantly increase network load. We will discuss scheduling and managing individual virtual machines in subsequent posts.



How do we connect these virtual environments? If these systems were on segregated physical hardware (think Microsoft Windows versus Linux), we would likely keep them on their own networks and/or VLANs. After all, these environments generally should not interact with one another. Consequently, shouldn't we also do this for the virtualized grid? If we choose not to and instead use DHCP based upon physical topology to assign addresses to the virtualized environments, we could quickly run into trouble. Specifically, a single job executed on n nodes could conceivably land on n distinct networks and/or VLANs. This would significantly increase the size of the broadcast domain as well as require more work from your network switches, thereby adding significant latency to all communications between the nodes. Clearly this is a poor choice unless you are always using most of your nodes for each job.



Thus my preferred solution is to segregate the operational environments so that every physical node bridges traffic for several distinct networks over the same interface. Addresses would be assigned based on virtual MAC address rather than physical location because, as in the counter-example, we cannot guarantee where on the physical network topology a particular job will be scheduled. In fact, we probably want to put VLAN tags on our packets so that our switches can operate more efficiently. Additionally, if your grid nodes have secondary interfaces, all communication with the hypervisor should be segregated onto its own management network.
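
As a concrete, if simplified, illustration of that node setup, tagging and bridging two VLANs over a single physical interface looks roughly like the following with the standard Linux 802.1q (vconfig) and bridge (brctl) tools. The VLAN IDs 110 and 120 stand in for two business units' networks:

# Create 802.1q sub-interfaces on the physical NIC for each VLAN.
vconfig add eth0 110
vconfig add eth0 120

# One bridge per virtual network; VM virtual NICs attach to these bridges.
brctl addbr br110
brctl addbr br120
brctl addif br110 eth0.110
brctl addif br120 eth0.120

# Bring the tagged interfaces and bridges up.
ifconfig eth0.110 up
ifconfig eth0.120 up
ifconfig br110 up
ifconfig br120 up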



If this has not scared you away from the concept of the virtualized grid (I hope it hasn't), we will continue to explore other hurdles inherent in this architecture in future posts.

Monday, November 5, 2007

How to Decipher Grid Engine Statuses – Part II

By Sinisa Veseli

In Part I of this article I discussed the meanings of the various queue states that one might see after invoking the Grid Engine qstat command. The list of possible job states is just as long as the list of queue states:



  • d (deletion) — Indicates that the job has been deleted using qdel.
  • r (running) — Indicates that the job is currently executing.
  • R (restarted) — Indicates that the job was restarted. This state can be caused by a job migration or by one of the reasons described in the -r section of the qsub man page.
  • s (suspended) — Shows that an already running job has been suspended using qmod.
  • S (suspended) — Shows that an already running job has been suspended because the queue it belongs to has been suspended.
  • t (transferring) — Indicates that the job is being transferred to the execution host and is about to be executed.
  • T (threshold) — Shows that at least one suspend threshold of the corresponding queue was exceeded and that the job has been suspended as a consequence.
  • w (waiting) — Indicates that the job is pending the availability of a critical resource or a specified condition.
  • q (queued) — Indicates that the job has been queued.
  • E (error) — Indicates that the job is in the error state. You can find the reason for this state by examining the job with "qstat -j <job_id>".
  • h (hold) — Indicates that the job is not eligible for execution due to a hold state assigned to it via the qhold, qalter, or qsub -h commands.



Just like with queue states, one also frequently encounters various combinations of the above job states.
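
A quick way to see these codes in practice (the job ID 12345 below is just a placeholder):

# List your own jobs; the "state" column holds the codes above, often in
# combination, e.g. "qw" (queued, waiting), "hqw" (held), or "Eqw" (error).
qstat -u "$USER"

# For a job stuck in an error state, qstat -j prints the error reason.
qstat -j 12345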

Thursday, November 1, 2007

Grids, grids, grids: Which side of the pond wins?

By Scott Koranda

Dan Ciruli at West Coast Grid writes:



Europe is years ahead of the US in terms of large grids...



Is Europe years ahead of the US?



Open questions that come to mind include:



  • What is a "large" grid?
  • What makes one region "ahead" of another?
  • What makes one region "years" ahead?
  • If one region is years ahead, what are the reasons for it?
  • What of other regions outside of Europe and the US?


Certainly the US and Europe both have some very large grids, so the question is: what was Dan taking into account when making his claim?