Saturday, August 24, 2013

SGE Isn't For Dummies (I sort of wish it were)

Kendall Square used to have the ultimate geek book store, Quantum Books.  No fiction or graphic novels there; it was all technical books.  One could browse every O'Reilly book and many, many others.

Quantum was one of many independent bookstores plowed over (or under?) by Amazon and the Internet selling revolution.  There's a high-end bar in that space now called Mead Hall, whose menu includes a variety of fermented honeys.  Nowadays, even Barnes & Noble has a significantly attenuated technical book section, though it is still holding out.  And somewhere in there are the X for Dummies and Idiot's Guide to Y.

Now, I never liked either title.  But the publishers of these and other books were often good at maintaining a degree of brand consistency.  Perhaps the titles were ridiculous, but the Idiot's and Dummies series were a good way to get an overview of a new subject.  The standard icons could be a bit hokey, but the consistent warnings of pitfalls were worth the book's price on their own.  Perhaps I'd read one through once, but they helped orient me.  O'Reilly's Learning Z would be a slower but more detailed read, and an O'Reilly Cookbook might be a desk-side companion for solving specific problems.  Some other series I learned to avoid; I forget which ones, but several just grated on me or had too low an information density.

Nowadays, I tend to learn new things, or find solutions to immediate problems, online, which has its pros and cons.  On the pro side, there tends to be coverage of almost anything, and googling an error message often (but not quite always) takes you to a page that explains the circumstances of that error.  On the minus side, finding those initial overview guides can be challenging, and there is no consistency from one writer's tome to another's.  

Warp has a nice cluster which drives my analyses, and many of them are controlled by Sun Grid Engine (SGE, rarely known as Oracle Grid Engine or OGE).  Now, there are other packages out there for this and I won't make any claim that I arrived at SGE after a careful comparison.  The system came with SGE and a few key tools (such as Celera Assembler and the PacBio toolkits) inherently support it, so that drove the usage.  Later, StarCluster supported it out-of-the-box, so I got some more practice. 

SGE is great when it works.  It manages jobs across the cluster, distributing loads and making sure no machine is hammered by too many simultaneous jobs.  As far as running jobs goes, I know enough to mostly get by.  I can configure single-processor and many multi-processor jobs.  I can also track my jobs' statuses in the queues.  One thing I don't really understand is how to correctly configure an OpenMPI job (namely, the wonderful assembler Ray) across multiple nodes, so within SGE I always run it on a single node.  There are also ways to make one job stay queued until another finishes, but I haven't figured that out either.

Setting up SGE is murkier water.  At least once something glitched on StarCluster in adding new nodes, and I had to get the new nodes into SGE.  With a lot of help from Google, I did succeed.  The original Warp cluster had some issues, and with a lot of help from the cluster fabricator I got some stuck nodes unstuck.  Then came the new cluster in the new space (the old cluster was shared with another group; people tend to develop an aversion to sharing compute resources with me after a few of my misadventures, and so I was split off in the new space).  The new cluster also required some going-under-the-hood to get SGE running, but all was well.

BUT, this weekend something has gone wrong, and I haven't yet figured it out.  A few of the old tricks have failed to make the system hum, and more weirdly some jobs are firing off fine and others are stuck in permanent limbo, with no painfully obvious distinction between them.

Now, this is a reminder that my knowledge of SGE is really pretty slim, and isn't very consolidated.  It's more a bunch of tricks I know, a few of which I understand and a lot of which are more follow-the-cookbook.  There's also a nagging fear that there are very important things I don't know entirely.  So a good book would be pretty useful right now.  It probably wouldn't solve my problem, but it might well prepare me better to understand my problem and go find the answer to it.

A little problem there.  Try searching Amazon under books for Sun Grid Engine.  For SGE.  For OGE. For Oracle Grid Engine.  There are no books of that sort.  No Grid Engine for Dummies.  No Idiot's Guide to Grid Engine.  No O'Reilly book with honeybees (or Giant Gerbils?). No Learn Grid Engine in 30 Days.  NADA!

I can rationalize that result relatively easily.  The publishers are only interested in the larger markets, particularly those that will attract persons who are less technically savvy or at least are not confident in their ability to pick things up.  SGE is apparently too much of a niche product in their view.  

So, more rounds of Google.  More flailing away.  Maybe some reboots - though a design feature of SGE is that rebooting doesn't reset certain error states; this is intended to prevent a seriously flawed node from becoming a black hole for jobs.  A bunch of calls to the cluster assembler company.  Maybe a paper book would just be a security blanket, but I'm feeling a need for one about now.

9 comments:

Unknown said...

If you've not yet worked out how to hold a job until another finishes, try this:

qsub -hold_jid <job_id>

Or if you use -N with a unique name:

qsub -hold_jid <job_name>

Or stick it in your script header.
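
A minimal sketch of that chaining, for anyone who wants to see the pieces together. It only prints the two qsub commands rather than submitting them, so it runs anywhere; the job names (align_step, assemble_step) and script names (align.sh, assemble.sh) are made up for illustration.

```shell
#!/bin/sh
# Dry-run sketch: print the qsub commands for two chained SGE jobs.
# Job and script names are hypothetical placeholders.

# Build a qsub command whose job stays held until another job finishes.
#   $1 = name (or numeric ID) of the job to wait for
#   $2 = name to give the new job (-N)
#   $3 = script to submit
held_qsub_cmd() {
    printf 'qsub -N %s -hold_jid %s %s' "$2" "$1" "$3"
}

first_cmd='qsub -N align_step align.sh'
second_cmd=$(held_qsub_cmd align_step assemble_step assemble.sh)

echo "$first_cmd"
echo "$second_cmd"
```

On a real cluster you would run the printed lines (for example by piping the output to sh). The script-header form would be a line such as `#$ -hold_jid align_step` near the top of assemble.sh.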

However configuring SGE is certainly a bit of a black art & rather painful. And the commands that query the error state and reset the error flags don't always seem to supply enough useful information.
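
For reference, a hedged sketch of the usual inspect-then-clear sequence. It assumes a stock SGE 6.x command set (qstat -j, qstat -f -explain E, qmod -cq, qmod -cj) and only prints the commands instead of running them; the job ID and queue instance are invented.

```shell
#!/bin/sh
# Dry-run sketch: print the SGE commands commonly used to inspect and clear error states.
# The job ID and queue instance passed in below are hypothetical.

# Commands that report why a job is pending and why queues are in error (E) state.
#   $1 = job ID
inspect_cmds() {
    printf 'qstat -j %s\n' "$1"
    printf 'qstat -f -explain E\n'
}

# Commands that clear the error flag on a queue instance and on a job.
#   $1 = queue instance (queue@host), $2 = job ID
clear_cmds() {
    printf 'qmod -cq %s\n' "$1"
    printf 'qmod -cj %s\n' "$2"
}

inspect_cmds 4242
clear_cmds all.q@node07 4242
```

Clearing a flag does not fix whatever set it, so the explain output is worth reading before running the qmod lines.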

adam said...

SGE documentation has been difficult to find, especially since Oracle started removing old sites and URLs. We posted some of our old training materials on our blog in case they might be useful.

http://bioteam.net/2009/09/sge-training-slides/

Anonymous said...

Isn't SGE essentially a "freebie?" If so, therein lies your problem. As my dad used to say, "generally, cheap things aren't good, and good things aren't cheap."

Keith Robison said...

While your father was wise to not always trust freebies, the open source movement has turned his adage on end: BLAST, HMMER, Samtools, BWA, Bowtie, Ray, R, Python, Perl, Chrome... -- my professional life is dominated by free software (and free databases)!

Anonymous said...

Keith...let's consider that "free" SGE scheduler you got from Oracle. Let's see...was the use of that cluster for running your programs which never finished and will end up needing to be re-launched free? Was/is the time free you are putting in setting this all up and then troubleshooting to find out why some things ran and others did not? Are your timelines affected by the project becoming a bit protracted due to these problems? Hmm...I suspect that the free SGE software is not so free. Maybe "free-ware" is acceptable for the academic environment but if I am working for a company where I need to get my results in the most timely manner my patience for things working sometimes and not working other times (for no apparent reason) will be VERY thin.

As for "open source" tools for your analytics I know I'll have a hard time on that one. But more often than not those tools were/are the result of some grad student or post-doc's efforts...they generally have poor documentation, need to have constant bug fixes, and are generally not well supported if/when that individual who wrote the program leaves for a real job. I don't think you are into de novo RNAseq, but just try getting Scripture to run on that cluster of yours...good luck with that one. And there are countless other examples of open source tools being problematic for one reason or another. I'm not in any way condemning open source tool use...just saying one needs to look at their use in a realistic perspective and not be dismissive of commercial alternatives. I believe we can agree to disagree on this...but maybe not.

Keith Robison said...

That's a completely valid point; there are huge opportunity costs to buggy software. There are also huge opportunity costs involved in evaluating software.

Luckily I resolved the current problem, but I'm still mystified why what I did worked. So the solution is either (a) find a real expert & retain them as a consultant or (b) find other software that supports cluster management AND modify the key tools to take advantage of it.

I agree there are a lot of poorly supported tools out there. Many are a grad student or post-doc's work that is no longer maintained once that student graduates. BUT, there is a lot of very good open source software out there that is maintained either by a dedicated author (MIRA is around 20 years old) or a community. Many of these programs simply do NOT have an equivalent commercial package, and commercial providers do not show a great deal of longevity.

Atop that, an awful lot of commercial bioinformatics offerings are simply wrappings of open source / academic code, with little if any value-add for most professionals, as the company is in most cases going to flip bug reports on the underlying tools back to the original authors.

Anonymous said...

I will agree with you that some of the commercial offerings are just re-wrapped public resources...but there are many which have been developed which rely on in-house proprietary algorithms. That becomes problematic for many bioinformaticians as they won't trust the results from something they cannot "peel back" and examine the guts before they even consider doing some form of validation. And then, who is to say what is right or wrong? I read the comments from Titus Brown yesterday in BioIT World about the results from the Assemblathon...oy vey! What do you believe? And all of that work was done with "open source" tools. So are those good? Are they bad? Who is right, who is wrong? Looks as if there might be just as much uncertainty in the results one gets from using open source tools as from use of a "proprietary" tool from a commercial provider. Of course, we have digressed from the theme of your original post...sorry for that. I am glad you got the problem resolved...as to how/why you got it resolved?...blame good old-fashioned intuition and be done with it...lol.

Adam Marko said...

Hey Keith,

Big fan of your blog. We have been using slurm for our needs, so maybe you could check that out. It is open source but you can buy commercial support, and it is in active development. I used to be an SGE fan but with the multiple forks floating around I'm not sure it is best for a corporate environment. I believe you could actually install both and migrate to slurm if you like it. The slurm mailing list is very active.

-Adam Marko (amarko@asuragen.com)

Unknown said...

Hello,

Found this blog quite by accident. It is true there are a couple of forks of Grid Engine out there, but in reality there is really only one open source one, Son of Grid Engine (https://arc.liv.ac.uk/trac/SGE), and one commercial one, Univa Grid Engine (http://www.univa.com). The others are getting quite old and are not updated or supported. SGE is alive and well and works very well in a corporate environment (disclaimer: I work for Univa and we have 300+ commercial customers running UGE). The reason I would suggest sticking with it rather than trying something else is that the tools in the bioinformatics space work quite well with SGE/UGE and 'not necessarily so well with SLURM'.