Tuesday, June 19, 2012

Out Damned Spot Instance! Out I say!

A piece of advice for anyone in the bioinformatics world: get a working knowledge of the Amazon EC2 cloud computing system.  Now, there is a lot of controversy over whether EC2 (or other cloud services) eliminates the need for a mongo local compute resource, but even if you like doing things at home there are multiple niches that Amazon can fill.  It can be your experiment sandbox, which can be quickly shut down if something goes haywire.  It can be overflow capacity, dealing with a sudden surge of compute need.  It can also be a reserve for when your main system is experiencing hardware issues.  Or it can be a neutral zone for collaborating with someone outside your organization's walls.  Or it can be where you do your consulting work (approved, of course!) that is independent of your organization (or between organizations).  Or anything else; it's there, and it's likely you could use it at some point, if not now then in the future.  Better yet, you can get your feet wet with their Free Tier of services, which of course is a hook to try to get you to consume more.

Don't be scared off by the wide array of different services Amazon offers.  You need to understand only a few to get started, and many are really intended for e-commerce sites and the like.  I've probably used fewer than half a dozen services from the menu, though I'm sure there are a few more I could use profitably.

One catch with EC2 is that you are going to do a lot of low level UNIX systems administration, something I've generally avoided in my career.  I've been able to because I usually have a few UNIX gurus close enough by to do all that, and besides, it's been in their job description and not mine!  The few times I have dabbled, the results have been mixed.  At Harvard I once burned a day getting a printer back on the network, but was compensated by that lab's PI with a gift certificate for yummy bread.  On the other hand, at one of my employers I succeeded in disabling my server, which could only be restored by re-installing the OS.  Again, one reason to consider Amazon for a sandbox!

What do I mean by low level?  Well, with the nice web GUI you fire up a machine.  Note that any disk attached to that machine by default is (a) too tiny for real work and (b) will go away when you kill the machine.  If you want big, persistent storage you need to create an "EBS Volume".  With the GUI you create the volume and then attach it to the machine, but at that point it is still useless.  Using low level UNIX commands you now need to format the drive, create a mount point, mount the drive and set the permissions.  If you want password-less SSH between nodes, that's a few more configuration file tweaks.  Not rocket science, but tedious to do time after time.
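
For the record, that leftover shell work looks something like the sketch below.  It assumes the fresh volume showed up as /dev/xvdf after attaching it in the console; the device name, mount point and owner are purely illustrative and will vary with your AMI and setup.

    # format the new volume (this wipes anything already on it)
    sudo mkfs -t ext4 /dev/xvdf
    # create a mount point and mount the volume there
    sudo mkdir -p /mnt/data
    sudo mount /dev/xvdf /mnt/data
    # hand ownership to the everyday working user
    sudo chown ubuntu:ubuntu /mnt/data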

A past colleague and friend of mine recently let me know about STAR::Cluster, and this free software is amazing.  It automates not only the UNIX toil and trouble I found tedious, but other low level stuff I hadn't gotten around to yet.  For example, every EBS volume in the cluster is NFS-mounted to all the nodes, which is critical for some operations (though other tools, such as MIRA, are positively allergic to such setups, as the extra IO traffic kills performance).  Plus, your cluster comes loaded with useful cluster tools such as OpenMPI and the Sun Grid Engine job queuing system.
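
To give a flavor of how little is left to do by hand, the heart of a STAR::Cluster setup is a short template in its ~/.starcluster/config file, roughly like the sketch below.  The names, sizes and IDs here are placeholders, not recommendations.

    # placeholder IDs -- substitute your own key, AMI and volume
    [cluster assembly]
    KEYNAME = mykey
    CLUSTER_SIZE = 4
    NODE_IMAGE_ID = ami-xxxxxxxx
    NODE_INSTANCE_TYPE = m2.4xlarge
    VOLUMES = reads

    # an existing EBS volume, NFS-shared to every node at /data
    [volume reads]
    VOLUME_ID = vol-xxxxxxxx
    MOUNT_PATH = /data

With that in place, "starcluster start assembly", "starcluster sshmaster assembly" and "starcluster terminate assembly" cover most of the day-to-day lifecycle.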

Each of those bundled tools is useful for bioinformatics.  For example, OpenMPI is the framework for the nifty Ray assembler.  Ray can handle your really big de novo assembly jobs, as it allows you to spread the job out across multiple nodes.  In contrast, tools such as Velvet are very limiting on Amazon because they can work only in the memory of a single machine, and the biggest machines at Amazon aren't very big (about 68 GB).  Celera Assembler can use the Grid Engine, which is pretty much essential with that assembler.  Furthermore, under Amazon's pricing model to get big memory you must rent a lot of cores, and for a single-core tool that's a bit of a waste.
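
As a purely illustrative example of the OpenMPI connection, a Ray run spread across the cluster boils down to a single MPI launch; the process count, k-mer size and file names below are invented for the sketch.

    # spread a Ray assembly across 64 MPI processes on the cluster
    mpiexec -n 64 Ray -k 31 -p reads_1.fastq reads_2.fastq -o ray_assembly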

So for now, I'm loving STAR::Cluster but forsaking spot clusters.  That is, until I figure out a way to divine the correct bidding strategy, which may require the services of a cauldron and some eye of newt.

STAR::Cluster has mostly behaved for me, but I have had a few hiccups in which nodes didn't quite come up as planned.  I don't know why, and in one case I reverted to doing the low level work to fix it (indeed, I finally learned how to NFS mount a volume).  In the other case, I couldn't figure out a solution and had to kill the damaged nodes.  Still, most of the time everything has gone smoothly.

However, STAR::Cluster also tempts you with spot instances, which have not been productive for me.  Amazon's pricing is a three-dimensional grid: where the machine is located, what its capabilities are, and which pricing scheme you choose.  On the where side, most times you probably just want cheap, which tends to mean one of the US sites (their Asian sites are noticeably more expensive, about 10%).  It is useful to stay in one location, as an EBS volume can be attached (and then mounted) only to a compute instance in the same zone.

As noted above, capability spans a number of machine classes.  I tend to go for two of them.  The 32-bit instances are cheap (about the cost of a newspaper per day) and useful for maintaining a permanent presence for uploading & downloading files, but are under-powered for much else.  At the other end, I tend to use the premium-priced high-memory quadruple extra large instances, because these offer the most compute power and memory among the standard instances, which tends to be what I need for the projects I'm offloading to Amazon, such as huge short read assembly or mapping efforts.  I haven't tried out the cluster compute instances yet, which are even pricier but may yield higher performance (faster networking and more horsepower), nor have I tried the GPU instances; both are likely in my future.

After these, Amazon offers three pricing schemes.  On-demand instances are simple to use: you fire one up and pay for each hour you use it; make it go away and the meter stops turning (rounding up to the next hour, of course).  If you are using the system heavily, then a reserved instance involves an upfront payment but a lower per-hour cost.  The catch is that you need a separate reservation for each instance you want to run simultaneously.  The third scheme is interesting but can easily scorch your fingers: spot instances.

A spot instance is charged at the current market rate for an instance of that type.  Much of the time, that's half the cost of an on-demand instance, and when you have a cluster of big instances running at $45/day per node, that's not trivial.  However, you put in a bid for the maximum price you are willing to pay, and should the spot price exceed that bid, your instance can die instantly with no warning.  You can browse the prior price history for an instance type in your selected zone and get some idea of what to bid, but so far I've been very unlucky: despite bidding well above the apparent previous price spikes, new spikes have bumped off my instances.
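
For what it's worth, STAR::Cluster itself will show you the recent spot market and place a bid for you; the instance type and the $0.50 ceiling below are purely illustrative, and as I've learned, the history is not a reliable guide.

    # peek at recent spot pricing for a big-memory instance type
    starcluster spothistory m2.4xlarge
    # launch the same cluster template on spot instances with a maximum bid
    starcluster start --bid 0.50 assembly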

The big problem for me is that none of my applications can tolerate croaking in mid-operation.  Apparently Grid Engine offers a way to cope with such interruptions, and apparently Ray can work off Grid Engine and Celera Assembler can probably be restarted automatically, but I'm not yet to the point of understanding how to do these.  So having a cluster die late in a process is an expensive disaster, with the clock completely reset.  After multiple misadventures I've sworn off spot instances for now, which is probably costing the company significant dollars, but now I'm not losing sleep -- and those aborted runs weren't free.

So STAR::Cluster lets you boil your data without a lot of toil and trouble.


7 comments:

sebhtml said...

StarCluster sure looks handy!

Ray supports checkpointing (save/load your game) for unstable clusters, you may want to try that.

Adam Retchless said...

Thanks for the tips. I've been using AWS a bit but have not gotten myself over the initial learning hump. Two items may be of interest here:

1) Amazon offers AWS grants for researchers.
2) Users can share their machine images (with pre-installed software). For instance, here's one from Eric Hammond with BLAST and NMMer installed.
https://aws.amazon.com/amis/bioinformatics-image-based-on-ubuntu-7-10-gutsy-base-install-64-bit

I haven't tried anyone else's image yet, but it could be useful. I could imagine building "pipeline" images for all of the bioinformatic pipelines that have been published.

Keith Robison said...

Adam: thanks for pointing out the images, which I had left out. There are a number of interesting images out there for bioinformatics, and this is also how you can save a favorite configuration. For STAR::Cluster, you would need to build atop their image.

I think images may be tied to a region, but I haven't played with that.

Anonymous said...

For use within bioinformatics/genomics, have you seen CloudMan as an alternative to StarCluster? It does the whole cluster management thing but also comes preconfigured with Galaxy and CloudBioLinux and thus numerous bioinformatics tools. It also allows you to customize an instance to add any tools or data that might not already be there...

Anonymous said...

We just went through a heavy AWS usage period and will likely never use it again. I/O (and network) is a huge bottleneck, even (and sometimes especially) in RAID.

Hisham Eldai said...

This is an interesting suggestion; I have always played around with the idea of venturing into Amazon EC2, but it has not been all that friendly to me. Now that you have delved into it, what are your comments on how I should best get started?

The combination of EC2 with Star::Cluster should really be an awesome addition to the toolset, and in a sandbox at that. I had to reinstall my OS on a couple of occasions, and having to reinstall all the bioinformatics tools is not friendly at all... So thanks for this encouraging post.
