Tuesday, March 29, 2016

At the Edge of The Cloud

I've used cloud computing at Amazon Web Services (AWS) off and on now for over five years.  The cloud has all sorts of handy advantages -- flexible access to large amounts of compute, inexpensive access to any flavor of Linux you wish, the ability to guiltlessly kill a huge server you just fatally cratered with the wrong command.  And until now, I've always been able to find machines that fit my needs -- perhaps sometimes just fitting, or with a bit of compromise.  But now I've hit the wall: at this time, nobody offers a really serious cloud machine with 500Gb of RAM.

This all started because I wanted to play around with assembling some really big Illumina datasets.  Now, I've done this a lot over the last few years and always found a way.  For a long while Ray was my favorite assembler, because I could successfully distribute really large jobs across many nodes of a cluster.  More recently, I've discovered Megahit (arXiv, final($), github), which really does work remarkably well on big datasets on a single node.  But I wanted to play around with a particular tool (which shall remain anonymous), and discovered it is recommended to be run with 500Gb of RAM.

We have a wonderful cluster and at one point outfitted several nodes with 250Gb, but none with 500Gb.  Indeed, I think the architecture can't handle it -- and even if it can, we probably don't have the right memory units.  Anyway, that requires getting sysadmins and such involved -- obviously the cloud is the solution, as so often it has been.

Nope.  AWS's biggest instances -- several, but not all, of the instance types ending in 8xlarge -- sport 244Gb.  I googled around and found a bit of interest on one of their support sites a few years ago, but not much else.  Okay, time to start looking elsewhere -- Amazon is nice and familiar, but surely I can learn another provider!

Well, Google tops out just over 200Gb. IBM does too -- at least it did when I first checked; now I can't find their instance types page.  Now, Rackspace does have a 500Gb machine -- if you dig hard. You won't find it on their main pricing page, but I stumbled on a blog post mentioning their 500Gb instance in their "bare metal" offering (no virtual machine). At the risk of looking a gift horse in the mouth, I can't help but point out it has only 12 cores.

An awful lot of people, including myself, have gotten an awful lot of bioinformatics done with smaller machines.  Indeed, many interesting tasks can run on typical desktops or laptops.  Many tools are more compute-hungry than memory-hungry (or at least, all-in-one-huge-chunk ravenous), and so run well on clusters.  I checked with our computational chemist, and his machine is nothing special in the RAM department (16Gb).

So how much does the bioinformatics community need 500Gb of RAM?  A bit of Googling found a few interesting points.  As expected, most papers I found mentioning such monsters were assembling short reads with de Bruijn approaches -- but not exclusively.  I've also noted the author lists for these papers are enriched with people I follow on Twitter.  A paper from Zamin Iqbal and colleagues genotyping large numbers of microbes reported initially needing a 300Gb de Bruijn graph, though this was pared to only 100Gb with a different implementation.  A paper building a giant tree representation of gene information used a 500Gb machine, as did a presented assembly effort on tunicate genomes.  The Parsnp paper from Adam Phillippy on genome comparisons used a 1Tb RAM machine, as did the BESST scaffolding paper. The winner on extravagant amounts of RAM for bioinformatics (my first two computers each started with 1 kilobyte of RAM!) appears to be Blacklight, with 16Tb in each of two partitions!

Okay, I'm envious of all that RAM -- and it is likely some other folks are as well.  This red clover assembly paper reported their program failing on a 500Gb machine, as did a paper trying to enumerate repetitive elements in the human genome.  Of course, it may be that more clever methodologies (such as in the first case using khmer from Titus Brown's group for digital normalization) or more clever and/or distributed data structures (for the second, given that tools such as Minia can capture human genomes in a de Bruijn graph) could save these, but that isn't a question I would be equipped to handle.
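To get a feel for why a plain de Bruijn graph is so ravenous, and how succinct structures like Minia's escape, here's a rough back-of-envelope sketch.  The per-entry and bits-per-k-mer figures below are illustrative assumptions (the ~13 bits per node is the ballpark reported for Minia's cascading Bloom filters), not measurements from any particular tool:

```python
# Back-of-envelope RAM for a de Bruijn graph: naive hash table vs.
# a succinct Bloom-filter representation (as in Minia).

def naive_hash_gb(n_kmers, bytes_per_entry=32, load_factor=0.7):
    """Plain hash table: a packed 64-bit k-mer key (k <= 32) plus
    count, edge bits, and bucket overhead -- call it ~32 bytes/entry."""
    return n_kmers * bytes_per_entry / load_factor / 1e9

def bloom_gb(n_kmers, bits_per_kmer=13):
    """Cascading-Bloom-filter graph: on the order of ~13 bits per
    k-mer node, the figure reported for Minia."""
    return n_kmers * bits_per_kmer / 8 / 1e9

# ~2.5 billion distinct k-mers: a human-scale dataset, order of magnitude.
human = 2.5e9
print(f"naive hash table : {naive_hash_gb(human):5.0f} Gb")
print(f"Bloom-based graph: {bloom_gb(human):5.0f} Gb")
```

That works out to roughly 114Gb for the naive table versus about 4Gb for the succinct graph -- and since error k-mers in raw reads can inflate the distinct-k-mer count several-fold, a naive implementation climbs into the 500Gb territory above while the succinct one stays in single digits.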

That certainly isn't a comprehensive list; Google won't find things behind paywalls, and all too often machine specs are lacking from Materials and Methods sections.  I also haven't attempted to track down every case Google seems to find. But it does illustrate that there is demand out there.

Another indicator of interest is the number of fat machines turned up at core facilities.  Nothing as packed as Blacklight, but no shortage either of 1Tb and 2Tb RAM machines.  Perhaps it is my frustration talking, but I do wonder how effectively these machines are used as a group -- just as I sometimes wonder about sequencer utilization statistics.  Particularly when the inputs and outputs are so transportable, I can't help but wonder if having a smaller number of centralized facilities would have gotten the funders more bang for their buck.  

So, I may be learning Rackspace's ecosystem, and then deciding to live with only 12 cores.  Or crossing my fingers and hoping that Amazon launches a fatter memory machine instance type -- ideally with options for more cores!  But it is a useful reminder: cloud providers can be an extremely useful reservoir of compute power, but they do have limits -- and limits that are ever easier to bump into.


Unknown said...

You missed Azure -- the G series goes up to 448 GiB RAM, 32 cores, 6 TB SSD

JD said...

Yes, I have had similar wishes while using Google Compute Engine. But maybe at some point the tools (e.g. read mappers) and pipelines of tools will need to be designed from the ground up to use a good multi-machine framework. i.e. the capacity to generate more reads is outstripping the point where it makes sense to try to use single machines with more and more memory. But as usual in bioinformatics, I guess it all depends on the particular problem. I have been thinking about trying some bioinformatics pipelines on CoreOS/fleet/rkt...

Keith Robison said...


Thanks for catching that - I did look at Azure when researching this, but forgot about it while writing. Hadn't found that big machine though - now to look into it

Rick said...

My University-based sequencing core recently purchased six 512 GB 20-CPU nodes to complement our 256 GB nodes. Our cost wasn't that bad - about $10K each - but we do have to share them with other people. I suspect that the utilization of the 512G nodes will be similar to the 256G nodes: not very efficient, and full of jobs that do not require that much memory. But when you need the memory it is nice to have. Some programs just do not scale very well yet, and since they are touted as the 'best' tool for the job at hand, one wants to give them a trial -- thus it is handy to have those larger-memory machines.

I think part of the problem is the 'deeply furrowed' (as you put it in your last post) parts of bioinformatics where everyone thinks that they can write a better assembler/scaffolder/mapper/etc and while they may have a brilliant idea their implementation is lacking. Excessive or inefficient memory use, not being able to run on multiple machines and so on. They run their program on a smallish organism or with fewer reads, find out that the algorithm is "perfect", publish said program and then never polish it enough to run on bigger data sets.

Oh well, the above is an early morning before coffee rant. Now to sit down and figure out why my most recent BESST run died on one of the new 512G machines. Hope this doesn't mean I need a 1TB machine!

Angel said...

Ravi, you beat me to it!

To be fair to Keith though, the G series only shows up in the pricing page for the specific regions where it is deployed (https://azure.microsoft.com/en-us/regions/#services). I did not find documentation of all the possible instances. If there is such a link, it would be nice to post here.

AWS did announce the X1 series coming in "the first half of 2016" but these are not yet available. https://aws.amazon.com/blogs/aws/ec2-instance-update-x1-sap-hana-t2-nano-websites/


Jouni said...

A part of the problem is that the development of new algorithmic ideas into production-quality tools rarely takes less than 10 years. There are often many conceptual and theoretical problems to solve, before engineering and implementation issues even become the real bottleneck. If you want to solve a problem that was formulated only 5 years ago (e.g. due to new developments in sequencing technology), nobody in the world probably knows how to do it efficiently. Some people may have ideas, but investigating them will take years. Meanwhile, you're stuck in trying to solve the state-of-the-art problem with state-of-the-art hacks, which often require state-of-the-art hardware.

Communication issues are another part of the problem. Bioinformatics jargon is so different from algorithms jargon that it often becomes a major obstacle to the spread of ideas between the fields. As an algorithms researcher working with bioinformatics tool developers, I often encounter that in my work. Sometimes I follow a discussion for a long time without really understanding it. Then someone states the problem in a different way, and the problem and its existing solutions are immediately obvious.

Keith Robison said...

Angel: Thanks -- X1 will be sweeeet when it comes on-line - up to 2Tb of memory and 100 virtual CPUs!

It did take a little poking to find those G instances for the reason you gave -- I would prefer it show the unavailable types greyed out (or better yet, a grid which showed which types are in which regions). I must say that of these vendors, I find Amazon's consolidated instance type grid the best.

Unknown said...

There's a blog post describing the G series here

If you go to the pricing page and select West US the pricing will show up:

If you have a different region, it just says the G-series is not available in this region, which is unfortunately rather unhelpful.


Nick Loman said...

Hi Keith

We have established the CLIMB project in the UK to help solve this problem for large metagenome assemblies for microbial genomics researchers. We offer up to 3Tb RAM machines at http://www.climb.ac.uk.

Specific to UK only at the moment, but we may be able to discuss collaborations if they are related to medical microbial genomics (new antibiotic discovery would count, for example).


homolog.us said...

"But I wanted to play around a particular tool (to remain anonymous), and discovered it is recommended to be used with 500Gb of RAM."

Most programs using 512 Gb RAM were written in an era when short-read assembly was not properly understood. I do not understand why one would need 512 Gb RAM in 2016.

Keith Robison said...

Surprisingly, the tool I wish to play with is very new, not very old -- though perhaps it has inherited inefficient structures from predecessors, or at least has not tried to use the very efficient ones developed for programs such as Minia.

homolog.us said...

The tool may be new, but the short read assembly steps being used by it are not new. So, you have to replace the memory-inefficient steps with efficient steps.

Please email me, if you need more help.

homolog.us said...

If you are repeatedly doing assembly work, you will see the need to keep an in-house high-RAM server.
If that becomes the case, I can share another tip that you may find useful.

For about $1.5K, you can get HP G8 servers from eBay.

These servers take 384GB-768GB of RAM.

So, add another $1.5K for memory and you are at 384GB.