This all started because I wanted to play around assembling some really big Illumina datasets. Now, I've done this a lot over the last few years and always found a way. For a long while Ray was my favorite assembler, because I could successfully distribute really large jobs across many nodes of a cluster. More lately, I've discovered Megahit (arXiv, final($), github), which really does work remarkably well on big datasets on a single node. But I wanted to play around with a particular tool (which will remain anonymous), and discovered that it is recommended to run with 500Gb of RAM.
We have a wonderful cluster and at one point outfitted several nodes with 250Gb, but none with 500Gb. Indeed, I think the architecture can't handle it -- and even if it can, we probably don't have the right memory units. Anyway, that would require getting sysadmins and such involved -- obviously the cloud is the solution, as it so often has been.
Nope. Amazon AWS's biggest instances -- several but not all of the instance types ending in 8xlarge -- sport 244Gb. I googled around and there was a bit of interest on one of their support sites a few years ago, but not much else. Okay, time to start looking elsewhere -- Amazon is nice and familiar, but surely I can learn another provider!
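(For anyone wanting to re-check the state of play themselves -- offerings change over time -- here is a rough sketch of my own, not anything from the tools above, that surveys EC2 instance types by memory with boto3. It assumes you have AWS credentials configured; treat the output as a snapshot, not a promise.)

```python
# Rough sketch (my own assumption of how one might check, not any tool's
# recommendation): list the largest-memory EC2 instance types via boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
paginator = ec2.get_paginator("describe_instance_types")

types = []
for page in paginator.paginate():
    for t in page["InstanceTypes"]:
        mem_gib = t["MemoryInfo"]["SizeInMiB"] / 1024.0
        vcpus = t["VCpuInfo"]["DefaultVCpus"]
        types.append((mem_gib, vcpus, t["InstanceType"]))

# Show the ten largest-memory instance types offered in this region.
for mem_gib, vcpus, name in sorted(types, reverse=True)[:10]:
    print(f"{name}: {mem_gib:.0f} GiB RAM, {vcpus} vCPUs")
```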
Well, Google tops out at just over 200Gb. IBM does too -- at least it did when I first checked; now I can't find their instance types page. Now, Rackspace does have a 500Gb machine -- if you dig hard. You won't find it on their main pricing page, but I stumbled on a blog post mentioning a 500Gb instance in their "bare metal" offering (no virtual machine). At the risk of looking a gift horse in the mouth, I can't help but point out it has only 12 cores.
An awful lot of people, including myself, have gotten an awful lot of bioinformatics done with smaller machines. Indeed, many interesting tasks can run on typical desktops or laptops. Many tools are more compute-hungry than memory-hungry (or at least, than all-in-one-huge-chunk ravenous), and so run well on clusters. I checked with our computational chemist, and his machine is nothing special in the RAM department (16Gb).
So how badly does the bioinformatics community need 500Gb of RAM? A bit of Googling turned up a few interesting points. As expected, most papers I found mentioning such monsters were assembling short reads with de Bruijn approaches -- but not exclusively. I've also noted that the author lists for these papers are enriched with people I follow on Twitter. A paper from Zamin Iqbal and colleagues genotyping large numbers of microbes reported initially needing a 300Gb de Bruijn graph, though this was pared down to only 100Gb with a different implementation. A paper building a giant tree representation of gene information used a 500Gb machine, as did a presented assembly effort on tunicate genomes. The Parsnp paper from Adam Phillippy on genome comparisons used a 1Tb RAM machine, as did the BESST scaffolding paper. The winner for extravagant amounts of RAM in bioinformatics (my first two computers each started with 1 kilobyte of RAM!) appears to be Blacklight, with 16Tb in each of two partitions!
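For a sense of where numbers like that come from, here is a back-of-envelope I find useful. The assumed figures below are entirely my own, not from any of the papers above: the point is simply that sequencing errors in high-coverage Illumina data spawn enormous numbers of spurious k-mers, and a naive k-mer table pays memory for every one of them.

```python
# Back-of-envelope with my own assumed numbers (not from the papers above):
# why a naive de Bruijn graph of a big Illumina dataset wants hundreds of Gb.
reads = 1_000_000_000        # 1 billion 100-bp reads (~100x of a ~1 Gb genome)
read_len = 100
error_rate = 0.01            # roughly one base-call error per read
k = 31

genomic_kmers = 1_000_000_000           # ~one distinct k-mer per genomic position
errors = reads * read_len * error_rate  # total sequencing errors in the dataset
spurious_kmers = errors * k             # each error can create up to k novel k-mers

bytes_per_kmer = 16          # packed k-mer + count + hash-table overhead (assumed)
total_gb = (genomic_kmers + spurious_kmers) * bytes_per_kmer / 1e9
print(f"~{total_gb:.0f} Gb for a naive k-mer table")  # ~500 Gb with these numbers
```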
Okay, I'm envious of all that RAM -- and it is likely some other folks are as well. This red clover assembly paper reported their program failing on a 500Gb machine, as did a paper trying to enumerate repetitive elements in the human genome. Of course, it may be that more clever methodologies (such as, in the first case, digital normalization using khmer from Titus Brown's group -- sketched below) or more clever and/or distributed data structures (for the second, given that tools such as Minia can capture a human genome in a de Bruijn graph on desktop-class hardware) could save these, but that isn't a question I'm equipped to answer.
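To make the first of those concrete, here is a toy sketch of the idea behind digital normalization -- emphatically not khmer's actual implementation, which uses a memory-efficient approximate counting structure rather than an exact dictionary -- just the core logic of dropping reads whose k-mers have already been seen enough times, so redundant coverage never reaches the assembler:

```python
# Toy illustration of digital normalization (not khmer's implementation):
# keep a read only if the median count of its k-mers is still below a cutoff.
from statistics import median

K = 20        # k-mer size (illustrative)
CUTOFF = 20   # drop reads whose median k-mer count has reached this

counts = {}

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def keep_read(seq):
    """Return True (and count the read's k-mers) if it still adds coverage."""
    kms = kmers(seq)
    if not kms:
        return False
    if median(counts.get(km, 0) for km in kms) >= CUTOFF:
        return False              # redundant read: discard it
    for km in kms:
        counts[km] = counts.get(km, 0) + 1
    return True

# Stream reads and keep only the informative ones.
reads = ["ACGTACGTTGCA" * 5] * 100   # 100 identical reads: most get dropped
kept = [r for r in reads if keep_read(r)]
print(f"kept {len(kept)} of {len(reads)} reads")
```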
That certainly isn't a comprehensive list; Google won't find things behind paywalls, and all too often machine specs are missing from Materials and Methods sections. I also haven't attempted to track down every case Google does seem to find. But it does illustrate that there is demand out there.
Another indicator of interest is the number of fat machines turning up at core facilities. Nothing as packed as Blacklight, but no shortage of 1Tb and 2Tb RAM machines either. Perhaps it is my frustration talking, but I do wonder how effectively these machines are used as a group -- just as I sometimes wonder about sequencer utilization statistics. Particularly when the inputs and outputs are so transportable, I can't help but wonder whether a smaller number of centralized facilities would have gotten the funders more bang for their buck.
So, I may be learning Rackspace's ecosystem, and then deciding to live with only 12 cores. Or crossing my fingers and hoping that Amazon launches a fatter-memory instance type -- ideally with options for more cores! But it is a useful reminder: cloud providers can be an extremely useful reservoir of compute power, but they do have limits -- limits that are ever easier to bump into.