Comments on Omics! Omics!: At the Edge of The Cloud

If you are repeatedly doing assembly work, you wil...

2016-03-31T12:15:58.168-04:00

If you are repeatedly doing assembly work, you will see the need to keep a high-RAM server in-house.
If that becomes the case, I can share another tip that you may find useful.

For about $1.5K, you can get HP DL360 G8 servers from eBay.

http://www.ebay.com/sch/i.html?_odkw=dl360+g8&_osacat=0&_from=R40&_trksid=p2045573.m570.l1313.TR0.TRC0.H0.TRS0&_nkw=dl360+g8&_sacat=0

These servers take 384 GB to 768 GB of RAM.

http://www8.hp.com/h20195/v2/GetHTML.aspx?docname=c04123167

So, add another $1.5K and you are at 384GB.

http://www.amazon.com/Kingston-Technology-16-PC3-12800-KVR16R11D4/dp/B0088SSUTO/ref=pd_sim_sbs_147_1/175-2649101-6950441?ie=UTF8&dpID=41jo1zCXVkL&dpSrc=sims&preST=_AC_UL160_SR160%2C160_&refRID=00JAFBT80FB4NDM8ETTH
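
For anyone checking the arithmetic above, here is a rough sketch of the build cost. The two $1.5K figures and the 384 GB target come from this comment; the 16 GB module size is read off the linked Kingston part, and the 24 DIMM slots are an assumption that should be checked against the HP spec sheet. All names in the snippet are made up for illustration.

```python
import math

# Figures quoted in the comment above (approximate 2016 prices).
SERVER_USD = 1500      # used HP DL360 G8 server
RAM_BUDGET_USD = 1500  # budget for memory
TARGET_RAM_GB = 384

# Assumptions, not stated in the comment: 16 GB modules (per the linked
# Kingston part) and 24 DIMM slots; verify against the HP spec sheet.
MODULE_GB = 16
DIMM_SLOTS = 24


def modules_needed(target_gb: int, module_gb: int) -> int:
    """Number of equal-size modules required to reach target_gb."""
    return math.ceil(target_gb / module_gb)


modules = modules_needed(TARGET_RAM_GB, MODULE_GB)
fits_in_chassis = modules <= DIMM_SLOTS
total_usd = SERVER_USD + RAM_BUDGET_USD
usd_per_module = RAM_BUDGET_USD / modules
usd_per_gb = total_usd / TARGET_RAM_GB

print(f"{modules} x {MODULE_GB} GB modules, fits: {fits_in_chassis}")
print(f"total ${total_usd}, ${usd_per_module:.2f} per module, ${usd_per_gb:.2f} per GB")
```

Under these assumptions the RAM budget works out to about $62.50 per module, so the 384 GB figure only holds if modules can actually be found near that price.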

The tool may be new, but the short read assembly s...

2016-03-31T11:15:02.101-04:00

The tool may be new, but the short read assembly steps being used by it are not new. So, you have to replace the memory-inefficient steps with efficient steps.

Please email me if you need more help.

Surprisingly, the tool I wish to play with is very...

2016-03-31T10:57:09.110-04:00

Surprisingly, the tool I wish to play with is very new, not very old -- though perhaps it has inherited inefficient structures from its predecessors, or at least has not tried to use the very efficient ones developed for programs such as Minia.

"But I wanted to play around a particular too...

2016-03-31T10:55:05.717-04:00

"But I wanted to play around a particular tool (to remain anonymous), and discovered it is recommended to be used with 500Gb of RAM."

Most programs that need 512 GB of RAM were written in an era when short-read assembly was not properly understood. I do not understand why one would need 512 GB of RAM in 2016.

Hi Keith We have established the CLIMB project in...

2016-03-31T03:51:14.961-04:00

Hi Keith

We have established the CLIMB project in the UK to help solve this problem for large metagenome assemblies for microbial genomics researchers. We offer machines with up to 3 TB of RAM at http://www.climb.ac.uk.

This is UK-only at the moment, but we may be able to discuss collaborations if they are related to medical microbial genomics (new antibiotic discovery would count, for example).

Best
Nick

There's a blog post describing the G series he...

2016-03-30T11:54:40.718-04:00

There's a blog post describing the G series here:
https://azure.microsoft.com/en-us/blog/largest-vm-in-the-cloud/

If you go to the pricing page and select West US the pricing will show up:
https://azure.microsoft.com/en-us/pricing/details/virtual-machines/

If you select a different region, it just says the G-series is not available in that region, which is unfortunately rather unhelpful.

Ravi

Angel: Thanks -- X1 will be sweeeet when it comes...

2016-03-30T11:31:50.368-04:00

Angel: Thanks -- X1 will be sweeeet when it comes on-line - up to 2 TB of memory and 100 virtual CPUs!

It did take a little poking to find those G instances for the reason you gave -- I would prefer it show the unavailable types greyed out (or better yet, a grid which showed which types are in which regions). I must say that of these vendors, I find Amazon's consolidated instance type grid the best.

A part of the problem is that the development of n...

2016-03-30T11:09:23.245-04:00

Part of the problem is that the development of new algorithmic ideas into production-quality tools rarely takes less than 10 years. There are often many conceptual and theoretical problems to solve before engineering and implementation issues even become the real bottleneck. If you want to solve a problem that was formulated only 5 years ago (e.g. due to new developments in sequencing technology), probably nobody in the world knows how to do it efficiently. Some people may have ideas, but investigating them will take years. Meanwhile, you're stuck trying to solve the state-of-the-art problem with state-of-the-art hacks, which often require state-of-the-art hardware.

Communication issues are another part of the problem. Bioinformatics jargon is so different from algorithms jargon that it often becomes a major obstacle to the spread of ideas between the fields. As an algorithms researcher working with bioinformatics tool developers, I often encounter that in my work. Sometimes I follow a discussion for a long time without really understanding it. Then someone states the problem in a different way, and the problem and its existing solutions are immediately obvious.

Ravi, you beat me to it! To be fair to Keith tho...

2016-03-30T10:24:06.464-04:00

Ravi, you beat me to it!

To be fair to Keith though, the G series only shows up on the pricing page for the specific regions where it is deployed (https://azure.microsoft.com/en-us/regions/#services). I did not find documentation listing all the possible instances. If there is such a link, it would be nice to post it here.

AWS did announce the X1 series coming in "the first half of 2016" but these are not yet available. https://aws.amazon.com/blogs/aws/ec2-instance-update-x1-sap-hana-t2-nano-websites/

-angel

My University-based sequencing core recently purch...

2016-03-30T07:49:52.332-04:00

My University-based sequencing core recently purchased six 512 GB 20-cpu nodes to complement our 256 GB nodes. Our cost wasn't that bad - about $10K each - but we do have to share them with other people. I suspect that utilization of the 512G nodes will be similar to that of the 256G nodes: not very efficient, and full of jobs that do not require that much memory. But when you need the memory, it is nice to have. Some programs just do not scale very well yet. Since they are touted as the 'best' tool for the job at hand, one wants to give them a trial, so it is handy to have those larger-memory machines.

I think part of the problem is the 'deeply furrowed' (as you put it in your last post) parts of bioinformatics where everyone thinks that they can write a better assembler/scaffolder/mapper/etc. While they may have a brilliant idea, their implementation is lacking: excessive or inefficient memory use, not being able to run on multiple machines, and so on. They run their program on a smallish organism or with fewer reads, find out that the algorithm is "perfect", publish said program and then never polish it enough to run on bigger data sets.

Oh well, the above is an early morning before coffee rant. Now to sit down and figure out why my most recent BESST run died on one of the new 512G machines. Hope this doesn't mean I need a 1TB machine!

Ravi, Thanks for catching that - Indid look at A...

2016-03-29T22:52:16.880-04:00

Ravi,

Thanks for catching that - I did look at Azure when researching this, but forgot about it while writing. Hadn't found that big machine though - now to look into it.

Yes, I have had similar wishes while using Google ...

2016-03-29T22:34:36.666-04:00

Yes, I have had similar wishes while using Google Compute Engine. But maybe at some point the tools (e.g. read mappers) and pipelines of tools will need to be designed from the ground up to use a good multi-machine framework. That is, the capacity to generate more reads is passing the point where it makes sense to keep using single machines with more and more memory. But as usual in bioinformatics, I guess it all depends on the particular problem. I have been thinking about trying some bioinformatics pipelines on CoreOS/fleet/rkt...

You missed Azure, the G series goes up to 448GiB R...

2016-03-29T22:19:28.575-04:00

You missed Azure: the G series goes up to 448 GiB RAM, 32 cores, and 6 TB SSD.