Thursday, May 16, 2013

An Incomplete Guide to Asking for Help on Your De Novo Genome Project

I've been thinking about this piece for a while, because I am a frequent presence on bioinformatics forums and often dive into questions regarding de novo sequence assembly, particularly for small genomes.  It's good to help out and a way to feel like one is contributing to a broader community, but sometimes it can be very frustrating because the seekers (SEQers?) of help do not pose their questions very well.  So, it would be helpful to have a post to point them to, though I'm sure there are considerations I either haven't thought of or will fail to remember to add.  Those can either go into the comments or a future post, or perhaps something can go in the Wiki at SEQAnswers.

But in general, think of it this way: you have some experienced hands in a field you wish to enter, who are willing to give detailed advice for free.  But, they can't give that advice unless you specify your question well, and if you don't get it right the first time they may not see (or may ignore) your second shot.

Ask in the Right Places

There are a number of forums for discussing bioinformatics out there.  SEQAnswers is a good one; I don't frequent  but the times I've landed there it looks good.  On the other hand, the various mailing lists on LinkedIn seem to be short of experts, and I recently saw some really awful advice given out (subject perhaps for a future post), though corrected by someone else.  The approach I really can't understand is asking for help on the Biotech Rumor Mill or similar gossip sites.

Ask for Help Before You Start

It's not uncommon for posters to say they already have the data and they can't make it assemble.  What is unfortunate is that in many cases the experimental design is problematic, or at least unconventional, and if they had asked earlier then maybe they wouldn't have made certain decisions, or at least been realistic about the impact of those decisions.  I fear more than a few of these questioners are novices with hands-off PIs who have just told them to go sequence, without any strong guidance.

I'm going to attempt to say there's not one right approach to sequencing a genome, though for small genomes (< 20Mb) I'm starting to think there is pretty much one way: Pacific Biosciences long libraries at 100+X coverage.  As shown in a recent pre-print, this strategy will assemble most bacterial genomes into a very small number of contigs, perhaps one per replicon.  In my own hands, it isn't quite that good on difficult genomes, but it is the best out there.  If you are sequencing only a single small genome, Pacific Biosciences may even be cost competitive with a single MiSeq run.  Clearly here I'm talking renting, not buying; but there's a lot of free capacity you can rent at a modest price.  At this moment in time (May 2013), a bacterial genome can be sequenced for about $1K plus perhaps $100 of processing if you do it at Amazon (as noted by a commenter on a previous post, other platforms spit out ready-to-go data but PacBio requires a bunch of processing; don't let that scare you off!).

Illumina sequencing is still a good approach, particularly for larger genomes or if you wish to sequence multiple genomes inexpensively, but the number of contigs will be larger.  Moleculo and various mate pair strategies may have a role, but that's still being sorted out and mixing in some long PacBio reads is another approach here.  As with another company that started with 'I', nobody was ever fired for using Illumina.

Ion Torrent or Proton can get the job done cheaply, but in my hands the assemblies are lower quality and full of insertion/deletion errors (my PacBio assemblies do have a few, but nothing like Ion).  There just aren't as many assemblers tuned for this data as for Illumina, nor are there as many tools for tweaking the reads.  454 is the old standby, but I'd argue it gives inferior results at a higher cost than PacBio.

So what's left?  Someone recently wanted to assemble SOLiD reads, and while there are published de novo assemblies from SOLiD that's definitely going down an unconventional path; don't expect much seasoned advice in that space.  I've never seen anyone insane enough to attempt assembly from Helicos data, but perhaps this post will lead someone to insist on it!

The other strategy choice that pains me is single end Illumina data.  Now, that might be a necessary choice sometimes for cost, but in general it is my hunch (I'm not sure there's a good study to back it up) that paired ends that are each half as long will be superior to longer single end reads.  Length matters for assembly, and you really want long -- or you need to adjust your expectations for the results.  Which is a good segue to

Why Are You Sequencing This Genome?

The science you wish to drive with a genome sequence can significantly affect your choices, particularly if you are on a limited budget or sequencing multiple individuals.  Are you trying to find SNPs?  Structural variations?  Trying to count repeats?  Trying to get a crude gene census or a high resolution map?  These can all influence strategy choices.  As I mentioned above, Illumina is still the king of high-quality, low-cost sequencing IF you can batch samples appropriately.  So if you wish to sequence 20 bacterial isolates, you probably will go all Illumina.  Except, if you want high quality structural variation info, you might want to pay a bit more and go the PacBio route, sequencing one to great depth and the rest at a lower coverage (and perhaps with the longer but less accurate XL/XL chemistry) to detect structural variants -- but perhaps giving up SNP discovery.  If you are in a big genome and are most interested in polymorphic markers, RAD-Seq might be a good choice, enabling screening more lines for diversity at the expense of getting even a draft genome sequence.

By The Way, What Are You Sequencing?

It's a bit depressing how often questioners say absolutely nothing about what they are sequencing.  Now, perhaps you don't wish to tip your hand on the precise organism, but the characteristics matter.  Any estimate of genome size is valuable for estimating coverage, though after you do the sequencing you may revise that estimate and order more.  This is one more nice aspect of PacBio; the incremental cost of additional sequence is somewhere in the $300-$500 range, so one can order a number of cells to cover a low estimate of genome size, assemble that data, and then decide how much more you want.  This is in contrast to most platforms where a quantum of sequencing may be several thousand dollars or much worse.
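The coverage arithmetic behind that advice is simple enough to sketch.  The snippet below is only an illustration, not anything from a particular vendor's tools: the function names are made up for this example, and the yield per sequencing unit (a cell, lane or run) is a hypothetical number you would replace with whatever your provider quotes.

```python
import math


def expected_coverage(total_bases, genome_size):
    """Average fold coverage: sequenced bases divided by genome size (both in bp)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def units_needed(target_coverage, genome_size, bases_per_unit):
    """Smallest whole number of sequencing units that reaches the target coverage.

    bases_per_unit is an assumed yield per cell/lane/run, not a vendor figure.
    """
    if bases_per_unit <= 0:
        raise ValueError("bases_per_unit must be positive")
    return math.ceil(target_coverage * genome_size / bases_per_unit)


if __name__ == "__main__":
    low_estimate = 5_000_000          # 5 Mb: a low guess at the genome size
    hypothetical_yield = 200_000_000  # assumed bases per unit, for illustration only
    print(units_needed(100, low_estimate, hypothetical_yield))
    # If the assembly suggests the genome is really 8 Mb, recompute what you have
    print(expected_coverage(3 * hypothetical_yield, 8_000_000))
```

Running the numbers first against the low genome size estimate, then again against the size suggested by the first assembly, mirrors the order-some-then-order-more approach described above.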

Also critical is to know the %G+C content, which can be estimated from phylogenetic placement.  Extremes in G+C content -- at both ends -- demand PCR-free or limited PCR sample preparation methods.  Is there anything known about the repeats in the genome?  Some repeats can't be solved by any existing technology; others may require special handling (or simply throwing up your hands).  If there is an estimate of the number of chromosomes, that's valuable too.  Often questioners state how many contigs they have without pondering how many they expect to have; in some organisms 1,000 contigs might well be a finished genome!  Ploidy is critical too: haploid organisms are the easiest, diploids are common, and assembling polyploids seems to be an underexplored area of research.
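If a sequence from a close relative is available, a quick G+C estimate takes only a few lines.  This is a minimal sketch with a made-up function name; it assumes you already have the sequence as a plain string and it ignores ambiguous bases such as N.

```python
def gc_fraction(sequence):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / (gc + at)


if __name__ == "__main__":
    # Toy sequence for illustration only, not real data
    print(round(100 * gc_fraction("ATGCGCGCNNATAT"), 1), "% G+C")
```

Quoting even a rough number like this in a question tells readers whether library bias from extreme G+C is worth worrying about.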

Library Details Matter!

If you are planning an experiment, or have already generated data, it is important to think about library details, particularly with Illumina and PacBio.  For Illumina, there are the different library construction kits (TruSeq, Nextera and various 3rd party kits) which may influence coverage and certainly any adapter trimming strategies.  How many cycles of PCR were used?  

For PacBio, the size distribution of the DNA is important; the higher the quality of the DNA, the longer your libraries will be and the higher the number of super-long reads you will see.

I'm less familiar with details on Ion Torrent libraries or 454 (if you really insist on paying a premium for a conservative strategy), but it is certainly worth mentioning which kit you used.

What's Your Informatics Plan? 

Another shocker: the frequency at which folks ask whether "X contigs" or "an N50 of Y" is good, without mentioning the assembler used! (or the coverage, library details, etc.).  Knowing which assembler was used is crucial!  If you've used Ray, MIRA or CABOG I might be able to help, and perhaps with velvet and a few others; on the other hand if you are using CLCBio or DNA*STAR I'll steer clear because I have no experience.
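For anyone unsure what the N50 figure actually measures, here is a small sketch using the common definition: the length of the contig at which the running total, counting from the longest contig down, first reaches half of the total assembled bases.  The function name is illustrative, and individual assemblers may report the statistic with slightly different conventions (for example, after applying a minimum contig length cutoff).

```python
def n50(contig_lengths):
    """Contig length at which the cumulative sum (longest first) reaches half the total."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy contig lengths for illustration only
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because the number depends on the total assembled length and on any contig filtering, it means little without the assembler, coverage and library details alongside it.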

There's also the whole issue of how to pre-process reads.  I'm a bit lax about trimming adapters and haven't seen much evidence it has hurt: but have I looked carefully enough?  I am a big fan of MUSKET and FLASH for Illumina read processing, though I still debate the correct order.  Other groups love their quality-based trimming.  There was a new tool out for "growing" Illumina paired end reads (combining very similar ones into superreads; a sort of pre-assembly), but I couldn't get it to work.  In any case, if you are doing these you need to mention it, as it is critical.  If you aren't, then perhaps you should look at trying some of these.

How Much 'ya Got?

Costs matter.  While there are some problems in genome sequencing that can't be solved at any cost (sequencing eukaryotic centromere regions, for example), for most problems it is a question of how much you are willing to spend.  If you ask "what's the best way to sequence my diploid 3Gb genome", the answer might be "build a high-quality physical map, clone everything into BACs, identify a tiling set and sequence those", but very few people are going to take that seriously due to the exorbitant cost.  One way to address the problem is to think of the lowest cost obvious solution in your space, and then ask how it doesn't support the science you wish to do.  As noted above, Illumina paired end sequencing at 100X is a good opening bid, but the contigs will have gaps.  If that doesn't matter, then perhaps you're done.  If not, then you need to figure out a good strategy to do better.

All the Stuff I Forgot

As I said at the outset, I've almost certainly forgotten something.  Indeed, I have a somewhat perverse ritual around this blog: generally after posting an entry I go for a walk with Miss Amanda, and that often spurs me to remember some gross omission.  However, I've committed to not updating posts unless they are bordering on libel, so those omissions generally remain so.  Plus, there's the things I wouldn't think of until I see an actual example, and the stuff I don't do (but probably should) that others in the field do.  So make sure you put all that in your query!


Unknown said...

Great stuff! But you should fix the following typo to prevent confusing future readers:

"At this moment in time (May 2012)..."

Keith Robison said...

Shawn: thanks for catching that! Fixed.

Nandita Mullapudi said...

Nice consolidation - another good resource is mailing lists of assembly programs such as Mira etc. If not anything, at least you find that there are many that share your pain.

Genohub said...

Excellent post. What do you think about PacBio's hierarchical genome-assembly process? They claim to achieve accurate assembly of microbial species using data from just one long read SMRT shotgun library.

Keith Robison said...

SMRT works well on very difficult genomes; if your goal is a high quality assembly for any genome less than 20Mb in size, I think it is the clear choice. I would certainly consider it for much larger, but the costs vs. ILMN start to get ugly.