Sunday, August 23, 2009

Genomes that begin with P: A follow-up

I'm really grateful for the comments on my bit about genome assembly. Two commenters (one of whom is a leading author in genome assembly) pointed out what I (the armchair genomicist) failed to cover, and even better, I got some first-hand information from the scientist who assembled the published platypus genome. A better education in practical genomics, with the tuition being a bit of public embarrassment: that's a sweet deal.

The impact of diploidy was something I'll confess hadn't crossed my mind; I think the topic of different assembly algorithms did briefly flit through. As a bioinformatician, I'm a bit red-faced to have ignored the impact of better algorithms. It is a bit surprising to learn that there is still a bit of art to assembly.

I also wanted to throw in something else which crossed my mind, but somehow dropped out of the final cut. Now, part of the trouble is that all I have to work with here is a small blurb in some promotional material from Illumina, which makes the following claim about the assembly of the panda genome:

"Using paired reads averaging 75 bp, BGI researchers generated 50X coverage of the three-gigabase genome with an N50 contig size of ~300 kb."

If this was done purely from the next-gen data, it is remarkable; this is a contig N50 similar to the supercontig N50 for platypus. Is panda just an easier genome? Does the 50X oversampling (as opposed to 6X for platypus) make the difference? Or is it mostly due to the very clever current breed of algorithms? How much worse would the assembly be if the same amount of data came from unpaired reads?
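
Since the whole comparison hangs on N50, here is a minimal sketch of how that statistic is usually computed: sort the contig (or supercontig) lengths from longest to shortest and report the length at which the running total first reaches half of the summed length. The function name `n50` and the toy lengths are my own illustration, not anything taken from the panda or platypus assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up lengths (not real assembly data).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

The same function applies to contigs and to scaffolds; only the list of lengths changes, which is why a contig N50 and a scaffold N50 for the same assembly can differ enormously.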

With luck, once the panda genome is published all the underlying read data will be as well, which will mean many of these questions can be addressed computationally; the last three questions above are all ripe for testing.
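
As a rough idea of the arithmetic behind such a test, the sketch below works out how many 75 bp reads the quoted 50X of a three-gigabase genome implies, and what fraction of them one would keep to mimic 6X. The function names are hypothetical, the numbers come only from the blurb quoted above, and the calculation deliberately ignores everything except depth (read length, pairing and coverage bias all matter too).

```python
def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads required to reach a given average coverage.

    Uses coverage = (reads * read_length) / genome_size, rounded up.
    """
    total_bases = genome_size_bp * coverage
    return -(-total_bases // read_length_bp)  # integer ceiling division


def downsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep to go from current to target coverage."""
    if target_coverage > current_coverage:
        raise ValueError("cannot downsample to a higher coverage")
    return target_coverage / current_coverage


# Figures from the quoted blurb: 3 Gb genome, 50X, 75 bp reads.
print(reads_needed(3_000_000_000, 50, 75))
# Keeping this fraction of the reads (or read pairs) would leave about 6X.
print(downsample_fraction(50, 6))
```

For paired data, one would want to subsample whole pairs rather than individual reads so that the pairing information survives; dropping the pairing on purpose is then a separate experiment.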


Steven Salzberg said...

Certainly the assembly at 50X coverage will be far better than 6X coverage, but we don't know precisely how the read length differences affect this - 75bp reads, paired, at 50X could produce an excellent assembly if (1) the pairs were a sufficient distance apart and (2) the coverage was uniform.

But I think the Illumina protocol used by BGI for the panda genome had the paired-ends only about 200-300 bp apart, which will not allow them to span repeats longer than that amount. Also, we have seen biases in coverage in some projects, which means that some parts of the genome are left out (or are highly fragmented).

BGI has said that they used a new short-read de novo assembler, SOAPdenovo, for the panda genome. However, nothing is published - it's all press releases so far. So I remain skeptical of their announced N50 contig size as well as any other statistics. They need to release the panda sequence data and the assembly, and then we can evaluate it. (To their credit, they have made SOAPdenovo available for download, but only as compiled binaries.)

I'm never comfortable with "science by press release," and I'm afraid that's what the panda genome is, so far.

Keith Robison said...

Thanks! I was curious about what library size(s) were used -- a single short library would seem incapable of producing those results.

Anonymous said...

it's a bit confusing and perhaps a (rather convenient) error on their part. my own notes from AGBT as well as reports I read from PAG suggest a 300 kb N50 _scaffold_ size, and 10 kb N50 contig size.