Question to ponder: given a genome assembly, how much can we ascertain about why it didn't fully assemble?
I've gotten to thinking about this after reviewing the platypus genome paper. Yeah, it's a year plus old so I'm a bit behind the times, but it's relevant to a grand series of entries that I hope to launch in the very near future. An opinion piece by Stephen J. O'Brien and colleagues in Genome Research (one catalyst for the grand series) argues that the platypus sequence assembly is far less than what is needed to understand the evolution of this curious creature, and that the effort was severely hobbled by the lack of a radiation hybrid (or equivalent) map.
First some basic statistics. The sequencing was nearly entirely Sanger sequencing (a tiny smidgeon of 454 reads were incorporated) yielding about 6X sequence coverage and a final assembly of 1.84Gb. Estimates of the platypus genome size can be found in the supplementary notes and are in the neighborhood of 2.35Gb. Presumably losing 0.5Gb to really hideous DNA sequences isn't too bad. An estimate based on flow cytometry put the size closer to 1.9Gb, so perhaps little if anything is missing. The 6X estimate is based on one of the larger (2.4Gb) estimates.
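To make the dependence on genome size concrete, here is a minimal sketch of the coverage arithmetic. It assumes the 6X figure was computed against a 2.4Gb genome, as stated above, and back-calculates the implied total read bases; the function name `coverage` is just for illustration.

```python
def coverage(total_read_bases: float, genome_size: float) -> float:
    """Sequence coverage = total bases in reads / genome size."""
    return total_read_bases / genome_size

# Assumption from the text: 6X coverage was computed against a 2.4 Gb genome,
# which implies roughly 14.4 Gb of read data.
read_bases = 6.0 * 2.4e9

for label, size in [("2.4 Gb estimate", 2.4e9),
                    ("2.35 Gb estimate", 2.35e9),
                    ("1.9 Gb flow cytometry estimate", 1.9e9)]:
    print(f"{label}: {coverage(read_bases, size):.1f}X")
```

If the flow cytometry number is the right one, the same reads amount to noticeably more than 6X, so the choice of genome size estimate matters when comparing against other "6X" assemblies.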
The first issue with the platypus assembly is its relatively low connectedness: the contig N50 is only 13Kb. In other words, half of the final assembly is in contigs of 13Kb or shorter. In contrast, similar-coverage assemblies of mouse, chimp and chicken yielded N50 values in the range of 24-38Kb.
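For anyone who hasn't computed one, N50 is easy to get from a list of contig lengths. This is a small sketch of the standard definition (sort longest first, walk down until half the total bases are covered); the toy lengths are made up and have nothing to do with the platypus data.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")

# Hypothetical contig lengths in Kb (total 30, so the halfway mark is 15).
toy = [2, 3, 4, 5, 6, 10]
print(n50(toy))  # 10 + 6 = 16 >= 15, so this prints 6
```

The same function works for supercontigs; just feed it supercontig lengths instead.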
The supercontig N50 value is also low for platypus: 365Kb vs. 10-13Mb for the other three genomes. One possibility explored in the paper was a relatively low amount of fosmid (~40Kb insert) data for platypus. Removing fosmids from chimp had little effect on contigs (as expected; they're not a huge contributor to the total amount of read data) but a significant effect on supercontigs, knocking the N50 from 13.7Mb to 3.3Mb -- which is still about 9X better than the platypus assembly.
Looking at the biggest examples, on contigs platypus did okay (245Kb vs. 226-442Kb for the other species) but again on supercontigs it is small: 14Mb vs. 45-51Mb. Again, removing fosmid data from chimp hurt the assembly, but the biggest chimp supercontig was still twice the size of the largest platypus supercontig.
A little bit of the trouble may have been various contaminating DNA. A number of sequences were attributable to a protozoan parasite of platypuses and some other random pools are presumably sample mistrackings at the genome center (which is no knock; no matter how good your staff & LIMS, there's a lot of plates shuffling about with Sanger genome projects).
So what went wrong? What generally goes wrong with assemblies?
The obvious explanation is repetitive elements, and platypus (despite having a smaller genome than human) appears to be no slouch in this department. But this has an implication. If repeats are killing further assembly, then most contigs should end in repetitive elements. Indeed, they should have a terminal stretch of repetitive stuff around the same length as the read length. I'm unaware of anyone trying to do this sort of accounting on an assembly, but I haven't really looked hard for it.
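The accounting itself would not be hard. Here is a rough sketch, under some loud assumptions: the contigs are soft-masked (repeats in lowercase, unique sequence in uppercase), the terminal window is about one Sanger read (700 bases is my guess, not a number from the paper), and an end counts as "repetitive" if at least 80% of the window is masked. All function names and thresholds are hypothetical.

```python
def masked_fraction(seq: str) -> float:
    """Fraction of bases in seq that are lowercase (soft-masked as repeat)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)

def repeat_ended(contig: str, window: int = 700, threshold: float = 0.8):
    """Return (left_is_repeat, right_is_repeat) for one contig.
    window and threshold are illustrative assumptions."""
    left = contig[:window]
    right = contig[-window:]
    return (masked_fraction(left) >= threshold,
            masked_fraction(right) >= threshold)

def tally_ends(contigs, window: int = 700, threshold: float = 0.8) -> float:
    """Fraction of all contig ends that terminate in mostly masked sequence."""
    flags = [flag
             for contig in contigs
             for flag in repeat_ended(contig, window, threshold)]
    return sum(flags) / len(flags) if flags else 0.0

# Toy contigs: lowercase stands in for repeat-masked sequence.
toy = [
    "acgt" * 3 + "ACGT" * 10 + "ttaa" * 3,  # repeat at both ends
    "ACGT" * 10 + "gggg" * 2,               # repeat at the right end only
    "ACGT" * 5,                             # no repeats
]
print(tally_ends(toy, window=8))  # 3 of 6 ends are repeat-terminated
```

If the repeat hypothesis is right, the fraction for a real assembly should be well above whatever fraction of the genome is masked overall; if it is close to that background, something else is ending the contigs.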
A second possibility is undersampling, either random or non-random. Random undersampling would simply mean more sequencing on the same platform would help. Non-random undersampling would be due to sequences which clone or propagate poorly in E. coli (for Sanger) or PCR amplify poorly (for 454, Illumina & SOLiD). If platypus somehow was a minefield of hard-to-clone sequences, then acquiring a lot of paired-end / mate pair data on a next-gen platform (or simply lots of long 454 reads) might help matters. In the paper, only 0.04X 454 data was generated (and it isn't clear what the 454 read length was). Would piling on a lot more help?
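For the random case there is a handy back-of-envelope check. Under the idealized Lander-Waterman (Poisson) model, where reads land uniformly at random, the expected fraction of bases hit by no read at coverage c is e^-c. The sketch below applies that to the 2.4Gb size estimate; the model ignores cloning bias, repeats and the overlap needed to actually join reads, so treat it as a lower bound on the trouble, not a prediction.

```python
import math

def uncovered_fraction(coverage: float) -> float:
    """Expected fraction of bases with zero reads under a Poisson model."""
    return math.exp(-coverage)

genome_size = 2.4e9  # the larger size estimate quoted above

for c in (6, 8, 10):
    missing = uncovered_fraction(c) * genome_size
    print(f"{c}X: about {missing / 1e6:.2f} Mb expected with zero reads")
```

At 6X this comes out to only a few megabases, so purely random undersampling looks like a small part of the story compared with the gap between the 1.84Gb assembly and the larger size estimates.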
A related third possibility is assembly rot. Imagine there are regions which can be propagated in E. coli but with a high frequency of mutations. Different clones might have different errors, resulting in a failure to assemble.
In any case, it would be great for someone out there with some spare next-gen lanes to do a run on platypus. Even running one lane of paired-end Illumina would generate around 4-5X coverage of the genome for around $4K. For a little more, one could try using array-based capture to pull down fragments homologous to the ends of contigs, potentially bridging gaps. Even better would be to go whole hog with ~30X coverage from a single run for $50K (obviously not pocket change) and see how that assembly goes. Ideally a new run at the platypus would also include the other mammalian egg-layer, the echidna (well, pick one of the 4 species of them). Would it assemble any better, or are they both terrors for genome assemblers? Only the data will tell us.