Wednesday, August 19, 2009

Why do genome assemblies go bad?

Question to ponder: given a genome assembly, how much can we ascertain about why it didn't fully assemble?

I've gotten to thinking about this after reviewing the platypus genome paper. Yeah, it's a year plus old so I'm a bit behind the times, but it's relevant to a grand series of entries that I hope to launch in the very near future. An opinion piece by Stephen J. O'Brien and colleagues in Genome Research (one catalyst for the grand series) argues that the platypus sequence assembly is far less than what is needed to understand the evolution of this curious creature and that the effort was severely hobbled by the lack of a radiation hybrid (or equivalent map).

First, some basic statistics. The sequencing was nearly entirely Sanger sequencing (a tiny smidgen of 454 reads was incorporated), yielding about 6X sequence coverage and a final assembly of 1.84Gb. Estimates of the platypus genome size can be found in the supplementary notes and are in the neighborhood of 2.35Gb. Presumably losing 0.5Gb to really hideous DNA sequences isn't too bad. An estimate based on flow cytometry put the size closer to 1.9Gb, so perhaps little if anything is missing. The 6X estimate is based on one of the larger (2.4Gb) estimates.
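As a back-of-the-envelope check (my arithmetic, not numbers from the paper), coverage is just total read bases divided by genome size, so the same pile of reads looks like quite different coverage depending on which genome-size estimate you believe:

```python
def coverage(total_read_bases, genome_size_bp):
    """Raw sequence coverage: total bases sequenced / genome size."""
    return total_read_bases / genome_size_bp

# ~6X of the 2.4 Gb estimate implies roughly 14.4 Gb of Sanger reads.
reads = 14.4e9
for size in (2.4e9, 2.35e9, 1.9e9):
    print(f"genome {size / 1e9:.2f} Gb -> {coverage(reads, size):.1f}X")
```

The same data is ~6X against the 2.4Gb estimate but closer to 7.6X against the flow-cytometry figure.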

The first issue with the platypus assembly is its relatively low connectedness: an N50 value of only 13Kb; in other words, half of the final assembly was in contigs shorter than 13Kb. In contrast, similar-coverage assemblies of mouse, chimp and chicken yielded N50 values in the range of 24-38Kb.
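For readers new to the metric, N50 is simple to compute: sort the contig lengths in descending order and walk down the list until you have accumulated half of the total assembled bases. A minimal sketch:

```python
def n50(lengths):
    """N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Toy example: total = 100 bases, so the halfway point is 50.
# 40 alone covers only 40; 40 + 30 = 70 >= 50, so N50 = 30.
print(n50([40, 30, 20, 10]))  # prints 30
```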

The supercontig N50 value is also low for platypus: 365Kb vs 10-13Mb for the other three genomes. One possibility explored in the paper was a relatively low amount of fosmid (~40Kb insert) data for platypus. Removing fosmids from chimp had little effect on contigs (as expected; it's not a huge contributor to the total amount of read data) but a significant effect on supercontigs, knocking the N50 from 13.7Mb to 3.3Mb -- which is still about 10X better than the platypus assembly.

Looking at the biggest examples, on contigs platypus did okay (245Kb vs. 226-442Kb for the other species) but again on supercontigs it is small: 14Mb vs. 45-51Mb. Again, removing fosmid data from chimp hurt the assembly, but the biggest chimp supercontig was still twice the size of the largest platypus supercontig.

A little bit of the trouble may have been various contaminating DNA. A number of sequences were attributable to a protozoan parasite of platypuses and some other random pools are presumably sample mistrackings at the genome center (which is no knock; no matter how good your staff & LIMS, there's a lot of plates shuffling about with Sanger genome projects).

So what went wrong? What generally goes wrong with assemblies?

The obvious explanation is repetitive elements, and platypus (despite having a smaller genome than human) appears to be no slouch in this department. But this has an implication: if repeats are killing further assembly, then most contigs should end in repetitive elements. Indeed, they should have a terminal stretch of repetitive stuff around the same length as the read length. I'm unaware of anyone trying to do this sort of accounting on an assembly, but I haven't really looked hard for it.
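The accounting itself would be easy enough to sketch. Assuming a soft-masked assembly (the usual RepeatMasker convention of writing repeats in lowercase), one could tally how many contigs end in a repeat-rich window; the window size and threshold below are my illustrative guesses, not values from the paper:

```python
def repeat_fraction(seq):
    """Fraction of soft-masked (lowercase) bases in a sequence --
    the common RepeatMasker convention for marking repeats."""
    if not seq:
        return 0.0
    return sum(c.islower() for c in seq) / len(seq)

def ends_in_repeat(contig, window=500, threshold=0.8):
    """Is either terminal window of the contig mostly masked?
    window ~ a Sanger read length; both parameters are guesses."""
    return (repeat_fraction(contig[:window]) >= threshold
            or repeat_fraction(contig[-window:]) >= threshold)

# Toy contigs: unique sequence uppercase, masked repeats lowercase.
contigs = ["ACGT" * 300 + "acgt" * 200,  # repeat-rich right end
           "ACGT" * 500]                 # no masked sequence
hits = sum(ends_in_repeat(c) for c in contigs)
print(f"{hits} of {len(contigs)} contigs end in repeats")
```

If repeats really are the limiting factor, that fraction should be high; if it isn't, something else is breaking the assembly.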

A second possibility is undersampling, either random or non-random. Random undersampling would simply mean more sequencing on the same platform would help. Non-random undersampling would be due to sequences which clone or propagate poorly in E.coli (for Sanger) or PCR amplify poorly (for 454, Illumina & SOLiD). If platypus somehow was a minefield of hard-to-clone sequences, then acquiring a lot of paired-end / mate pair data on a next gen platform (or simply lots of long 454 reads) might help matters. In the paper, only 0.04X 454 data was generated (and it isn't clear what the 454 read length was). Would piling on a lot more help?
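The random case can be bounded with Lander-Waterman statistics: under purely random sampling, the expected fraction of the genome with zero coverage is e^(-c). A quick sketch (using the 2.4Gb size estimate) suggests random undersampling at 6X leaves only a few megabases untouched, so it can't by itself explain hundreds of megabases of missing sequence; non-random effects would have to dominate:

```python
import math

def uncovered_fraction(coverage):
    """Lander-Waterman expectation: fraction of the genome
    receiving zero reads under purely random sampling = e^(-c)."""
    return math.exp(-coverage)

genome = 2.4e9  # bp, the larger platypus genome estimate
for c in (4, 6, 8):
    frac = uncovered_fraction(c)
    print(f"{c}X: {frac:.4%} uncovered (~{frac * genome / 1e6:.1f} Mb)")
```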

A related third possibility is assembly rot. Imagine there are regions which can be propagated in E.coli but with a high frequency of mutations. Different clones might have different errors, resulting in a failure to assemble.

In any case, it would be great for someone out there with some spare next-gen lanes to take a run at platypus. Even running one lane of paired-end Illumina would generate around 4-5X coverage of the genome for around $4K. For a little more, one could try using array-based capture to pull down fragments homologous to the ends of contigs, potentially bridging gaps. Even better would be to go whole hog with ~30X coverage from a single run for $50K (obviously not pocket change) and see how that assembly goes. Ideally a new run at the platypus would also include the other mammalian egg-layer, the echidna (well, pick one of the 4 species of them). Would it assemble any better, or are they both terrors for genome assemblers? Only the data will tell us.


Guy said...

"Assembly rot" ... great term for a real problem. We used to lump such events in with "contig poisoning", but your term captures the sense of errors accumulating over the course of the assembly.

christopher said...

Another factor in assembly in these situations (i.e. whole-genome shotgun) is the density of polymorphisms. Any diploid organism has the potential to have two different sequences at any given position: one sequence on the paternal chromosome and a different sequence on the maternal chromosome. These differences give assemblers fits.

Ideally there is a highly inbred individual for the species of interest from which you can get your genomic DNA; this minimizes the density of polymorphisms. In this case the genomic DNA was from "captured animals", not exactly the ideal case.

Steven Salzberg said...

Wow, so much to say here and so little time. First of all, you neglect to mention the biggest "possibilities" (or variables) in assembly: (1) the assembly software used, and (2) the team of people running it. I've published multiple papers on this topic and I'm not trying to blow my own horn here, but it really, really matters.
As a recent example, see our paper in Genome Biology on the assembly of the cow genome. Using different software and a different team, we created a dramatically better assembly than the one produced by the original sequencing center (Baylor). Assemblers are very large, complex programs, and they are not just push-button packages.

The assembler used for Platypus was PCAP, which is pretty good but not the best. The paper doesn't say who ran it, or how much time they spent on it, but if you just "pushed the button" you probably wouldn't get a very good result.

You mention repetitive elements - of course we have looked at repeats on the ends of contigs. We do it all the time. I haven't looked at platypus, though, and it would take quite a bit of work to do so.

Before throwing more data at the problem - as you suggest at the end - it might be much more productive to try harder to assemble the data we have. In addition, it's important to point out that almost no one can assemble a mixture of Illumina and Sanger sequence data, so before generating anything, you'd have to make sure you could actually use the data. The existing short-read assemblers will either (a) choke on the amount of data required for a monotreme, (b) be completely unable to merge the short-read and longer-read data, or (c) both.

That being said, I've never met an animal genome that we couldn't assemble pretty decently. It just takes time, sometimes lots of time.

Shiaw-Pyng said...

I did the assembly part of this platypus genome in late 2005.
It is good to know that, a year after the paper was published, people are interested in discussing the possible reasons for the less contiguous assembly of the platypus genome.

We started assembling the genome early, before it had reached the target redundancy coverage, and we knew it was more fragmented than all the other large-scale (billions of bases) and mid-scale (hundreds of millions of bases) genomes. It was probably the genome I spent the most time assembling of all the genomes I assembled at WUGSC; the period of looking into the assembly probably spanned more than a year.

As we put in the supplemental materials, we looked at the assembly from various angles, including genome coverage, assembly accuracy, rates of mis-assembly, read-depth distribution analysis of the whole assembly, repeat content analysis, chimeric read analysis, heterozygosity rate, cloning bias analysis, G+C content analysis, Theileria contamination, and alignment to finished BACs to look at the content of the missing gaps, etc. The contig ends, the missing gap regions, and the lowest read-depth regions are enriched for sequence higher in G+C; the regions flanking gaps, and the gaps themselves, were enriched in tandem repeats as well as in repeat-masked bases.

We tried to tune the assembly iteratively based on our analysis, and by comparing with other large genomes we re-assembled. The results would likely be different if a different team ran it with a different assembler. We aimed to generate a stringent assembly to begin with, rather than a long contiguous one, as we used other sources of data to check and link it back. It is hard to say which assembler is really the best; we participated in the Drosophila genome assembly workshop for various assemblers in early 2005, and the PCAP version we submitted was probably not the longest one, but in some categories, such as missing k-mer comparison or transcript and EST alignment comparison, it was among the best.

I think there is still much room for improvement in this assembly. I agree that it would be great if we could add more data from current next-gen sequencers to improve the platypus assembly. The small amount of 454 data we discussed in the supplemental notes was from the earlier 454 GS20 system; the current next-gen platforms, whether Roche FLX Titanium or Solexa, can generate better quality than that early system. I look forward to seeing the platypus genome improved in the future.

Shiaw-Pyng Yang

jcd said...

Re: christopher -- diploid organisms and polymorphism density.

Perhaps it would be interesting, computationally, to construct an index of polymorphism density using data from inbred strains vs. wild types for diploids.