The impact of diploidy was something I'll confess hadn't crossed my mind; I think the topic of different assembly algorithms did briefly flit through my head. As a bioinformatician, I'm a bit red-faced to have ignored the impact of better algorithms. It is a bit surprising to learn that there is still a bit of art to assembly.
I also wanted to throw in something else which crossed my mind, but somehow dropped out of the final cut. Now, part of the trouble is that all I have to work with here is a small blurb in some promotional material from Illumina which makes the following claim about the assembly of the panda genome:

"Using paired reads averaging 75 bp, BGI researchers generated 50X coverage of the three-gigabase genome with an N50 contig size of ~300 kb."

If this were done purely from the next-gen data it is remarkable; this is a contig N50 similar to the supercontig N50 for platypus. Is panda just an easier genome? Does the 50X oversampling (as opposed to 6X for platypus) make the difference? Or is it mostly due to the very clever current breed of algorithms? How much worse would the assembly be if the same amount of data came from unpaired reads?
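Since N50 is carrying the weight of that comparison, it is worth pinning the statistic down: sort the contigs from longest to shortest and report the length of the contig at which the running total first covers half the assembly. Here is a minimal Python sketch of that calculation; the toy contig lengths are invented purely for illustration.

```python
def n50(lengths):
    """Return the N50 of a set of contig lengths: the length L such that
    contigs of length >= L together contain at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy example: five contigs totalling 1,000 kb. The 400 kb and 300 kb
# contigs together reach the 500 kb halfway point, so the N50 is 300 kb.
print(n50([400_000, 300_000, 150_000, 100_000, 50_000]))  # -> 300000
```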
With luck, once the panda genome is published all the underlying read data will be as well, which will mean many of these questions can be addressed computationally; the last three questions above are all ripe for testing.
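Each of those tests (downsampling the reads to 6X, discarding pairing information, swapping assemblers) would yield a new set of contigs, and the comparison then reduces to scoring each contig set the same way. A small script along the following lines would do for the scoring step; this is only a sketch, the script and file names are hypothetical, and it assumes the assembler emits contigs as plain FASTA.

```python
import sys

def n50(lengths):
    """The same statistic as above: the contig length at which the sorted
    cumulative sum first reaches half of the total assembly size."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def contig_lengths(fasta_path):
    """Yield sequence lengths from a plain FASTA file of assembled contigs."""
    length = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if length:
                    yield length
                length = 0
            else:
                length += len(line.strip())
    if length:
        yield length

if __name__ == "__main__":
    # Usage (hypothetical file name): python score_assembly.py contigs.fa
    lengths = list(contig_lengths(sys.argv[1]))
    print(f"contigs: {len(lengths):,}  N50: {n50(lengths):,}")
```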