On average we each are missing 123 kb homozygously. An incomplete genome is the norm. What a goofy species we are.
I'm horribly remiss in tracking the CNV literature, but this comment makes me wonder whether this is atypical at all. How extensively has this been profiled in other vertebrate species and how do other species look in terms of the typical amount of genome missing? I found two papers for dogs, one of which features a former lab mate as senior author and the other one has Evan Eichler in the author list. Some work has clearly been done in mouse as well.
Presumably there is some data for Drosophila, but how extensive? Are folks going through their collections of D. melanogaster gathered from all over the world and looking for structural variation? With a second-gen sequencer, this would be straightforward to do -- though a lot of libraries would need to be prepped! Many flies could be packed into one lane of Illumina data, so this would take some barcoding. Even cheaper might be to do it on a Polonator (reputed to cost about $500 in consumables per run, not including library prep).
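To put a rough number on "many flies per lane", here is a back-of-the-envelope sketch in Python. The lane yield and target coverage are placeholder assumptions of mine, not figures for any particular instrument, and the genome size is only approximate; swap in real values before trusting the result.

```python
def samples_per_lane(lane_yield_bp, genome_size_bp, target_coverage):
    """How many barcoded samples fit in one lane at a given mean coverage."""
    bases_per_sample = genome_size_bp * target_coverage
    return lane_yield_bp // bases_per_sample


# All three values below are illustrative assumptions.
LANE_YIELD_BP = 3_000_000_000  # hypothetical usable bases from one lane
FLY_GENOME_BP = 180_000_000    # approximate D. melanogaster genome size
TARGET_COVERAGE = 2            # low sequence coverage per fly

flies = samples_per_lane(LANE_YIELD_BP, FLY_GENOME_BP, TARGET_COVERAGE)
print(f"{flies} barcoded flies per lane")
```

With paired-end or mate-pair libraries, what matters for spotting rearrangements is how often a breakpoint is spanned by a pair rather than raw base coverage, so low sequence coverage per fly may go further than this simple division suggests.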
Attacking this by paired-end/mate-pair NGS rather than arrays (which have been the workhorse so far) would enable detecting balanced rearrangements, which arrays are blind to (though another tweeted item has Eichler stating "Folks you can't get this kind of information from nextgen sequencing; you need old-fashioned capillaries" -- I'd love to hear the background on that). That leads to another proto-thought: will the study of structural variation lead to better resolution of the conundrum of speciation and changes in chromosome structure -- i.e. it's easy to see how such rearrangements could lead to reproductive isolation, but not easy to see how they could be non-isolating enough to allow for enough founders.
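For anyone wondering how read pairs reveal balanced rearrangements when arrays cannot, here is a minimal Python sketch of the usual discordant-pair logic. It assumes a standard paired-end library in which a normal pair maps to the same chromosome with the left read on the plus strand and the right read on the minus strand; mate-pair libraries use a different expected orientation. The `ReadPair` type, the labels, and the span thresholds are all hypothetical, and real callers cluster many supporting pairs rather than trusting a single one.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str


def classify_pair(pair, min_span, max_span):
    """Label one mapped read pair by the kind of variant it hints at."""
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"  # possible translocation
    if pair.strand1 == pair.strand2:
        return "inversion"  # both reads on the same strand
    # Find the strand of whichever read maps further left.
    left_strand = pair.strand1 if pair.pos1 <= pair.pos2 else pair.strand2
    if left_strand == "-":
        return "everted"  # reads point away from each other
    span = abs(pair.pos2 - pair.pos1)
    if span > max_span:
        return "deletion"  # reads map further apart than the library allows
    if span < min_span:
        return "insertion"  # reads map closer together than expected
    return "concordant"


pairs = [
    ReadPair("2L", 1000, "+", "2L", 1300, "-"),
    ReadPair("2L", 1000, "+", "2L", 9000, "-"),
    ReadPair("2L", 1000, "+", "2L", 5000, "+"),
    ReadPair("2L", 1000, "+", "3R", 5000, "-"),
]
labels = [classify_pair(p, min_span=200, max_span=400) for p in pairs]
print(labels)
```

Inversions and translocations change where sequence sits without changing copy number, which is why the orientation and chromosome checks catch events that an array's intensity signal would miss.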
Eichler's point was pretty uncontroversial - using short-read sequencing (even with paired-end approaches) there's a fair chunk of the genome that simply can't be accessed due to its extreme repetitiveness, and currently the only technology that can effectively dig into these regions is cloning and capillary sequencing. Of course, even this approach breaks down for the largest and most recent segmental duplications (i.e. those with the longest stretches of identity), which can't be easily mapped with any current technology.
I guess we'll have to wait for super-long read third-gen approaches to dig into these areas; with, say, 3000 bp single molecule reads we should be able to assemble over all but the nastiest of these regions of the genome.
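The read-length argument can be sketched as a simple rule of thumb in Python: a single read resolves a repeat only if it covers the whole repeat plus some unique flanking sequence on each side. This assumes the repeat copies are perfectly identical, which overstates the problem for older, diverged duplications. The 50 bp anchor and the repeat lengths below are illustrative assumptions.

```python
def read_can_bridge(read_len_bp, repeat_len_bp, anchor_bp=50):
    """True if one read can cover a repeat plus a unique anchor on each side."""
    return read_len_bp >= repeat_len_bp + 2 * anchor_bp


# Hypothetical repeat lengths, in bp.
for repeat_len in (300, 2500, 20000):
    short = read_can_bridge(100, repeat_len)
    long_ = read_can_bridge(3000, repeat_len)
    print(f"{repeat_len} bp repeat: 100 bp read={short}, 3000 bp read={long_}")
```

Under this rule a 3000 bp read clears repeats up to about 2.9 kb, while the largest recent segmental duplications stay out of reach.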