Thursday, April 08, 2010

Version Control Failure

Well, it was inevitable. A huge (and expensive) case of confused human genome versions.

While the human genome sequence is 10 years old, that was hardly the final word. Lots of mopping up remained on the original project, and periodically new versions would be released. The 2006 version, known as hg18, became very popular and is the default used by many sites. A later version (hg19) came out in March 2009 and is favored by other services, such as NCBI. UCSC supports all of them, but appears to have a richer set of annotation tracks for hg18. It isn't over yet: not only are there still gaps in the assembly (not to mention the centromeric badlands that are hardly represented), but with further investigation of structural variation across human populations it is likely that the reference will continue to evolve.

This is fraught with danger! Pairing an annotation track from one version with coordinates from a different version produces very confusing results. Curiously, while the header line for the popular BED/BEDGRAPH formats has a number of optional fields, tagging a track with its genome version is not one of them. Software problems are one thing; doing experiments based on mismatched versions is another.
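To make the idea concrete, here is a minimal Python sketch of what an upload check could look like if the BED/BEDGRAPH track line carried, say, a genome= attribute. To be clear, genome= is a hypothetical tag invented for this illustration; the real track-line spec has no such field, which is exactly the complaint.

```python
# Sketch: validate a HYPOTHETICAL genome= attribute on a BED track line.
# The genome= tag is an assumption for illustration; the actual
# BED/BEDGRAPH track line has no standard genome-version field.
import shlex


def track_line_build(line):
    """Return the value of a genome= attribute on a track line, or None."""
    if not line.startswith("track"):
        return None
    for field in shlex.split(line)[1:]:
        key, _, value = field.partition("=")
        if key == "genome":
            return value
    return None


def check_upload(first_line, expected_build):
    """Reject a track whose tagged build doesn't match the design system's build."""
    build = track_line_build(first_line)
    if build is None:
        raise ValueError("track line carries no genome= tag; cannot verify build")
    if build != expected_build:
        raise ValueError(f"track tagged {build} but design system expects {expected_build}")


# A mismatched upload would be caught before any array gets built:
check_upload('track name="capture" genome=hg18', "hg18")  # passes silently
```

With a check like this at upload time, the hg19-vs-hg18 mixup described below would have been an error message rather than a wasted array.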

What came out in GenomeWeb's In Sequence (subscription required) is that the ABRF (a professional league of core facilities) had decided to study sequence capture methods and had chosen to test Nimblegen's & febit's array capture methods along with Agilent's in solution capture; various other technologies either weren't available or weren't quite up to their specs. I do wish ABRF had tested Agilent in both in solution and on array formats, as this would have been an interesting comparison.

What went south is that the design specification uploaded to Agilent used hg19 coordinates, but Agilent's design system (until a few days ago) used hg18. So the wrong array was built and then used to make the wrong in solution probes, and when ABRF aligned the data, it was off. How far off depends on where you are on the chromosome: the farther down the chromosome, the more likely it is to be off by a lot. ABRF apparently got a good amount of overlap, but there was an induced error.

I haven't yet found the actual study; I'm guessing it hasn't been released. If the GenomeWeb article is accurate, then in my opinion it is not kosher to grade the Agilent data according to the original experimental plan, since that plan wasn't followed. Either the Agilent data should be evaluated consistently with the actual design in its entirety, OR the comparison of the platforms should be restricted to the regions that actually overlap between the actual Agilent design and the intended design.

In any case, I would like to put in a plug here that ABRF deposit the data in the Short Read Archive. Too few second generation sequencing datasets are ending up there, and targeted sequencing datasets would be particularly valuable from my viewpoint. Granted, a major issue is around confidentiality and donors' consent for their DNA, which must be strictly observed. Personally, I believe that if you don't deposit the dataset your paper should state this -- in other words, either way you must explicitly consider depositing, and if you don't, explain why it wasn't possible. The ideal of all data going into public archives was never quite perfect in the Sanger sequencing days, but we've slid far from that in the second generation world.

I had an extreme panic attack one day due to similar circumstances -- taking a quick first look at a gene of extreme interest in our first capture dataset, it looked like the heavily captured region was shifted relative to my target gene -- and that my design had the same shift. Luckily, in my case it turned out to be all software -- I had mixed up versions, but not in the actual design, and so that part of the capture experiment was fine (I wish I could say the same about the results, but I can't really talk about them). I now make sure the genome version is part of the filename of all my BED/BEDGRAPH files to reduce the confusion, and I manually BLAT some key sequences to try to catch mismatches. While those are useful practices, I strongly believe that there should be a header tag which (if present) is checked on upload.
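The build-in-the-filename convention is easy to enforce in a pipeline. Here is a small sketch, under the assumption that files are named like design.hg18.bed or coverage.hg19.bedgraph (the naming pattern and function names are mine, not from any standard):

```python
# Sketch: enforce a genome-build-in-filename convention for BED/BEDGRAPH files.
# Assumes names like "design.hg18.bed" or "coverage.hg19.bedgraph".
import re

BUILD_RE = re.compile(r"\.(hg\d+)\.(?:bed|bedgraph)$", re.IGNORECASE)


def build_from_filename(path):
    """Extract the genome build token (e.g. 'hg18') from the filename."""
    match = BUILD_RE.search(path)
    if match is None:
        raise ValueError(f"{path}: no genome build in filename; refusing to guess")
    return match.group(1).lower()


def assert_same_build(*paths):
    """Refuse to pair files annotated against different genome versions."""
    builds = {build_from_filename(p) for p in paths}
    if len(builds) > 1:
        raise ValueError(f"mixed genome builds: {sorted(builds)}")


assert_same_build("design.hg18.bed", "coverage.hg18.bedgraph")  # OK
```

Pairing design.hg19.bed with coverage.hg18.bedgraph would raise immediately, which is a lot cheaper than a panic attack over a shifted capture region.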


bishnu said...

I'm not sure I follow.

I'm fairly confident they did not upload hg19 genomic intervals into Agilent's system. I've done that myself, and it led to complete junk; I'm extremely skeptical that doing that would yield the 70% coverage reported in the article.

What seems more likely is that they took probes designed for hg19 and BLASTed them against the hg18 genome. That will mess up the spacing of probes a little, but Agilent's claim that this alone accounts for the lack of coverage seems a little suspicious to me.

But yeah, either way, it's very sloppy work, and without the actual study we can only speculate.

sm said...

My understanding is that Agilent's hg19 version of the probes is much more efficient, and they wish to use that for a better comparison!

I agree about the public datasets. It's really hard to go from one of these nextGen papers to their data!

Wei said...

I had a similar experience when trying to replicate other people's work. I found that is a pretty good tool.