Monday, December 28, 2009

Length matters!

I was looking through part of my collection of papers using Illumina sequencing and discovered an unpleasant surprise: more than one does not seem to state the read length used in the experiment. While to some this may seem trivial, I had a couple of reasons for wanting it. First, it's useful for estimating what can be done with the technology, and second, since read lengths have been increasing, it offers a rough guesstimate of when an experiment was done. Of course, there are lots of reasons to carefully pick read length -- the shorter the length, the sooner the instrument can be turned over to another experiment. Indeed, a recent paper estimates that for RNA-Seq, IF you know all the transcript isoforms and you are interested in measuring transcript levels, then 20-25 nucleotides is quite sufficient (they didn't, for example, discuss the ideal length for mutation/SNP discovery). Of course, that's a whopping "IF", particularly for the sorts of things I'm interested in.

Now in some cases you can back-estimate the read length using the given statistics on numbers of mapped reads and total mapped nucleotides, though I'm not even sure these numbers are reliably showing up in papers. I'm sure to some authors & reviewers they are tedious numbers of little use, but I disagree. Actually, I'd love to see each paper (in the supplementary materials) show its error statistics by read position, because I think it would be interesting to watch how these evolve. Plus, any lab not routinely monitoring this plot is foolish -- not only would a change reveal important quality control information, but the plot also serves as an important reminder to factor quality into how you use the data. It's particularly surprising that the manufacturers do not have such plots prominently displayed on their websites, though of course those would be suspected of being cherry-picked. One I did see from a platform supplier had a horribly chosen (or perhaps deviously chosen) scale for the Y-axis, so that the interesting information was so compressed as to be nearly useless.
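For what it's worth, the back-estimate mentioned above is just a division. Here is a minimal Python sketch, assuming a paper reports both the total mapped nucleotides and the number of mapped reads; the function name and the example numbers are made up for illustration. Keep in mind the result is an average mapped length, so any trimming or partial alignment would pull it below the true read length.

```python
def estimate_read_length(total_mapped_bases, mapped_reads):
    """Average mapped length per read (hypothetical helper, not from any package).

    This is only a lower-bound-ish guess at the sequenced read length:
    trimmed or partially aligned reads reduce the average.
    """
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return total_mapped_bases / mapped_reads


# Made-up numbers: 360 million mapped bases from 10 million mapped reads.
print(estimate_read_length(360_000_000, 10_000_000))
```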

I should have a chance in the very near future to take a dose of my own prescription. On writing this, it occurs to me that I am unaware of widely available software to generate the position-specific mismatch data for such plots. I guess I just gave myself an action item!
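To make the idea of position-specific mismatch data concrete, here is a rough Python sketch of the tally behind such a plot. It is not an existing tool, and it makes some big simplifying assumptions: each alignment is ungapped and arrives as a (read, reference) pair of strings, with the read already oriented in the direction it was sequenced so that index 0 is the first cycle. The function name is hypothetical, and indels and quality scores are ignored entirely.

```python
from collections import Counter


def mismatch_rate_by_position(aligned_pairs):
    """Return a list of mismatch rates, one per read position (0-based).

    aligned_pairs: iterable of (read, reference) strings, assumed to be
    ungapped alignments in sequencing orientation. Positions where either
    base is N are skipped. A position with no usable bases yields None.
    """
    mismatches = Counter()
    totals = Counter()
    for read, ref in aligned_pairs:
        # zip stops at the shorter string, so ragged pairs are tolerated
        for pos, (r, g) in enumerate(zip(read.upper(), ref.upper())):
            if r == "N" or g == "N":
                continue
            totals[pos] += 1
            if r != g:
                mismatches[pos] += 1
    n_positions = max(totals, default=-1) + 1
    return [
        mismatches[p] / totals[p] if totals[p] else None
        for p in range(n_positions)
    ]


# Toy example: the only mismatch is at the last position of the second read.
pairs = [("ACGT", "ACGT"), ("ACGA", "ACGT"), ("ACG", "ACG")]
for pos, rate in enumerate(mismatch_rate_by_position(pairs)):
    print(pos + 1, rate)
```

Plotting the returned list against read position, with a Y-axis scaled so the differences are actually visible, gives the sort of plot I have in mind.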

2 comments:

Dan said...

Keith,

I completely agree with you - just yesterday I read a paper where Illumina data was analyzed, and had to dig into the supplemental information to find a table where the read length was mentioned. This should always be part of the main text of a peer-reviewed publication.

Your idea about an error-rate-by-read-position plot is a good one, though I can guess why NGS vendors aren't showing them prominently. In my opinion, every publication that uses NGS data should include a table (preferably NOT a supplemental table) that includes:
-Platform
-Protocol
-Runs/Lanes/Regions
-Read length
-Number of reads generated, mapped, and unique (after deduplication).
-Number of bases generated, mapped, and unique (after deduplication).

A table with the above information would be incredibly useful to us and others who generate and analyze massively parallel sequencing data.

Yours,

Dan Koboldt
The Genome Center at Washington University
dkoboldt [at] genome.wustl.edu
http://www.massgenomics.org

Farhat said...

I agree this is very useful information as well. About a year back, I did this analysis for Illumina reads, with the results summarized at http://coding.plantpath.ksu.edu/stgp/STGP_research_results.html. You have to scroll down to the very bottom to get the quality information. We saw a fairly linear drop-off in quality with position.