Friday, January 22, 2016

Does any analytical program really care about the order of paired end files?

I was recently experimenting with C. Titus Brown and company's khmer package and hit an interesting little snag.  First, I had my usual problems with installing a Python-based program, which were solved by the totally counter-intuitive absurdity of actually following the installation directions precisely. Armageddon is certainly near if random shortcuts and assumptions can't be relied on to get the job done.  But once I had it working for a simple test case, I crazily tried to build something -- and that's when a new, maddening bug cropped up.


So I put together a new Perl program (yes, I'm in danger of being labeled an old dog) to feed data into khmer.  I have a handy library which knows how we store FASTQ files in the filesystem, so I can call up desired files with a few lines of code.  Then I ran into the great FASTQ Religious Divide.

Technologies such as Illumina paired ends, or mate pairs on any short read platform, create pairs of files.  One is typically labeled with "R1" in the filename and one with "R2".  The relationship between them is something you must track yourself, since FASTQ tried to be just as simpleminded as FASTA with no metadata,  though without the exculpatory background of running on machines with tens of kilobytes of RAM.  Particularly important is tracking what the relationship is between those ends and how they were generated, since paired ends generate reads that face into each other (→ mate pairs often face away from each other  (← ), though other permutations exist [corrected from original ASCII art that misfired].

Anyway, you can divide all FASTQ-consuming into 4 categories.  The first is those that know nothing and care nothing about read pairing.  We'll ignore those.  That leaves 

1) Those that want 2 paired end files, one forward and one reverse (2fers)
2) Those that require a single file, alternating between read 1 and read 2 (interlacers)
3) Those that can deal with either format (mugwumps).

For whatever reason, I've mostly dealt with 2fer programs or mugwumps.  True pain is a pipeline that alternates between 2fers and interlacers, as you must constantly convert.  A small, related pain is that many programs which filter may filter out (or filter in) one but not both reads of a pair, so often one must keep filtering files to pull out the unpaired reads and treat them separately.  This typically means various subsets, modifications and reformats of your data are cluttering your working directory, which you need during pipeline operation (and debugging/tuning) but devour filesystem space.  Makes me really long for a more relational approach to FASTQ data (which I think I've floated before) in which trims and selections are stored in tables, rather than having to change the data (yes, read correction, merging, etc would probably need new copies of the data).

Back to our story. In this case, khmer is an interlacer, so I had to deal with that.  Actually, as with many such packages in the interlacer camp, a program is provided to interlace reads, and like any proper FASTQ consumer it can deal with both compressed and uncompressed files.  I tried it at the command line and it worked.

So now I put it in my program and try to run, and the program fails out complaining that my files don't look like a proper pair.  But not on every invocation; for some of my filesets it worked and others it didn't.  I start checking things again at the command line, executing what I think the program is generating for a command line or something at least equivalent, and it works. 

Of course, the solution was for the program to actually spit out the command line it was attempting.  Which looked fine, as it had the right two files.  After much staring and grunting (apologies to those in our open office for the latter), it finally hit me -- for once I hadn't sorted the filenames, and so in a random subset the R2 files preceded the R1 file in the command line, and that khmer would not stand for.  Most of my prior code contained a sort function call on the files, and probably khmer was the first program to notice any out-of-order when I didn't.

I mentioned this on the khmer mailing list, and Titus responded that of course it did this, as why wouldn't you give R1 first and R2 second.  

So, does it (or should it) matter?  As I wrote about recently, R2 data is definitely not indistinguishable from R1 data; R2 tends to be worse and in some cases shockingly so.  So that would argue that R1 then R2 ordering might convey important information.  Conversely, quality information is captured in the quality scores, so perhaps it shouldn't matter.

But that's really being indirect.  The clear question is whether any tools out there actually treat R1 data different than R2 data.  For example, I could imagine an sequencing adapter trimmer being biased, since the adapter sequences are in a definite orientation with regard to the R1 and R2. Or perhaps a quality-based trimmer more aggressively trims R2, knowing that it tends to be worse and perhaps this isn't actually fully captured in quality scores.  Or a paired-end read merger, when all other things are equal, uses R1 vs R2 as a tie-breaker in deciding the consensus base in the overlap region.

So, are there examples out there are programs which really do treat R1 differently than R2 reads?  Beyond the khmer interlacer?







5 comments:

Guy said...

Keith, you wrote:
Particularly important is tracking what the relationship is between those ends and how they were generated, since paired ends generate reads that face into each other (--> <-- away="" each="" face="" from="" mate="" nbsp="" other="" pairs="" typically="" whereas="">).

Is the parenthetical part a typo of some sort (using left and right angle brackets as arrowheads is a real pain in html), or do you just find keeping track of the whole mate pairs/paired ends/f-r/r-f/r-r/f-f kerfuffle as infuriating as I do?

Keith Robison said...

Guy: Ugh, yes I tried to use ASCII art in a foolish way. Now to go edit that...

iceman said...

Definitely in downstream tools the read # matters, say for duplicate detection (certain types), if there are inline barcodes, or strand-specific sequencing. Adapter trimming is another one. There are others. Feed in R1 before R2 and all will be forgiven.

Unknown said...

RNAseq aligners and bisulfite-sequencing aligners might (more likely "will" for BS-seq aligners) both break if you swap read #1 and #2.

Mark said...

While in a theory there shouldn't be much difference in the quality of the R1 / R2 reads, anybody who had tried processing some MiseQ's 2x250 or 2x300 bp runs (esp from some slightly dodgy reagent batches) would tell you that the difference can be huge due to cluster expansion between cycles and other factors, which cause the the R2 reads to fall in quality much more sooner/deeper.

Quite often it is beneficial to add 20-30 cycles to R1 from R2 (if you have single end barcoding).

Storage wise - have a common repository with CONSISTENT raw fastq naming schema (can be symlinks) and have a consistent fastq_dir structure for the intermediate fastq processing results within each project.

I always call R1/R2 specifically, newer rely on a chance...

Usually different analysis like de novo, mapping, have different optimums for best input format/preprocessing.

Also not all of them take compressed files, so the use of 2-4TB drives as a local worknodes scratch disks is quite common :-)