For those who don't spend their days with such data, a bit of historical backdrop. When I was first exposed to bioinformatics in the early 90s, there were many formats for sequence data. Pretty much every sequence analysis suite and public database had its own distinct format. The vast majority of these were designed for human readability first and computer parsing second, and tended to organize text into columns with carefully allotted spaces. Creating files in such formats was a dicey thing, because formal specifications of the formats were not common; you generally just tried to create something that looked like what you had seen and tried it out. That worked often, but failures were rampant. One of my first graduate bioinformatics projects involved a large-scale parse of features from GenBank, and I ended up spending a lot of time detecting and curating various inconsistencies.
FASTA format was a very different beast, strongly going for simplicity. Anyone could create a file in FASTA format, as the format was simply a line with a '>' followed by the sequence identifier and optionally a description, and then the sequence on the remaining lines.
>sequenceId Description goes here
ACGTACGTACGT
Again, formal specifications did not exist, so various programs would complain about variations within that theme. Some programs would tolerate a space between the angle bracket and the identifier; others would not. BLAST's database preparation program insisted that the sequence lines (other than the last for each sequence) have the same number of characters; other programs didn't care.
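Despite the ambiguities, the core format is simple enough that a parser fits in a few lines. Here is a minimal sketch in Python (a hypothetical `parse_fasta` helper, not from any particular tool) that takes the strict view: '>' must be the very first character of a header line, and the sequence may span any number of following lines:

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-format lines."""
    header, seq_parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq_parts)
            header, seq_parts = line[1:], []
        elif line:
            seq_parts.append(line)
    if header is not None:
        yield header, "".join(seq_parts)

records = list(parse_fasta([
    ">sequenceId Description goes here",
    "ACGTACGT",
    "ACGT",
]))
```

Real pipelines should of course lean on a library such as Biopython, which has already argued out the edge cases (spaces after the bracket, blank lines, and so on).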
FASTA became the dominant open format for sequences, perhaps driven by BLAST using it, and perhaps by the simple fact that anyone with a text editor could create a file. When the idea of assigning quality values to DNA sequences arose (from phred, I believe), the format was borrowed and adjusted to fit:
>sequenceId Description goes here
10 20 12 13 14 18 17
Again, the trend was toward human readability, though it was even less likely that any human would read these values than the sequences. Most commonly, each sequencer trace would be converted to one FASTA file and one quality file. If a lot of files were a nuisance, then all the sequence files and all the quality files could each be concatenated into a pair of files, with care taken that the order of sequence ids was the same in each.
FASTQ was a clever leap forward, driven largely by the much larger volumes of next-gen data. FASTQ combines the two files into a single file, with alternating sequence and quality info. The really clever change was to eliminate the human readability of the quality values and instead go for a compact storage by converting each quality value to a character by a simple formula. Alas, two different encodings have been used, though only one really persists.
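The "simple formula" is just an ASCII offset: in the Sanger-style encoding that survives today, a quality value Q becomes the character `chr(Q + 33)`; the older Illumina encoding used an offset of 64 instead. A small sketch, using the quality values from the example above:

```python
def phred_to_char(q, offset=33):
    """Encode one Phred quality value as a single character (Sanger offset 33)."""
    return chr(q + offset)

def char_to_phred(c, offset=33):
    """Decode one quality character back to its Phred value."""
    return ord(c) - offset

quals = [10, 20, 12, 13, 14, 18, 17]
encoded = "".join(phred_to_char(q) for q in quals)   # seven characters, one per value
decoded = [char_to_phred(c) for c in encoded]        # round-trips to the original list
```

Seven human-readable numbers collapse into seven characters, which is exactly the compactness win FASTQ was after; the cost is that a quality string is meaningless to the eye, and ambiguous unless you know which offset was used.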
Okay, so far this is all good. So what is the trouble with FASTQ? The trouble is that the files have a tendency to multiply like tribbles!
Take a typical Illumina run. As I get the data back from my sequence provider, I have two files per sample, one for each read (for really big runs each set is split into multiple files). Many times, these have been compressed with zip, gzip or bzip. On the UNIX side, I always unzip the zip files, and bzip files are a rarity, but gzip files can sometimes be left zipped. The key is what program downstream is using them; many can take zipped files, others can't. Those that can differ in how they detect them; some require command-line switches and others just recognize the canonical .gz extension.
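The more robust detection approach is to sniff the two gzip magic bytes rather than trust the extension. A quick sketch of a hypothetical helper (not from any particular tool):

```python
import gzip
import os
import tempfile

def open_maybe_gzipped(path):
    """Open a text file transparently, sniffing the gzip magic bytes
    (0x1f 0x8b) instead of trusting the .gz extension."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")

# Quick demonstration with a throwaway pair of files holding one FASTQ record.
tmpdir = tempfile.mkdtemp()
plain = os.path.join(tmpdir, "reads.fq")
zipped = os.path.join(tmpdir, "reads.fq.gz")
with open(plain, "w") as fh:
    fh.write("@read1\nACGT\n+\nIIII\n")
with gzip.open(zipped, "wt") as fh:
    fh.write("@read1\nACGT\n+\nIIII\n")
same = open_maybe_gzipped(plain).read() == open_maybe_gzipped(zipped).read()
```

Wrapping every downstream file open in something like this is one way to stop caring whether a given intermediate was left compressed.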
Now, some programs can use that paired end information, but those that do differ in how they want it. One set of programs demands that you give the two FASTQ files separately (with BWA this is typically in two separate command lines, one for each side; they are combined later in the process). But another set of programs requires you to shuffle the two together into an interleaved format, so that the first read from fragment 1 is followed by the other read from fragment 1, and so forth. Very few programs will accept either; the Ray assembler is a notable exception.
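The shuffle itself is mechanically trivial; assuming both input files list the same fragments in the same order, it is just a matter of alternating 4-line records. A minimal sketch (hypothetical helpers, not from any published tool):

```python
import io

def fastq_records(handle):
    """Yield one 4-line FASTQ record at a time from a file-like object."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return
        yield record

def interleave(r1, r2, out):
    """Write records from two paired FASTQ handles in alternating order.
    Assumes both inputs list the same fragments in the same order."""
    for rec1, rec2 in zip(fastq_records(r1), fastq_records(r2)):
        out.writelines(rec1)
        out.writelines(rec2)

# Tiny in-memory demonstration with one fragment.
r1 = io.StringIO("@frag1/1\nACGT\n+\nIIII\n")
r2 = io.StringIO("@frag1/2\nTTTT\n+\nIIII\n")
out = io.StringIO()
interleave(r1, r2, out)
```

Of course, "mechanically trivial" is also the problem: nothing in the format verifies the pairing, so if the two inputs ever drift out of sync, the interleaved file is silently wrong.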
So, if you have to shuffle them you now have a second copy of all your FASTQ files. Before or after, perhaps you want to apply some trimming, filtering or error correction tools, each of which is typically going to generate yet another version of the FASTQ files. If you are trying out various permutations or parameters of these various programs, it is not hard to accumulate a small zoo of FASTQ files. Well, not really a small zoo; they're all likely to be quite large. A few programs can at least output compressed data, but that isn't much use unless the next program will accept it.
Worse, if successive stages of your pipeline require different paired end formats, then you really have fun! For example, I was recently trying the MUSKET error corrector prior to applying the FLASH paired end combiner. MUSKET actually doesn't care about paired ends, but will only generate a single FASTQ file no matter how many inputs you give it. FLASH on the other hand needs the two reads in separate files. So, I interleaved the sequences for MUSKET, processed them and then de-interleaved them for FLASH. The first time through I generated the interleaved file but forgot to use it, meaning I fed FLASH essentially a random pairing of forward and reverse reads!! Until I'm done playing, it seems like each dataset has a growing number of slightly different FASTQs lying around. Once I try assembling them, that data ends up (one way or another) making more copies, as the read and quality data ends up in an assembly file or BAM file!
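The de-interleaving step in that MUSKET-to-FLASH dance is the mirror image of the shuffle: walk the combined file in 4-line records and deal them alternately to two outputs. Another hedged sketch, assuming the interleaved file strictly alternates read 1, read 2:

```python
import io

def deinterleave(interleaved, out1, out2):
    """Split an interleaved FASTQ stream back into two per-side outputs.
    Assumes strict alternation: read 1, read 2, read 1, read 2, ..."""
    targets = (out1, out2)
    index = 0
    while True:
        record = [interleaved.readline() for _ in range(4)]
        if not record[0]:
            break
        targets[index % 2].writelines(record)
        index += 1

# In-memory demonstration with a single interleaved fragment.
mixed = io.StringIO(
    "@frag1/1\nACGT\n+\nIIII\n"
    "@frag1/2\nTTTT\n+\nIIII\n"
)
side1, side2 = io.StringIO(), io.StringIO()
deinterleave(mixed, side1, side2)
```

Note the assumption doing all the work: if an upstream filter dropped one mate of a pair without dropping the other, this dealing scheme scrambles every pair after it, which is the same class of mistake as feeding FLASH the wrong files.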
In an ideal world, perhaps I wouldn't need so many copies -- every program could accept every flavor of paired end read assortment, or better yet the reads would live once in a database and programs could extract the right version (with trimming and edits) on the fly. Of course, loading such a database might take a lot of time, and perhaps if I settled on a pipeline I'd just clean out each intermediate once the next stage completes.
If you're not scared of being buried up to your neck in FASTQ (which, unlike tribbles, are neither cuddly nor make calming trills), Warp Drive is looking for someone like you. We're getting set to move to a larger starbase on the other side of Kendall Square, which opens the possibility of expanding our staff. We're looking for a computational biologist with experience in NGS data manipulation (ideally with a heavy emphasis on de novo assembly). Details on applying can be found on our website, though there isn't really a good job description there (yet).