Wednesday, August 26, 2015

The Road to Hell is Paved with Bioinformatics Formats

If you really want to raise a bioinformaticist's blood pressure, loudly declare your new tool generates output in brand new data formats.  This leads to the frequent observation that a large fraction of bioinformatics work is simply converting formats. It is probably consensus that the field is awash in too many formats, though it is equally clear that we can't agree on which should survive.  Between some recent news and a Twitter thread on the subject that erupted last night, there was a bunch of fodder for me to collect in a Storify -- and to lay out my own idiosyncratic views.

For example, today Pacific Biosciences announced at their bioinformatics conference that they are moving off HDF5 for read data and will go to unaligned BAM.  For the point of view of existing bioinformatics tools, that's a win -- many tools can consume BAM.  Except, of course, most tools that take in unaligned data.  And while BAM has its merits, metadata is stored in a very simple tag-value format.  HDF5 had the problem of a lot of tools aren't well developed; one reason I tried out Julia last year was to deal with the HDF5 files from Oxford Nanopore's MinION; Perl' HDF5 library choked on them. 

One sentiment you'll find in the Storify is a hope that Oxford will abandon HDF5 - but it is certainly not my wish.  HDF5 is a sophisticated structured (and compressed) format, enabling a rich representation of metadata.  For Nanopore and probably many future sequencing technologies, converting to FASTQ or BAM would lose a lot of information.  Indeed, tools which use signal-level data from MinION are already appearing, such as nanopolish.  

Sadly, the history of bioinformatics seems to be littered with an aversion to richly-structured data formats.  NCBI tried to push ASN.1 as a format for Genbank back when I was a graduate student, but as far as I can tell it never caught on outside NCBI.  XML, which is similar, seems to have made limited headway, showing up in schemes such as Systems Biology Markup Language (SBML), but seemingly avoided as often as used.  Actually, there is a phylogenetic tree file I was playing with the other day in NeXML format, which is great -- except when I handed it off to a colleague whose tree viewer couldn't use it.

The big advantage of XML, ASN.1, HDF5, YAML is the nightmare of parsing bad formats is eliminated; there are real standards here.  Compare this with all the ways a Genbank flatfile can be botched (I see this complaint routinely on Twitter), in part because Genbank files (and I think PDB; haven't munged one in a while) retain the mainframe-era penchant for the lateral position of text being meaningful.  Format validators exist for these formats, meaning that it is straightforward to prove that a given file is legal.  Now, whether it is gibberish is a different question; NCBI had to invest a lot of time early on cleaning up bad ranges and such in the Genbank data they inherited. 

In contrast, take FASTQ format -- please. It is very simple, which is a little nice, but mostly written and read by computers, which should be able to deal with complexity.  The cost of the simple is an inability (inherited from FASTA format) for any sort of standardized structured metadata. For example, just storing what quality encoding scheme is in use would be worth a mint, let alone which platform generated the data.   FASTQ doesn't seem to break many programs -- in contrast with FASTA in which all sorts of arguments erupt over what is legal and illegal metadata encoding in the header (hint: there is no standard, so anything goes!)

Also consider this vision: a lot of high throughput sequencing consists of reading FASTQ files, making minimal changes to them (primarily trimming) and writing new FASTQ files.  Imagine if instead of keeping a trail of FASTQ files, the reads were all stored in a simple relational database (SQLite is a gem for this sort of thing) and all those transformations stored as edits or replacements of the original sequence, with the metadata trackable throughout.  Sure, those indexes and structures and metadata would consume some space, but far more would be saved.  It's (at least to me) a beautiful idea -- except no tool out there could use it.  

On the bright side, some formats have died.  When I was an undergraduate, every database and sequence handling program seemed to have its own format.  One of my first graduate school projects was a multi-format parser so I could read files in FASTA, Genbank, EMBL, SwissProt (very similar to EMBL) and GCG formats, and write FASTA, Genbank and GCG -- but that was hardly the whole the spectrum of formats in use those days (but it was the set I needed to deal with).  I wish I could say that nobody is reinventing that wheel, but at Starbase I'm frequently asking for files sent to me to be converted out of Geneious or DNA*STAR formats .  You'd think these folks would quit inventing proprietary binary formats, but noooo.  Reminiscent of ABI, which for a long time kept the binary format for Sanger tracefiles a secret, with regular changes to screw up anyone who had hacked the format. Yet another advantage to formats such as XML that come with a format definition -- it is often possible to write parsers that simply ignore the parts of the file they don't understand, so long as the XML (or ASN.1 or similar) is valid.

While I'm ranting, a related curse is that every programmer feels a need to come up with a different way of naming the parameters and a different way of collecting them in a file. I feel a little churlish complaining, as I love these three tools, but wouldn't it be nice if MIRA, Celera Assembler and SPADES had parameter file formats that were at least closely related to each other?  I really admire the effort that went into; just contemplating putting together all those config files would probably cause me to nix the project. 

So in summary, I'd vote for rich formats which are precisely defined so that the headache of parsing, mis-parsing & crashing parsers can be behind us.  Stop coming up with yet another tab-delimited mess (though I'm afraid this is a bit of a do-as-I-say-not-as-I-do -- I'm terrible about imposing such on myself).   If you start a project, try to look around for something existing to at least steal generously from, instead of inventing yet another idiosyncratic format for bioinformaticians to curse out.


Philipp Bayer said...

I've been thinking quite a lot about this problem - computer scientists "fixed" this problem for themselves by creating committees like the one that defined COBOL (CODASYL?) or the IEEE defining all these POSIX standards. As an example, similar to several FASTQS in the bioinformatics world now there were several implementations of awk, after POSIX, your awk implementation either was or wasn't POSIX compliant. With that list of rules comes automated conformance tests, compliance certifications etc.

I now think that bioinformatics needs such a committee (even though it sounds like the most boring thing possible). However, there will be problems: there's no large bioinformatics society like IEEE CS, there are no industry players pushing like IBM did with CODASYL, little engineering expertise in bioinformatics, no government involvement, a very fragmented community in terms of physical location, language barriers, no networks to spread eventual standards etc.

But if anyone wants to create such a committee feel free to contact me, I'll be more than glad to help.

Me said...

The bio world might learn a lesson from the climate community with its use of the CF metadata convention group -

The underlying data format is NetCDF or HDF5. The convention provides semantics - metadata, data structure, naming conventions, etc

Steve said...

Hey Keith

Lately I've been wondering why bioinfo tools haven't gravitated towards an application file format and maybe enhance the data with some ontologies like would be a better direction?...

I particularly like this section in the link above...

A clear, concise, and easy to understand file format is a crucial part of any application design. Fred Brooks, in his all-time best-selling computer science text, The Mythical Man-Month says:

Representation is the essence of computer programming.
Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.

Rob Pike, in his Rules of Programming expresses the same idea this way:

Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

Linus Torvalds used different words to say much the same thing on the Git mailing list on 2006-06-27:

Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
The point is this: an SQL database schema almost always does a far better job of defining and organizing the tables and data structures and their relationships. And having clear, concise, and well-defined representation almost always results in an application that performs better, has fewer problems, and is easier to develop and maintain.

Christopher O'Sullivan said...

fastq does in fact break many programs. it is not a file format. it is a human readable text dump of actual data. because it is human readable, and has no spec or syntax, humans edit it. then edited versions break programs. 'mostly written and read by computers'? when is the last time you sat down and read a million line fastq file? why should a human read/write text 'format' be read by programs?
you never mention SRA, but you imagine storing reads in relational db structure. SRA implemented this years ago. SRA format is column-oriented database allowing vertical and horizontal slicing.
regarding pacbio and hdf5: a few months ago they told me that they would include IPD and other data series in their sam/bam and promised to send me the spec. haven't seen it yet. haven't ever seen a spec for their hdf5 (which has not been very consistent) either. not looking forward to the surprise.

Geneticist from the East said...

So are there any loss of information when PacBio changes the format from HDF5 to BAM? Is this change only optional? If the loss of info makes Quiver not working as well as it was, then it will be a shame.

Keith Robison said...


Thanks so much for filling in -- if I had made a comment about SRA format, which now gets added to my "unfairly unloved clever format" list

LateHolocene said...

@GeneticistFromTheEast: no loss of data in bam and equivalent quiver output.

PacBio opted for the HDF format both for the ability to store rich data in a structured format, but also because the requirements for the output were rapidly changing and difficult to predict even months into the future. The ability to add and subtract data without rewriting an entire file ("can we get a added read group in the house"?) won steadily over compatibility with other tools.

Lack of compatibility with other tools was in the beginning (this is very early on) was more of a design than a liability. Consider the horrible results people would have if they habitually pushed pacbio bams through samtools, versus quiver, though it is surprisingly easy for the former to happen in the windowless cubicles bioinformaticians are often given.

I imagine the switch to bam isn't so much for compatibility with all other pipelines, but the 'htslib'ness of bam. For my CHM1 paper I wrote a hdf->bam->hdf converter so I could gain access to the great grouping and filtering in htslib because the support for this in hdf was not as good. htslib builds and installs in just a couple of minutes, and looks at data with the ease of 'less'.

Whatever the format, if there is good software supporting it, it will be more likely to be in use.