Saturday, February 07, 2015

How not to write a sequence assembly comparison paper

Lex Nederbragt flagged, via Twitter, a preprint on the F1000 site with a questionable table comparing sequencing systems.  Alas, once I looked at the paper I've gotten myself in a state where only writing up its numerous deficiencies will free my mind of it.  I've even volunteered to F1000 to review the paper, but I haven't heard anything and so I will use this space.  I'm afraid this paper fall into the small category of manuscripts that I would recommend rejection.

The preprint is titled "Advantages of distributed and parallel algorithms that leverage Cloud Computing platforms for large-scale genome assembly".  Alas, the paper doesn't attempt to deliver anything of the scope promised by that, and the abstract isn't much better.  Most papers have a certain amount of preamble and then deliver some new finding; the preamble to the paper is overlong and badly executed, and the work in the paper is far too minimal and also badly executed.

A historical background trifecta: Overlong, error-rich and incomplete

The introduction gives a historical overview of sequencing, going all the way back to the early 1970s and comparing Maxam-Gilbert methods with Sanger, which already seems like a curious excess.. Perhaps this excessive intro could be excused, if it didn't frequently present debatable (if not outright wrong) points as settled facts.  

For example, it is claimed that Maxam-Gilbert sequencing was capable of sequences "usually smaller than 50 base-pairs in length"; I've deposited into Genbank hundreds of kilobases of data generated using Maxam-Gilbert sequencing with reads far longer than that.  That is followed by a claim that the Sanger method was more widely used because it leveraged the Polymerase Chain Reaction; the initial automation of Sanger by Hood and then ABI had no PCR involved, and even today one can easily run automated Sanger reactions with no use of PCR.  Perhaps the authors appear to have confused cycle sequencing, which involves linear amplification with thermostable polymerases, with PCR, which is exponential.  

We now proceed to the travesty that attracted Lex's attention: an overview of "next generation" sequencing platforms.  Lex, as many readers may know, maintains an elegant, interactive and heavily documented diagram  illustrating platform performance over time which is a marvel of lucidly packing a lot of information into a small space.  In contrast, in the preprint there is a a very ill-designed figure, which should be a table, that prints badly, with the font much too light for the unnecessarily heavy background.  Displaying a four quadrant image adds nothing; a neatly organized table could display more information in a far more readable format allowing easy comparison.  Even if the design were defensible, the content is embarassingly out-of-date, I'd estimate by close to two years. Four platforms are covered in the figure. 

 Illumina is listed as the Genome Analyzer, with a read length of 36-175 basepairs, with 40M reads per run and a cost per megabase of $2.  That cost is in the ballpark for a MiSeq run (on the open market), but that's in 2x300 paired end mode.  The existence of paired ends is in the text but not the figure.  Of course, its big brothers can deliver several orders of magnitude more data (albeit at readlengths of only 2x250 or 2x150) for a price that is roughly proportionately lower.

Next we have the 454 pyrosequencer, described as having read lengths of 200-300, which the Titanium chemistry long ago passed. Also, of course, the 454 is a dead platform, having been discontinued by Roche.  So while it holds some historical relevance, it is unlikely anyone is going to be starting a new 454 de novo project in the future.

ABI SOLiD, which has never been a strong choice for sequence assembly; I've lost track of that platform so I'm not sure if the listed read length of 50bp is correct, though clearly the 85M templates per run is rather off the current spec of 1400M.

Finally, we have the most bizarre member of the table: Helicos Helioscope.  Never a choice for assembly due to its painfully short yet error-rich reads, it's also pretty close to stone cold dead, having gone bankrupt in November 2014 (Yes, there is a company called SeqLL led by a true believer that offers services).

Note what isn't in the table: neither Ion Torrent instrument.  PacBio is missing but gets a brief mention in the text; Ion Torrent doesn't exist at all in the paper despite being used not infrequently for microbial assembly.  Oxford Nanopore, not fully released but with already demonstrated potential in de novo assembly, is absent as well.

This strange overview of platforms segues into a paragraph long description of library and template generation, which wouldn't be a big problem except what is described applies only to the clonal sequencing methods: since the authors have mentioned two different single molecule systems, it would behoove them to mention that each  represents a serious deviation from the workflow they have described.  Helicos had no ligation step, and neither used PCR.  Nor do they pull of the description correctly, placing the PCR amplification step prior to fragmentation not after ligation where it belongs.  Deviations from the original library preparation script, such as PCR-free or transposon schemes, are neglected as well.

Assembly algorithms overview:  better, but still problematic

Given the long introduction on sequencing systems, one might expect an even longer introduction to the issue of sequence assembly that the paper has advertised it is tackling.  Unfortunately, this is not the case; the two sections of the background are roughly equal.  A paragraph does a reasonable sketch of the problems faced by short read assembly in terms of errors and read lengths, though it uses the lamentable term "NGS" and again fails to call out Pacific Biosciences as being in a totally different league.  The issue of non-uniform PCR amplification is brought up, which is not trivial, but the opportunity to point out that single molecule methods avoid this is lamentably (but predictably) omitted.

What is glaringly missing is any treatment of preprocessing of the reads, which is an area of active exploration and many have found very useful.  Such upstream steps can include error correction, read merging (which, as we will see, quite possibly could have a large impact on one method) and read normalization.  

A quick overview of assembly software strategies is given, looking at Overlap/Layout/Consensus (OLC) and de Bruijn Graph (DBG) approaches. Continuing the trend of not fully covering the field, string graphs aren't mentioned, which is unfortunate as will be noted below.   A small error creeps in here: while many if not all OLC methods in the field these days use K-mers to speed overlap detection, it is not an inherent part of the method and certainly was not the case in the some of the early implementations.

Stumbling to the problem at hand

Having gone through all the preamble, the efforts of the paper are now revealed.  A single dataset, the zebrafish dataset (bizzarely described as the "fish species M. zebrafish") from Assemblathon  and roughly 197X coverage, Despite getting data from a well done comparison of assemblers, things just go downhill from here.  The dataset will be downsampled into 2X, half the data and one quarter the data, but the method for the sampling isn't given.  Later we'll learn that low quality sequences were removed from the downsampled datasets, though the definition of low quality or the method used is never described.  Statistics will be generated by a Perl script, all of which are promised to be available from the authors, which is always well meaning but nearly useless. At a minimum they should be in supplementary files; better still would be deposition in a public repository.

We now get some background on the two DBG assemblers (yes, only two) which will be used.  The golden oldie Velvet will represent all single machine DBG assemblers, whereas in the network distributed DBG corner will be Contrail, which uses Hadoop.  The authors proceed with a detailed description of each algorithm, which I'll confess I'm not sharp enough to really vet.  One important point they do point out: Contrail (at least the version they used) doesn't handle paired end information, whereas Velvet does.  As pointed out before, merging overlapping paired end reads 

Execution: about what you'd expect from what's been seen so far

Both Velvet and Contrail operate at a single k-mer value, so the authors perform the reasonable step of trying a few k-mer values with the 2X coverage dataset.  This yields the only data graph (but a well done one) of the paper, which showed an optimum at 65, based on the length of the maximum contig. No justification is given for using this as the figure-of-merit; I would tend to lean towards N50 if I had to use a single metric.  Since my favorite assembler, SPADES, uses multiple k-mers, I'm not in the habit of doing this optimization, but I do wonder if they would get a different result with a larger sample of the data.  The lack of error correction kicks in here as well; longer k-mers might be possible if errors are cleaned up first.

They then describe a number of runs with each assembler, one on each dataset (which remember, are all drawn from a single original dataset).  They make much hay of the fact that even though they used a machine with 1T of RAM, Velvet crashed on the full dataset as well as the 1/2 dataset in paired mode.  This illuminates their odd choice of Velvet to make the comparison.  While it is a golden oldie and still used heavily (though I've never been a big fan), Velvet has many competitors that are likely to be more suitable for a dataset this large.  For example, in their overview they mentioned SOAPdenovo, which advertises it is intended for large genomes.  What isn't covered anywhere are a plethora of DBG implementations better suited for this task.  For example, PASHA, ABySS and Ray are just some of the programs which can run across multiple machines, Minia is one of several DBG implementations designed to be memory efficient.  String graph approaches, which as noted above were not covered at all by the authors, also promise memory efficiency (e.g. Readjoiner). Testing metavelvet could have been interesting, since many of the problems faced in assembly of metagenomes and large genomes are the same. This is also a place where exploring more interesting subsampling of the data, such as implemented in khmer.  Also, given that the assembly worked for half the data but not on all of it, why didn't they try walking up to the failure point?

The table of results hits a pet peeve of mine: the results are in no apparent order.  I'm guessing it's the order they ran things in, which would be far less interesting than ordering them by the size of the dataset -- the list jumps from 2X to full to quarter to half.  More worrisome, they make no attempt to assess the actual quality of the assembly by comparing it to a reference or some other approach; the maximum contig size and N50 are seen as trusted figures-of-merit.  These are both useful statistics, but neither assesses quality.  As a trivial example, and I am not the first to post something like this, here is a trivial Perl program to improve both statistics for a FASTA file:

perl -nie "if  (/^>/) { $hc++; next unless ($hc==1); } print $_; " myassembly.fa

But if they do insist on using these metrics, why didn't they plot them?  Presenting this only in tabular form seems perverse next to the clean graph of the trial results, and a graph would illustrate possible trends in the data far better than the table.  For example, on the minimal (2X) dataset, Velvet in paired mode delivers a worse N50 but better maximum contig (though both are small differences).  

Overreaching on the conclusions

Having seen the best N50 and maximum contig size with Contrail on the full dataset, the authors conclude that this is the best choice for assembling large genomes.  They make an observation along the way that since Velvet did equally well with 1/2 and 1/4 the dataset, less data can be better.  This isn't wrong, but has been illustrated much better elsewhere, particularly using k-mer based subsampling.  Based on their analysis of a single dataset, handled without the current state-of-the-art in read processing and using only a single comparator program, the authors conclude that a single machine can't cope with these large datasets -- which of course Minia (and Readjoiner and probably many more) has proven otherwise. 

But in making this observation that less data worked well for Velvet, they fail to decide to run Contrail on those smaller datasets to test this point further.  Nor do they re-explore the k-mer value for Velvet; at 9 hours a run this would not have been unreasonable to experiment with.

The authors also make the curious statement that Velvet remains better for smaller genomes.  Other than showing that on their 2X dataset Contrail  delivered poorer statistics in 4X the time, they haven't at all explored smaller datasets -- let alone smaller genomes.  What frustrates me most about statements like this (well, papers like this) is that I do find the concept of Hadoop in bioinformatics intriguing, and would like to see more good explorations of it (there's only 61 entries in PubMed right now for the query "Hadoop"; perhaps my interest isn't widely shared).  Papers like this don't add anything to the discussion, and potentially harm it by besmirching the field.

There's also the overall (but unaddressed) question of whether de novo assembly of large genomes purely from small insert short read libraries is relevant or scientifically defensible.  Does anyone publish such without at least a mate pair library to scaffold them?  In my eyes, but I'm not actively in the field, momentum is rapidly shifting to assembling large genomes using long reads, using them to scaffold short read assemblies or to assemble directly from long reads, perhaps after error correction with short reads.  The forefront of assembly bioinformatics is integrating short reads, long reads and perhaps chromosome-level scaffolding approaches such as Hi-C,  with the goal of draft genomes that are nearly finished. So, the paper ultimately has the feel to me of a comparison of wooden naval vessels in the era of dreadnoughts, or a contemporary review of slide rules.  

A final curious note is the statement of author contributions: the final author got on because he "reviewed and edited the manuscript".  That's a really low bar and usually rates only an acknowledgement.  It's also not something I'd take credit for; in addition to the many issues I've illustrated above, the manuscript has more than its fair share of copy editing issues.

A Bizzare Coda (aka damn you Google!)

When writing these bits, there's often a final phase where I fret over small details, in particular if I've added enough links.  While looking up the Contrail site, a link slightly down the page was titled "Parallel and Cloud Computing Based Genome Assembly using Bi-directed String Graphs".  Since it sounded interesting, I clicked on it to I discover an abstract for a Masters Thesis from 2012 that spoke only of Velvet and Contrail, and not of string graphs: written by the lead author of the manuscript I've savaged above; supervised by the final author.  While the abstract is notably different in wording, it describes the same work as the manuscript.  Now a few things fall into place: the grossly outdated information is because this is an attempt to publish nearly 3 years later a minor work in a fast moving field, with a reworked but still inaccurate title.  


Anonymous said...

(Just a quick note: No, 50 bp for SOLiD reads isn't up to date either.)

Unknown said...

Even if they dont allow you to be an official reviewer, maybe it would be a good idea to post a link to this in the comments to the article.

Jeff said...

From now on I'm going to refer to Danio rerio as "Monsieur Zebrafish".

Anonymous said...

"Oxford Nanopore, not fully released but with already demonstrated potential in de novo assembly, is absent as well."

Very good post, except that the above statement is not honest. In the paper you linked to, Oxford Nanopore is used along with Illumina reads to do de novo assembly. If someone mixes Helicos data and Illumina data to get marginal improvement in assembly, will you say 'Helicos demonstrated potential in de novo assembly?'

AFAIK, nobody has shown how to do de novo assembly from Oxford Nanopore only.

Keith Robison said...

Yes, I should have been more accurate on Oxford Nanopore & de novo assembly ; no one has demonstrated direct de novo assembly yet, though the fact that MinION data can be used in combination with Illumina data to achieve a significantly superior assembly is a major step.

A paper published just after I wrote this from Mark Akeson's group demonstrated 2D reads with accuracy in the range of PacBio accuracy; while they did not attempt de novo assembly, it is not unreasonable to think such a paper is around the corner.