Sunday, December 15, 2013

Assembling a Review of a Review of Assembling

A review on short-read de novo genome assembly appeared recently in PLoS Computational Biology, titled "Next-Generation Sequence Assembly: Four Stages of Data Processing and Computational Challenges".  I think the review has a number of merits, but I also find a number of frustrating flaws.  I'm going to write this entry much as I would have written a referee report on it.  Unfortunately, that will mean I'll dwell a bit more on the flaws than the assets, but if you are interested in the field
First, I don't want to fail to appreciate the effort that went into this review. The authors have taken a complex and rapidly developing topic and attempted to take a snapshot of it. Having written a review, I know how challenging it is to be comprehensive and up-to-date. In this regard, I believe the authors would have done themselves a service by being more explicit about the subject and particularly the temporal scope of the review; a clear cut-off date helps readers understand which valuable items were missed because of the time it necessarily takes to write a review, rather than through inadvertent omission. The issue of omissions looms large in this review, and so clearer ground rules for what was included would have helped avoid unfair judgments. Some web items in the bibliography are noted as having been accessed in August 2013, so that may be a fair date. If so, then it is surprising that they describe the read lengths on Illumina as "100 to 150 bp", given that MiSeq was delivering 250 well before that and 300 had already been announced.

Being comprehensive is very challenging, and I won't attempt to supply all the worthy programs and methods omitted from the review, but it will be worth flagging some key ones.  I'm also not volunteering to write a review in this space myself; it is a daunting task indeed! (I'd be happy to review one, or shepherd one through Briefings in Bioinformatics, where I am a mostly quiescent member of the Editorial Board).

The authors reasonably view assembly as a four-stage process: Preprocessing Filtering, Graph Construction, Graph Simplification and Postprocessing Filtering (generation of contigs and scaffolds). As they note, various programs cover different stages, with some choosing to specialize in one area and others providing a pipeline for the entire process. What they do cover in each section includes significant detail and is, I feel, a good introduction to each subject, albeit with the caveats below, which are largely issues of omission.

The pre-processing filtering section of the review has a useful overview of the most common approaches to error correction. Unfortunately, several major areas of pre-processing are completely ignored, such as end trimming and adapter removal, k-mer frequency-based partitioning and normalization, and paired-end read joining. The coverage of specific read correction tools is clearly incomplete, given that they left my workhorse, MUSKET, off their table.
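For readers who haven't met k-mer frequency-based normalization, here is a minimal Python sketch of the general idea. It is not taken from the review or from any particular tool. A read is kept only if the median count of its k-mers, among the reads kept so far, is still below a cutoff, so redundant high-coverage reads are discarded. The function names, the tiny `k` and the `cutoff` value are all illustrative assumptions, and the sketch ignores reverse complements and uses a plain in-memory counter, which would not scale to real data sets.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize_by_median(reads, k=4, cutoff=2):
    """Keep a read only if the median count of its k-mers,
    counted over the reads kept so far, is below cutoff.
    (Hypothetical helper; k and cutoff are toy values.)"""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: no k-mers to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept
```

Given five copies of the same read plus one unrelated read, this keeps only enough copies to reach the cutoff and still retains the unrelated read.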

An early complaint on Twitter was the omission of many actively developed and popular assemblers, such as IDBA, Ray, MIRA, MaSuRCA, SPAdes and Minia. Conversely, the review covers a number of assemblers that were published early but seem to be moribund, and are certainly not ones I see frequent questions about on SEQanswers or elsewhere. These omissions are particularly frustrating when the authors make sweeping statements or partition assemblers into categories. For example, at one point they state that all de Bruijn assemblers have high memory requirements, but that ignores the Minia assembler. Similarly, they say there are two categories of implementation: those that run on a single computing node and those which can run across multiple nodes with a shared filesystem. Unfortunately, there is an example of a third category: the Contrail assembler, which uses the Hadoop framework for processing across nodes without the necessity of a common file system.

The issue of scope arises again for the coverage of long read assembly. The PacBioToCA pipeline is discussed, but not other PacBio error correction protocols that use short reads (such as LSC), truly hybrid approaches (such as Cerulean) or PacBio-only assembly schemes (such as ALLORA and HGAP).

The concept of assembly using a template genome, or comparative genome assembly, is touched on, but without much depth. Another specialized form of assembly is dealing with polymorphisms in non-haploid species; this isn't really covered at all, though I suspect this was what was in mind when a table of assemblers has a column with values either "Prokaryotic" or "Prokaryotic / Eukaryotic", which aren't obviously sensible categories. The issue isn't prokaryote vs. eukaryote, but rather the complications of diploidy (or worse, higher ploidies). Assemblers specifically handling the issue of polymorphism, such as Cortex, should have been mentioned. While I'm griping about tables, the order of programs in a table seems to be utterly arbitrary rather than something obvious such as alphabetical.

The section on evaluating assemblies and assemblers for quality is quite thin, but perhaps this is an area that has particularly exploded this year; again the issue of temporal scope invades.  I also think that assembly evaluation should be considered a fifth stage of assembly, if for no other reason than to raise the degree of focus on this important topic.

One final omission to gripe about is that the post-processing section does not cover the various tools which attempt to improve assemblies by polishing or removing gaps, such as Quiver (specifically for PacBio), PAGIT, GapFiller and ICORN2. I'm not sure I'd really want to lump them into their four stages.

An important idea they propose is that there would be great value in being able to mix-and-match these stages between assemblers, with a vision of a unified interface and an assembler drawing on the strengths of all the various approaches. It's an appealing vision, but I wish they had explored it further. Some discussion is made of file formats, for example, but only really covering what exists (FASTQ for reads, FASTG for scaffolds) without really hammering on the huge gaps where formats are missing. I realize that getting authors to agree on a common parameter manifest format is a tall order, but one can dream, can't one? Such a step would make large scale assembler comparisons much less onerous. Many assemblers rely on or generate graphs, but this is another Tower of Babel with too little standardization. Even if the existing assemblers can't bring themselves to converge, shouldn't the onus lie on new developers to either pick one of the other programs' ad hoc formats or find some standards, even from outside bioinformatics?
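To make the mix-and-match idea a little more concrete, here is a toy Python sketch of what a unified stage interface might look like. It is purely illustrative. The shared `State` dictionary, the stage names and the `run_pipeline` helper are my own assumptions, not anything proposed in the review or implemented by an existing assembler. Every stage takes and returns the same container, so any stage can be swapped for another with the same signature.

```python
from typing import Callable, Dict, List

State = Dict[str, object]          # hypothetical shared container passed between stages
Stage = Callable[[State], State]   # every stage has the same signature


def drop_short_reads(state: State) -> State:
    """Toy preprocessing stage: discard reads shorter than k."""
    k = state["k"]
    return {**state, "reads": [r for r in state["reads"] if len(r) >= k]}


def build_de_bruijn_graph(state: State) -> State:
    """Toy graph-construction stage: link each k-mer's
    (k-1)-base prefix to its (k-1)-base suffix."""
    k = state["k"]
    graph: Dict[str, set] = {}
    for read in state["reads"]:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
    return {**state, "graph": graph}


def run_pipeline(stages: List[Stage], state: State) -> State:
    """Apply the stages in order, feeding each one the previous output."""
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline(
    [drop_short_reads, build_de_bruijn_graph],
    {"reads": ["ACGT", "AC"], "k": 3},
)
```

Passing the state in memory, as here, sidesteps the cost of writing and re-reading intermediate files. Stages shipped as separate programs would instead have to serialize the reads and graph between steps, which is exactly where agreed-upon formats would matter.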

A key issue that may arise is the degree to which such a mix-and-match approach is truly feasible: will the performance hit (largely IO) of writing and reading structures outweigh the advantages of being able to assort stages at will?

So, to summarize, I think this is a useful but flawed review.  Readers will benefit from it, but unfortunately cannot rely on it to be a comprehensive snapshot of the state of short read de novo assembly in 2013.



 

5 comments:

Heng Li said...

I am a referee for this review. The flaws in the review largely reflect my ignorance in this area. Nonetheless, it should be noted that I got this review in April and the authors probably started the writeup in late 2012, when GAGE-B and this year's super stars like HGAP/SPAdes were not yet published. A few of the tools you mentioned had only just been published back then, or have still not been published today. While the authors could have added more material before it was accepted in July, they have done a job good enough for a publication.

On the format, the key is to have an implementation along with the spec, so that the designers can learn what is important in practice and the assembler developers can feel that the effort of conforming to a generic format has paid off. Few would use SAM/VCF nowadays if there had been no samtools/vcftools in the first place.

Keith Robison said...

Heng:

Thank you so much for the informative comment! That does help fix the timeframe covered much better. Given that, I think some of the areas I hit are a bit unfair (the assessment area has really blossomed this year) & that explains some of the omitted assemblers (such as MaSuRCA) but not some of the others (MIRA, Ray).

I agree that having a format & the tools to drive it are really valuable; far too much time has been spent designing specs which were never used (indeed, I was once dragooned into such an effort).

Unknown said...

Thank you very much for reviewing the review, and thanks to Heng for his clarification. In fact I am the corresponding author of the review and I was about to comment on the timeframe, which really made the difference. However, Heng kindly wrote it for me.

Exactly as Heng said, we wrote this manuscript over a year ago (first draft in Nov. 2012) and it was submitted to PLoS Computational Biology early this year. The journal took quite a long time to get it published. For instance, the paper was officially accepted on Sep. 8th after a revised submission in July, but it was published on Dec. 12th. Unfortunately, PLoS CB only shows the publication date rather than the submission and acceptance dates. Otherwise, you could easily see that the omissions are mainly because those tools were unpublished or had only just been published. Nevertheless, we missed some existing tools. Comprehensive surveying is really challenging in a rapidly growing area such as NGS and NGS assembly.

Thanks again for taking the time to read our paper and write such a long, yet useful, review. For Heng, thanks twice, for reviewing the review and reviewing the review of the review :)

Keith Robison said...

Mohamed:

Thank you for your response! I apologize that I meant to contact you via email & then failed to -- glad you found out anyway.

I hope you found my comments helpful for any future endeavours -- I agree that trying to catch every last tool is a tall order.

Keith