First, I don't want to fail to appreciate the effort that did go into this review. The authors have taken a complex and rapidly developing topic and attempted to take a snapshot of it. Having written a review, I know how challenging it is to be comprehensive and up-to-date. In this regard, I believe the authors would have done themselves a service by being more explicit in stating the subject and particularly the temporal scope of the review; having a clear cut-off date helps in understanding which valuable items missed the review due to the necessary act of writing it, rather than due to being inadvertently omitted. The issue of omissions looms large in this review, and so clearer ground rules for what was included would have been helpful to avoid unfair judgments. Some web items in the bibliography are noted as having been accessed in August 2013, so that may be a fair date. If so, then it is surprising that they describe the read lengths on Illumina as "100 to 150 bp", given that MiSeq was delivering 250 bp well before that and 300 bp had already been announced.
Being comprehensive is very challenging, and I won't attempt to supply all the worthy programs and methods omitted from the review, but it will be worth flagging some key ones. I'm also not volunteering to write a review in this space myself; it is a daunting task indeed! (I'd be happy to review one, or shepherd one through Briefings in Bioinformatics, where I am a mostly quiescent member of the Editorial Board).
The authors reasonably view assembly as a four-stage process: pre-processing filtering, graph construction, graph simplification, and post-processing filtering (generation of contigs and scaffolds). As they note, various programs cover different subsets of these stages, with some choosing to specialize in one area and others providing a pipeline for the entire process. What they do cover in each section includes significant detail, and I feel it is a good introduction to each subject, albeit with the caveats below, which are largely issues of omission.
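For readers who haven't looked inside an assembler, the four-stage decomposition can be made concrete with a toy sketch. This is my own framing in Python, not code from the review or from any real assembler, and each stage here is a drastic simplification of what real tools do:

```python
def preprocess(reads):
    """Stage 1 (pre-processing filtering): here, just drop reads containing N."""
    return [r for r in reads if "N" not in r]

def build_graph(reads, k=4):
    """Stage 2 (graph construction): de Bruijn graph whose nodes are
    (k-1)-mers and whose edges are the observed k-mers."""
    graph = {}
    for r in reads:
        for i in range(len(r) - k + 1):
            kmer = r[i:i + k]
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
    return graph

def simplify(graph):
    """Stage 3 (graph simplification): tip clipping, bubble popping, etc.
    Left as the identity here."""
    return graph

def postprocess(graph):
    """Stage 4 (post-processing filtering): emit contigs by walking
    unambiguous paths from nodes with no predecessor."""
    preds = {}
    for node, succs in graph.items():
        for s in succs:
            preds.setdefault(s, set()).add(node)
    contigs = []
    for node in [n for n in graph if not preds.get(n)]:
        contig = node
        while node in graph and len(graph[node]) == 1:
            node = next(iter(graph[node]))
            contig += node[-1]
        contigs.append(contig)
    return contigs

reads = ["ACGTTG", "GTTGCA", "ACNTTG"]  # the last read gets filtered out
print(postprocess(simplify(build_graph(preprocess(reads), k=4))))
# → ['ACGTTGCA']
```

The point of spelling it out this way is that each stage has a clean input and output, which is exactly why the mix-and-match idea discussed later is so tempting.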
The pre-processing filtering section of the review has a useful overview of the most common approaches to error correction. Unfortunately, several major areas of pre-processing are completely ignored, such as end trimming and adapter removal, k-mer frequency-based partitioning and normalization, and paired-end read joining. The coverage of specific read-correction tools is clearly incomplete, given that they left my workhorse, MUSKET, off their table.
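Many popular error correctors, MUSKET included, are k-mer spectrum based: count every k-mer in the input, treat frequent ("solid") k-mers as trustworthy and rare ("weak") ones as likely sequencing errors to be corrected toward solid neighbors. A minimal illustration of the spectrum idea, not any tool's actual algorithm (the function names and cutoff are mine):

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def solid_kmers(counts, cutoff=2):
    """K-mers seen at least `cutoff` times are trusted; rarer ones are
    presumed to contain a sequencing error and become correction targets."""
    return {kmer for kmer, n in counts.items() if n >= cutoff}

# Five clean copies of a read plus one copy with a single-base error:
counts = kmer_spectrum(["ACGTACGT"] * 5 + ["ACGAACGT"], k=4)
solid = solid_kmers(counts)
weak = set(counts) - solid   # the four k-mers overlapping the error
```

The single mis-called base contaminates every k-mer that spans it, which is why the weak set pinpoints the error's neighborhood so effectively at reasonable coverage.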
An early complaint on Twitter was the omission of many actively developed and popular assemblers, such as IDBA, Ray, MIRA, MaSuRCA, SPAdes and Minia. Conversely, the review covers a number of assemblers that were published early but now seem moribund, and certainly are not ones I see frequent questions about on SEQAnswers or elsewhere. These omissions are particularly frustrating when the authors make sweeping statements or partition assemblers into categories. For example, at one point they state that all de Bruijn assemblers have high memory requirements, but that ignores the Minia assembler. Similarly, they say there are two categories of implementation: those that run on a single computing node and those which can run across multiple nodes with a shared filesystem. Unfortunately for that taxonomy, there is an example of a third category: the Contrail assembler, which uses the Hadoop framework to process across nodes without the necessity of a common file system.
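On the memory point: Minia keeps its footprint low by storing the de Bruijn graph's k-mer set in a Bloom filter, paired with a small exact structure for the false positives that matter during traversal. A toy Bloom filter for k-mer membership shows the trade-off; the sizing and hashing choices here are purely illustrative, not Minia's actual implementation:

```python
import hashlib

class KmerBloom:
    """Toy Bloom filter: records k-mer membership in a fixed bit array,
    trading a small false-positive rate for a large memory saving over an
    explicit hash set. False negatives are impossible by construction."""

    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, kmer):
        # Derive n_hashes independent bit positions via salted BLAKE2b hashes.
        for i in range(self.n_hashes):
            h = hashlib.blake2b(kmer.encode(), salt=bytes([i] * 16)).digest()
            yield int.from_bytes(h[:8], "little") % self.n_bits

    def add(self, kmer):
        for p in self._positions(kmer):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, kmer):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(kmer))
```

However the details are implemented, the upshot is a k-mer set stored in a few bits per element rather than tens of bytes, which is how Minia escapes the "all de Bruijn assemblers are memory hogs" generalization.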
The issue of scope arises again in the coverage of long-read assembly. The PacBio2CA pipeline is discussed, but not other PacBio error-correction protocols that use short reads, such as LSC, truly hybrid approaches such as Cerulean, or PacBio-only assembly schemes such as ALLORA and HGAP.
The concept of assembly using a template genome, or comparative genome assembly, is touched on, but without much depth. Another specialized form of assembly is dealing with polymorphism in non-haploid species; this isn't really covered at all, though I suspect it was what the authors had in mind when a table of assemblers has a column with values of either "Prokaryotic" or "Prokaryotic / Eukaryotic", which aren't obviously sensible categories. The issue isn't prokaryote vs. eukaryote, but rather the complications of diploidy (or worse, higher ploidies). Assemblers specifically handling polymorphism, such as Cortex, should have been mentioned. While I'm griping about tables, the order of programs in a table seems to be utterly arbitrary, rather than something obvious such as alphabetical.
The section on evaluating assemblies and assemblers for quality is quite thin, but perhaps this is an area that has particularly exploded this year; again the issue of temporal scope invades. I also think that assembly evaluation should be considered a fifth stage of assembly, if for no other reason than to raise the degree of focus on this important topic.
One final omission to gripe about is the lack of coverage, among the post-processing steps, of various tools which attempt to improve assemblies by polishing or closing gaps, such as Quiver (specifically for PacBio), PAGIT, GapFiller and ICORN2. I'm not sure I'd really want to lump these into their four stages.
An important idea they propose is that there would be great value in being able to mix-and-match these stages between assemblers, with a vision of a unified interface and an assembler drawing on the strengths of all the various approaches. It's an appealing vision, but I wish they had explored it further. There is some discussion of file formats, for example, but it only really covers what exists (FASTQ for reads, FASTG for scaffolds) without really hammering on the huge missing formats. I realize that getting authors to agree on a common parameter manifest format is a tall order, but one can dream, can't one? Such a step would make large-scale assembler comparisons much less onerous. Many assemblers rely on or generate graphs, but this is another Tower of Babel with too little standardization. Even if the existing assemblers can't bring themselves to converge, shouldn't the onus lie on new developers to either pick one of the other programs' ad hoc formats or find some standard, even from outside bioinformatics?
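Part of why FASTQ persists on the "what exists" side is that it is trivial to read, and its wrinkles (historical quality-score encodings, occasional line wrapping in old files) are exactly where the lack of a tight specification bites. A minimal reader, assuming the common strict four-lines-per-record layout:

```python
def parse_fastq(lines):
    """Yield (id, sequence, quality) from strict four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                             # the '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual  # drop the leading '@'

records = list(parse_fastq([
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "FFFF",
]))
# → [('read1', 'ACGT', 'IIII'), ('read2', 'TTGA', 'FFFF')]
```

If every stage of every assembler consumed and emitted formats this simple and this well agreed upon, the mix-and-match vision would be far closer to reality; the graph interchange problem is where the real pain lies.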
A key issue that may arise is the degree to which such a mix-and-match approach is truly feasible: will the performance hit (largely I/O) of writing and reading intermediate structures outweigh the advantages of being able to assort stages at will?
So, to summarize, I think this is a useful but flawed review. Readers will benefit from it, but unfortunately cannot rely on it to be a comprehensive snapshot of the state of short read de novo assembly in 2013.