Wednesday, January 01, 2014

Envisioning The Perfect Scaffolder

Rather than make any New Year's resolutions of my own, which I would then feel guilty about not keeping, I've decided to make one for someone else: they will write the perfect open source scaffolder.  There's a lot of scaffolders out there, both stand-alone and integrated into various assemblers, but none are quite right. 

If you are sequencing an isolated bacterium or archean and are looking for a scaffolder, except in a few rare cases, you're doing something wrong: given enough long reads from PacBio it should be possible to solve nearly every bacterial genome. But, if you're sequencing eukaryotic genomes or any metagenome (or you're unlucky or data short on a simple microbial genome), you're probably in the market for one.  I'm going to supply a list of attributes I cooked up during a long drive up the Eastern Seaboard today, without much regard for feasibility or even if some conflict with each other.

Let's Agree to Agree

First, a definition so we are all on the same page.  A scaffolder takes assembled contigs and other information and attempts to order and orient those contigs within a larger framework by using some sort of constraint information.  Sometimes this is viewed as hierarchy: scaffolds and super-scaffolds, though a clear distinction between the two is unclear to me.

Any Relevant Data

Okay, the first one is a big one: I should be able to feed to the scaffolder any sort of pertinent data.  Many scaffolders will take simple paired end and mate pair short read libraries, which have tended to be the dominant form of scaffolding data.  So they are the bare minimum.  But these days there is so much more.  Low coverage PacBio data is an obvious one, leading to an expanded view of scaffolding as islands of high quality sequence connected by strands of low confidence data.  Still, low confidence beats runs of "N" any day.  It is worth noting here that current standards for recording sequence are a bit lacking: technically the Phred-style quality values only estimate the probability of a base being correctly called versus some other base, not the probability of an insertion or deletion.
But PacBio is only the beginning of interesting potential data out there.  Sequencing BAC libraries can give imperfect scaffolds, in which the order and orientation may not be known but a bunch of sequences are known to be near each other.  Long Fragment Read would presumably behave similarly.  Optical and similar (e.g. Nabsys) technologies would also generate long-range distance constraints, though a standard format for such data seems to be lacking (FASTG?).  Homology to a related organism are another class of hint, but of course a potentially troublesome one as evolution does indeed rearrange genomes.

One can also imagine more complex types of data.  For example, a number of tools will partition a metagenomic dataset using sequence composition and read counts.  That's useful, but it is easy to imagine that these partitions lack ideal precision, so treating them more probabilistically could be useful in a metagenome context (they're hints, not hard divisions)

In any case, make it completely transparent how to configure a new data type, in case technologies emerge (won't they?) which are useful but don't quite fit any of the existing boxes.

Simple & Standard Manifest Files

Both scaffolders and assemblers (and related tools, such as GapFiller), have a veritable Babel's Tower of configuration files and command line parameter conventions, despite a large number of shared concepts.  I realize that consolidating these down to single format is unrealistic, but if you're going to write a new tool, could you start with the conventions of one of the existing ones?  Manifest files are a good idea for complex configurations, and ideally an entire run should be describable by a manifest file.  I won't take a firm stand on the correct behavior when a command line switch conflicts with a manifest file value, but make it clear what happens!

Minimal Pre-processing

     I don't want to have to do a lot of pre-processing of the libraries, excepting using my favorite assembler.  So if mate pair libraries need adapters removed and various classes of insert identified, that's ideally in the scaffolder's department: it will probably have a lot more information to use for this task.

Library Parameters: Exercise for the Student

     Perhaps a corollary to the above: don't make me give you information you can figure out yourself.  I find it silly to input library size characteristics to a scaffolder, given that it can do a much better job of inferring them from the data than my lousy estimates.  Maybe I should give a file of adapters, but at least the standard ones should be built-in -- and if a sequencing technology manufacturer is Insane enough to try to block the distribution of Information on adapter sequences, shame on them!  Similarly, the degree to which a mate pair library is contaminated with ordinary inserts should be inferred, not expected to be supplied.  Of course, it is critical that the algorithm correctly estimate parameters; at least don't make mistakes that have been identified already.

Ploidy is a Feature, Not A Bug

Given that in the current world scaffolders are needed either for eukaryotic genomes or for metagenomes, which might contain eukaryotic genomes, dealing with higher ploidy is essential -- which means potentially dealing with big insertions and deletions or even rearrangements.  

Rich & Standard Outputs

Typical output for a scaffolder these days is a FASTA file of the scaffolds, with long runs of N.  Those are useful, but FASTQ is strongly preferable, especially as PacBio (or similar) scaffolding becomes common and we need to track which regions of sequence are quite iffy.  Even better, go to FASTG, which will enable a much richer set of estimates for gap lengths to be captured.

Beyond the sequence, it is critical to capture the evidence in graph format, using a standard format which can then be viewed in Gephi or Cytoscape or similar.   If the scaffolder really doesn't like some of the input assembly, it really should flag that to enable examination of the conflicts.  Finally, make sure circular scaffolds and assemblies are clearly marked.

Get It Right

There are tools such as GapFiller which attempt to correct cases in which scaffolders have inserted gaps that don't belong.  Any such tool should be part of the pipeline.  Perhaps that also suggests a degree of modularity -- which might also help with the "any data" requirement above.  Of course, this logic could be extended to saying that scaffolding is really just a phase of assembling, and the scaffolder should be integrated with the assembler. Ideally all stages of the process would be modularized to enable mixing-and-matching components, as suggested by a recent review.

Need Power? Make Sure You Can Use It!

Given all the above, this probably isn't going to run on a smart watch.  A lot of scaffolding is just aligning various data onto the input contigs.  That really should run across a cluster, if a cluster is available (and these days, aren't clusters as cheap as a bunch of Raspberry Pi and a few Legos?).  


And by the way, developers, do it quickly -- scaffolding will remain an important tool for a few more years, but at the rate sequencing technology is improving it is very unlikely to be relevant in a decade and perhaps a niche subject in less than half that. So if you resolve to fill this important gap in genome informatics, perhaps by linking together now unconnected tools, you'd better get cracking!

3 comments: said...

I think the SPAdes guys have done the best work in this respect, but someone needs to go into their code to extract out the scaffolder :)

Keith Robison said...

I used Storify to capture some of the discussion from this piece, including links to great blog entries by other authors on the topic

anton.korobeynikov said...

Well, in SPAdes scaffolder is (yet) far from being perfect. However, one of its strengths is that it is integrated with repeat resolver. And repeat resolver is capable to consume various information about genomic distances (which is converted inside assembler into the set of paths in the graph). So, such an information allows either to resolve a repeat fully, or "jump" over it and thus make a scaffold, or... nothing. but we're working on scaffolding, yes.