Monday, May 22, 2017

What Is (and Is Not) Sequence Assembly?

In the closing talk of the pre-London Calling workshop, Hans Jansen had closed his presentation with a question whether at some future date sequence assembly would become obsolete.  This was meant to be an aspirational vision for a distance timepoint, but one correspondent on Twitter saw it as hype.  I got in a bit of a discussion, constrained by the dreaded 140 character limit, which ended up largely illustrating that I have a somewhat more restricted definition of assembly than some people.  I'm going to explore this and you can judge for yourself
The venerable approach to sequence assembly is Overlap-Layout-Consensus, or OLC.  Short reads have given us De Bruijn Graph (DBG) assemblers, which have also given native Dutch speakers much mirth as the rest of us try to say "De Bruijn". There's also string graph assemblers.  But I'm going to only speak of OLC here.

The abbreviation neatly encapsulates the method, listing the execution order (in contrast to DBG, which I wonder if it is called GDB (Graphique de Bruijn) in French-speaking countries).  These steps are:
  • Overlap: determine the graph of overlaps between reads
  • Layout: simplify the overlap graph, organizing into contigs
  • Consensus: compute the consensus for each contig

There are, of course, many variations on the theme.  Different heuristics for dealing with real world complexities, such as data size, diploidy or higher ploidies, repeats, experimental errors, etc.  

Now I will put forth several common scenarios which I would argue are not assembly.  I'll also entertain that they are, as some will feel the operations I suggest are equivalent to steps above.

Short fusion amplicons

As a first scenario, suppose you are sequencing short amplicons and have designed the sequencing primers into the amplification primers (fusion primers).  By short, I mean that the expected read length on your sequencing instrument will pass all the way through the full amplicon.

In this case, my typical analytical strategy is to quickly Partition the reads into full-length reads and partial ones.  If the sequencing libraries contain pools of amplicons, then this stage would also partition them by amplicon type (and similarly by barcode, if barcoded samples were used).  Once this is done, then each set of reads for a given amplicon partition need only be aligned to generate the consensus or call variants. 

So in this case, I have a PC algorithm -- partition and consensus.  There is certainly no layout required and everything within a partition is known to overlap.  Now, I could see an argument in which the partition step is really an overlapping.  So you might argue this is a OC assembler, but to me it is a different algorithmic class.

A variant case can be found often on Illumina with paired reads.  While the individual reads may not span the amplicon, they can be trivially overlapped in the middle to create a merged read spanning the amplicon.  Now, we did use an overlapping here, but it is a much simplified case.  Instead of comparing all-to-all, which is typical in an OLC, we've just compared two pre-defined reads.  Again, to me this isn't really assembly: assembly is hard and this isn't.

With long reads, the length of a spannable amplicon can be quite long.  Such oriented reads could also be constructed by clever restriction digestion schemes which force only one end of the fragment to be adapted.

Short non-fusion amplicons

Now suppose we didn't use fusion amplicons, but rather just threw the amplicons into a typical ligation prep.  Now we must add an additional step to orient the reads, as they could sequence in either direction.  But using the primer sequences to orient (or Flip) the fragments isn't much of an addition over using the primers to determine if the amplicon is wholly contained in a read.  So this is a simple "Partition-Flip-Consensus" or PFC algorithm.

With approaches like CATCH, long defined sequences can be quite long, as in hundreds of kilobases.  So a PFC algorithm could be quite useful.  With transposase methods, entire eukaryotic chromosomes might be possible as well, though that will depend on where the transposase inserts.


With the longboard protocol from UCSC, large numbers of sequences can be generated from BACs which contain the complete BAC sequence.  The catch is that the transposase inserts randomly within the BAC with a random orientation.  So with longboard reads, we need to detect the vector and then use it not only to flip the sequence as needed but circularly permute (Rotate) the sequence.  So we now have a PFRC algorithm.  Not assembly to me; is it to you?

As I've noted a few times recently, there is a very real possibility of using a Longboard-style approach to sequence chloroplast or plant mitochondria or even some small bacterial genomes.  This claim is based on the observation by multiple groups of fully alignable reads (what Miten Jain calls "The Robison Rule") in excess of 500kb.   So an efficient PFRC algorithm implementation could be very useful,  I've actually tried some Longboard-style data in Canu, and it actually gets a bit confused by a set of reads with a very narrow size distribution.  That's not a criticism: this isn't what Canu is designed to do.  

Okay, so that's my thinking, however mistaken, of what isn't assembly. But if you'd like to argue they are just special subcases of assembly, the comments section is now open!

1 comment:

Anonymous said...

You seem to be making the same distinction that AI researchers used to make: "If we can do it, it's not AI." Once a method was found for solving an artificial intelligence problem, the subfield was renamed (search, machine learning, ...).

All the things you described as assembly, albeit special cases that call for special, more efficient algorithms that the general case.