Tuesday, December 17, 2013

Assembly Could Benefit From More Circular Reasoning

It was very gratifying to get comments on my recent piece on a de novo assembly review from both a referee of the manuscript (the amazing Heng Li) as well as one of the authors of the piece (though I am truly feeling guilty I forgot to reach out to the authors).  Of course I was having my usual post-post regrets of things not written, such as the whole interesting topic of dealing with (and leveraging) uneven coverage in metagenomes and when assembling from amplified samples.  But one other thing I was reminded of is one of the minor complaints I have with assembly programs: a lack of proper handing of circular genomes.

Many prokaryotic chromosomes are circular, though linear chromosomes are found in some bacteria as well.  To muddy the waters further, there is a strong suggestion that some species with linear genomes have repeats on the ends that can recombine at times, so these genomes may be linear at times and circular at others.

Every assembler I've ever used, as far as I can tell, knows only about linear sequences.  This is understandable, as dealing with a circle is a pain -- as Jason Chin (@infoecho) put it in a Twitter conversation, computer memory isn't circular (well, not since magnetic core memory)!   Allowing for wrapping means much more complicated structures.

But, not handling circular genomes means that if you can fully assemble a genome, the assembler will break the circle in a biologically arbitrary position.  Now, this isn't frequently a problem with short read assemblies, as getting down to a single contig is rare (unless you are assembling plasmids or perhaps cosmids), but with long read sequencing enabled by the PacBio instrument it is becoming a bit of a routine occurence.  So some tools to deal with this would be useful.

First, one should check if a contig is truly circular.  For Celera assembler, there is a tool amongst the PacBio tools called CA_best_edge_to_GML.py, which generates a GML file viewable in tools such as Gephi.  Run some layout tools, and presto! - circular assemblies are circular graphs! 

Now, because of the way most (all?) assemblers work, if you get a circular assembly it probably will have a section repeated where things were broken.  So if just trivially looping it around won't construct a proper representation of the chromosome.  The size of the repeat will depend on the assembler and data, but intuitively longer reads will cause a longer repeated section.

But where to break?  It would certainly be undesirable for the break to be in the middle of a gene, operon, gene cluster, prophage or other genetic unit, as that might impede downstream analysis. Well, in bacteria a common convention is to start your circular chromosome at the origin of replication, as this is truly the pole of the chromosome from which the main replication forks originate.  A canonical replication origin has near it the dnaA gene involved in initiating replication, so the convention I approximately subscribe to is to make the first base of the start codon of the dnaA ORF the first base of the contig, with dnaA in the forward orientation.  Not all bacterial chromosomes follow this rule nor do plasmids, but it's a good start (for a review of origin finding and exceptions, see "Where does bacterial replication start? Rules for predicting the oriC region".

When I made my bold tweet, I was thinking about writing something that would take a sequence and do these necessary steps -- identify any terminal overlap, identify the dnaA gene and make the modifications.  Actually, that's not true -- when I was in twitter I forgot about the overlap step.  Then I got off thinking how to find dnaA -- probably best to predict ORFs with your favorite ORF predictor first and search those against a dnaA protein set.  In any case, I've chickened down to presenting a slightly buffed Perl program I have for doing the last step: permuting, orienting and trimming the contig.

The code requires that BioPerl and Getopt::Long be installed, but otherwise nothing much need be in place.  The basic syntax is

circ-perm.pl split=200223 max=4561237 in=orig.fa out=perm.fa 

This would trim after position 4561237 (the max parameter) and then permute it around position 200223.  If you want the sequence reverse complemented, add the -rev flag.  The in and out parameters specify the input and output files respectively; BioPerl uses file extensions to guess the format for the input, the output is always FASTA.  The id for the sequence will be appended with some additional information to show how it was transformed; there is also the option of adding a prefix to it with the pre parameter.

If the input file has multiple sequences (e.g. a chromosome plus a plasmid), then the command line above will apply the operation to each of them, which will probably fail on many since the maximum value will be longer than the sequence.  Either the parameter id=targetId can be used to specify the only sequence to modify, or a regular expression can be supplied with the reg parameter.


 #!/usr/bin/perl  
 use strict;  
 use Bio::SeqIO;  
 use Getopt::Long;  
 my $splitPoint=undef;  
 my $inFile=undef;  
 my $reverse=0;  
 my $outFile=undef;  
 my $prefix="";  
 my $maxPos=undef;  
 my $regexp=".";  
 my $targetId=undef;  
 &GetOptions('split=i'=>\$splitPoint,'in=s'=>\$inFile, 'out=s'=>\$outFile,'rev'=>\$reverse, 'reg=s'=>\$regexp,  
       'pre=s'=>\$prefix,'max=i'=>\$maxPos,'id=s'=>\$targetId);  
 $inFile=shift if (!defined $inFile && scalar(@ARGV)==1);  
 unless ( defined $splitPoint && defined $inFile && defined $outFile )  
 {  
   die "circ-perm.pl -in inFile -out outFile -split spsplitPoint (-r)\n";  
 }  
 my $rdr=new Bio::SeqIO(-file=>$inFile);  
 my $writer=new Bio::SeqIO(-format=>'fasta',-file=>">$outFile");  
 while (my $rec=$rdr->next_seq)  
 {  
   if ($rec->id=~/$regexp/ || (defined $targetId && $rec->id eq $targetId))  
   {  

     my $headSeq=$rec->subseq(1,$splitPoint);  
     $maxPos=$rec->length unless (defined $maxPos);  
     my $tailSeq=$rec->subseq($splitPoint+1,$maxPos);  
     my $revFlag=""; $revFlag=".rc" if ($reverse);  
     my $newSeq=new Bio::Seq(-id=>$prefix.$rec->id.".cp.$splitPoint$revFlag",-seq=>$tailSeq.$headSeq);  
     $newSeq=$newSeq->revcom if ($reverse);  
     $writer->write_seq($newSeq);  
   }  
   else  
   {  
     $writer->write_seq($rec);  
   }  
 }  

3 comments:

Josh said...

What were you going to say about leveraging uneven coverage in metagenomes? I agree that it's an interesting problem, and it's especially relevant if you're trying to assemble bacteria!

xico2kx said...

Some tools allow you to visually detect circularization issues in assemblies quite easily.
For instance, in this image showing the assemblies of 50+ E.coli strains, generated by slaMEM, one can easily detect that some genomes are slightly rotated (e.g. the last 3 ones, that are also in the reverse complementary strand, compared to the K-12 stain on top).

Joseph Fass said...

I'm curious if this has improved at all. I tried sprai's script to detect end-overlaps, but it didn't work for my assemblies. I've written something crude that uses the LAST aligner to detect overlap, then trims one of the copies (as your script does above, Keith, but automatically inferring the cut points), then can align corrected PacBio reads back to the trimmed / permuted sequence in order to detect if anything has gone wrong. My Perl is awful, but the script has worked in at least the few cases I've tried (BACs, bacteria with plasmids). I'd love to find out that this is an issue that's better solved elsewhere, though.