Tuesday, February 26, 2013

Post AGBT: A Longish Item on Long Sequencing

As others have noted, a significant theme at AGBT this year was sequencing at length.  While this year lacked true bombshells, PacBio impressed many by making single-contig bacterial genome assemblies look easy.  Moleculo had been the object of much pre-meeting excitement, and while very few additional details emerged about their process, several talks showed what could be done.  As I have discussed previously, Nabsys demonstrated their “positional sequencing” system to select invitees in a hotel suite.  Optical mapping from OpGen and BioNano Genomics featured in a few posters, but did not attract much attention.  Oxford Nanopore had no physical presence, beyond a somewhat secretive suite, but several ONT staffers were happy to reiterate their confidence that they will launch their system – when it is good and ready.
In the end, three things will drive adoption of these technologies and the extent to which each one succeeds, which I will explore in detail below.  First, there are the applications; there are different strengths and weaknesses to each, and some systems will be ill-suited for some (or completely unusable, with a lack of commercial availability being the ultimate in unusability).  Second, there are cost considerations, though none of the presenters seemed to even touch on this, leaving pundits such as myself to do back-of-envelope estimates (some of which threaten to pop eyeballs).  Finally, there are just preferences.  For example, as noted elsewhere, Moleculo will be attractive to shops already heavily invested in Illumina, particularly if they are averse to shipping some work elsewhere.
For application space, most examples given at AGBT were genome assembly (including gap filling and other improvements), structural variant discovery or haplotyping.  One poster showed the use of PacBio for cDNA sequencing, which will certainly be a boon to cataloging splice variants.  Metagenomics applications came up in Q&A, but I don’t believe any talks or posters actually showed this.
As far as matching technologies to applications, it’s useful first to explore who is just plain absent and who is a pretender to the throne.  Ion Torrent has simply ignored this area, and their AGBT presentation was no different.  Rothberg’s plenary talk, which apparently was a near copy of the one he gave at the earlier Ion sequencing symposium at an adjacent hotel, was big on “Moore’s Law” and enjoying the spotlight (and also inducing many eye rolls), with lots of projections of the capacity improvements coming on Proton (PGM users were only referenced in terms of number of runs; nothing was promised here) and discussion of amplicon, capture and RNA-Seq applications, but no mention of long range information.  The pretender to the throne is clearly Roche/454.  They presented one nice talk in the bioinformatics session describing valuable work closing gaps in a human cell line sequence, but cost was completely ignored.  No surprise: with 454 running north of $10K per gigabase, a 10X human genome would be upwards of a quarter million dollars.  PacBio’s per gigabase cost is at worst half that – so their 10X human genome was only about $100K (apologies again for posting a much higher number on Twitter previously).  The contrast is that 454 seems stuck with incremental improvements in modal length and no significant changes in density, whereas a doubling of PacBio throughput should be rolled out this spring and perhaps another 2X squeezed out of the RS platform over the rest of the year.
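
For anyone who wants to redo the back-of-envelope numbers above, here is a rough sketch of the arithmetic.  The per-gigabase prices are the loose figures quoted in this post; the roughly 3.1 Gb human genome size and the function names are my own assumptions for illustration, and nothing here accounts for library prep, labor or failed runs.

```python
# Back-of-envelope sequencing cost: price per gigabase x coverage x genome size.
# ASSUMPTION: ~3.1 Gb haploid human genome (not a figure stated in the post).
HUMAN_GENOME_GB = 3.1


def project_cost(cost_per_gb, coverage, genome_gb=HUMAN_GENOME_GB):
    """Raw sequencing cost in dollars for a given fold coverage."""
    return cost_per_gb * coverage * genome_gb


def implied_cost_per_gb(total_cost, coverage, genome_gb=HUMAN_GENOME_GB):
    """Per-gigabase price implied by a quoted whole-project cost."""
    return total_cost / (coverage * genome_gb)


if __name__ == "__main__":
    # Rough prices taken from the text: 454 at ~$10K/Gb, PacBio at worst half that.
    print(f"454 at $10K/Gb, 10X: ${project_cost(10_000, 10):,.0f}")
    print(f"PacBio at $5K/Gb, 10X: ${project_cost(5_000, 10):,.0f}")
    print(f"$/Gb implied by a $100K 10X genome: ${implied_cost_per_gb(100_000, 10):,.0f}")
```

Under these assumptions the 454 figure lands a bit above $300K, consistent with "upwards of a quarter million", while a $100K PacBio genome implies a price of a little over $3K per gigabase, comfortably under the "at worst half" bound.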

Also noticeably absent was any serious mention of BGI, Complete Genomics or Complete Genomics’ Long Fragment Read technology (LFR), covered previously.  In their Nature paper, Complete and collaborators demonstrated much longer haplotyping than is possible with Moleculo, though the underlying approaches are similar.  If BGI wants to be part of this new push for long range information, as they seem to have suggested, they need to get the merger distraction out of the way and start making it clear whether they will roll out LFR as a service (likely) or as a kit.  Nor were the cool "library-on-an-Illumina flowcell" approaches to long range information in evidence at AGBT, but that remains an interesting approach as well.

Moving to applications, for de novo assembly, my bias would be towards PacBio.  Because Moleculo performs de novo assembly, albeit on individual fragments, it can run into problems with long direct repeats and also with any extreme base bias regions which the underlying Illumina technology chokes on.  PacBio had a poster demonstrating reading through a very long VNTR in a mucin gene.  PacBio might have problems getting the exact number of bases correct on a simple repeat array, but should be able to give relatively tight bounds.  In contrast, if Moleculo must deal with a repeat array longer than the fragment size, only some guesstimation based on read depth is going to yield the number of repeats. 
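
To illustrate what that read-depth guesstimation might look like, here is a minimal sketch.  It assumes the repeat array has collapsed into a single repeat unit in the assembly and that coverage is otherwise uniform, so the copy number is roughly the depth over the collapsed unit divided by the depth over unique flanking sequence.  The function name and the depth values are hypothetical, not anything Moleculo or Illumina described.

```python
from statistics import mean


def estimate_repeat_copies(repeat_depths, flank_depths):
    """Estimate repeat copy number as (mean depth over the collapsed repeat
    unit) / (mean depth over unique flanking sequence).

    ASSUMPTION: uniform coverage apart from the repeat collapse; base
    composition bias in the underlying reads will skew the estimate.
    """
    flank = mean(flank_depths)
    if flank <= 0:
        raise ValueError("flanking depth must be positive")
    return mean(repeat_depths) / flank


if __name__ == "__main__":
    # Made-up per-base depths, purely for illustration.
    flank = [28, 30, 32]
    repeat = [440, 450, 460]
    print(f"Estimated copies: {estimate_repeat_copies(repeat, flank):.1f}")
```

The estimate is only as good as the uniformity of coverage, which is exactly where extreme base composition hurts, so treat the result as a rough figure and not a count.
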
In their AGBT presentation, PacBio made snapping bacterial genomes to a single contig, well, a snap. I’m in the process of testing that for myself, but if this is the case PacBio is likely to become the standard approach to high quality bacterial genomes.  Illumina will still be valuable for surveying large numbers of genomes at much lower cost, but for high resolution PacBio could rule.  However, Illumina makes some strong claims around their new Nextera mate pair kits in this space, and so there may be three grades of genomes: highly fragmented Illumina paired end, good but not single contig Nextera mate pair versions of those and finally PacBio.  If there is much cost differential, then some investigators will settle for that middle ground, which may be useful for most studies.

For other classes of ugly sequence, the two technologies are probably so close that only a very carefully designed head-to-head would flag a clear winner.  For example, talks featured such nasty regions as the mammalian MHC, which is characterized by lots of repeats but not necessarily long simple repeat arrays.
On the other hand, for haplotyping I suspect Moleculo will be more popular than PacBio.  First, if Illumina is to be believed there will be a sizable cost difference, with Moleculo on a human genome perhaps adding around $10K per genome to a project.  Illumina stated in their presentation that a substantial amount of haplotype information could be obtained using low coverage Moleculo, so that may be popular in studies with lots of samples.  As noted above, Moleculo may also simply be popular for those heavily invested in Illumina.

For large genomes, it appears that there will still be challenges.  That will remain the area of opportunity for mapping companies such as OpGen, BioNano Genomics and soon Nabsys. But as the long read sequencing approaches improve, they will be continually chewing upwards into the mapping companies' space.  It's a long way until all the dust settles, which means it will be an interesting space to watch for quite a while into the future.

1 comment:

Podryw said...

I'm eagerly awaiting the next post :)