Tuesday, October 31, 2023

Concept: An Oxford Nanopore Adaptive Sequencing IDE

Oxford Nanopore's adaptive sequencing scheme is truly singular (but not Singular!), enabling computational adjustment of the sequencing process as it occurs.  A number of academics have demonstrated proofs-of-concept of different ways this capability can be used, with Oxford Nanopore slowly incorporating some of these higher level concepts into their MinKNOW operating software.  An idea has been rattling around in my head since London Calling that what this space truly needs is a full Integrated Development Environment (IDE) to support adaptive sequencing.  It's more than a little conceited for me to do this, given that I've contributed nothing to that field and the only adaptive sequencing experiment under my guidance was quite disappointing.  But it's an interesting enough idea that I can't resist.

An intro on adaptive sequencing for those who are unfamiliar with the idea.  The duty cycle of a pore on the ONT devices has the following phases.  First, there is a waiting phase where the pore is not engaged.  Then there is binding of an adapted DNA followed by active sequencing in which the DNA traverses from the cis to trans sides of the device.  Once sequencing is complete, the pore is back to the waiting stage.  In adaptive sequencing, bases called from the initial sequencing of a fragment can be used to decide whether to continue sequencing or instead to reverse the voltage for that pore only, ejecting the fragment back to the cis side. There are variations on this pattern which will be discussed later.

Adaptive sequencing is potentially advantageous if the time spent sequencing a long fragment of little interest can be diverted to trying to capture another fragment of greater interest.  Given the translocation speed of an experiment, it is useful to think of library fragments in terms of time.  For example, the current DNA chemistry operates at 450bp/s, so a kilobase fragment takes just over two seconds to traverse the pore and therefore a 100 kilobase fragment takes over three minutes.  That's a lot of time to spend on something boring if the time could be spent instead hunting for something good.  And we can also then think of a nanopore experiment having a total capability in terms of pore seconds - one pore available for one second.  Adaptive sequencing attempts to maximize the value of that fixed (though not precisely known) amount of time.
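To make the pore-second arithmetic concrete, here is a minimal Python sketch.  It assumes a constant translocation speed and ignores the time needed to capture or eject a fragment; the function names are illustrative and not part of any ONT software.

```python
def traversal_seconds(fragment_bp: int, speed_bp_per_s: float = 450.0) -> float:
    """Seconds for a fragment to pass through a pore at a constant speed."""
    if speed_bp_per_s <= 0:
        raise ValueError("speed must be positive")
    return fragment_bp / speed_bp_per_s


def pore_seconds_saved(fragment_bp: int, decision_bp: int,
                       speed_bp_per_s: float = 450.0) -> float:
    """Pore-seconds freed by ejecting after decision_bp bases
    instead of reading the whole fragment."""
    unread_bp = max(fragment_bp - decision_bp, 0)
    return unread_bp / speed_bp_per_s


if __name__ == "__main__":
    print(f"1 kb fragment:   {traversal_seconds(1_000):.1f} s")
    print(f"100 kb fragment: {traversal_seconds(100_000) / 60:.1f} min")
    print(f"saved by ejecting a 100 kb fragment after 450 bp: "
          f"{pore_seconds_saved(100_000, 450):.0f} pore-seconds")
```

The sooner the decision is made, the larger the share of the fragment's pore-seconds that goes back into hunting for something better.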

So how to decide whether to proceed or not? And how much sequence to peek at, given that the utility of adaptive is greatest if you can reliably decide as fast as possible?  One common pattern is barcode balancing, in which the barcode is read and then a decision is made to try to reduce imbalance among the barcoded samples.  More on that momentarily.  Another common pattern is to use a reference database plus a set of ranges, i.e. a FASTA file and a BED file. Those ranges, ideally (though I believe not in the current ONT implementation) with polarity, define which sequences to keep and which to reject.  BOSS-RUNS extended this concept by performing off-line computations to update the BED regions in order to progressively focus the sequencing on regions requiring more attention.
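As a rough illustration of the FASTA-plus-BED pattern with polarity, here is a minimal Python sketch.  It assumes the start of the read has already been mapped by some other tool to a contig, position and strand; the `Target` class and the `parse_bed` and `decide` functions are hypothetical names, not the ONT implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Target:
    contig: str
    start: int                    # 0-based, inclusive (BED convention)
    end: int                      # exclusive
    strand: Optional[str] = None  # "+", "-", or None for either strand


def parse_bed(text: str) -> List[Target]:
    """Parse BED text; column 6, when present, supplies the strand."""
    targets = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else None
        targets.append(Target(fields[0], int(fields[1]), int(fields[2]), strand))
    return targets


def decide(contig: str, pos: int, strand: str, targets: List[Target]) -> str:
    """Return "accept" if the mapped read start lies in a target whose
    strand (if specified) matches, otherwise "reject"."""
    for t in targets:
        if t.contig != contig or not (t.start <= pos < t.end):
            continue
        if t.strand is None or t.strand == strand:
            return "accept"
    return "reject"
```

A linear scan is fine for a handful of ranges; a real-time implementation with many targets would more likely use an interval index, since every millisecond spent deciding is sequence lost.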

Understanding how a given adaptive scheme will actually perform is the place for adaptive sequencing simulators, of which there are several already (e.g. Icarust).  And, if one were to implement complex logic, it is important to estimate the time penalty of that computation -- time is sequence! -- so there is a role for an adaptive sampling program profiler.

So there is a sketch of the four components of an adaptive sequencing IDE: a DSL for the real time accept/reject rules, a DSL for off-line computation and re-parameterizing the real time programs (or perhaps re-compiling one), a simulator to estimate performance and a profiler to estimate the time penalties of real time logic.  Likely the two DSLs are really within the same space -- perhaps one set of objects to represent real-time rules and everything else to support off-line decision making.  And let's not quibble about the difference between a really rich software library and a DSL.

Okay, why such complexity?  Well, I'm a control freak, that's why!  But seriously, here are some cases that illustrate why such control and flexibility might be highly desirable.

For example, when we say barcode balancing, what do we mean?  What is the goal?  If it is to achieve greatest balance, then an appropriate approach is to use early sequencing to estimate the relative share of each barcode, then as sequencing progresses we reject the overrepresented barcodes at a frequency to bring things back to balance.  That will achieve the tightest distribution, and from what I understand from a conversation with ONT that is the course they are currently pursuing.  But, in something like a set of microbial genome libraries, this approach runs the risk of ensuring that none of the libraries yield enough data to achieve genome closure.  Perhaps I'd rather each library sequence until "done" and then be deselected.  But what does "done" mean?  Perhaps it's just a total yield of basepairs.  Perhaps it's a yield of sequences over a certain length.  Or perhaps I want to progressively deselect a barcode based on the fragment sequence, keeping reads most likely to cover regions of very high interest or that might bridge gaps. 
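The "sequence until done" idea can be sketched in a few lines of Python.  This is a minimal sketch assuming "done" means reaching a total yield, a yield in long reads, or both; the `BarcodeQuota` class, its parameters and the 10 kb default cutoff are all hypothetical choices for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Tally:
    total_bp: int = 0  # all bases sequenced for this barcode
    long_bp: int = 0   # bases in reads at or above the length cutoff


class BarcodeQuota:
    """Accept a barcode until every configured yield target is met."""

    def __init__(self, target_total_bp: Optional[int] = None,
                 target_long_bp: Optional[int] = None,
                 long_cutoff_bp: int = 10_000):
        if target_total_bp is None and target_long_bp is None:
            raise ValueError("set at least one target")
        self.target_total_bp = target_total_bp
        self.target_long_bp = target_long_bp
        self.long_cutoff_bp = long_cutoff_bp
        self.tallies: Dict[str, Tally] = defaultdict(Tally)

    def record(self, barcode: str, read_len: int) -> None:
        """Call once per finished read, including ejected ones."""
        tally = self.tallies[barcode]
        tally.total_bp += read_len
        if read_len >= self.long_cutoff_bp:
            tally.long_bp += read_len

    def is_done(self, barcode: str) -> bool:
        tally = self.tallies[barcode]
        if self.target_total_bp is not None and tally.total_bp < self.target_total_bp:
            return False
        if self.target_long_bp is not None and tally.long_bp < self.target_long_bp:
            return False
        return True

    def decide(self, barcode: str) -> str:
        return "reject" if self.is_done(barcode) else "accept"
```

Swapping in a different `is_done` is the whole point: the same skeleton could deselect a barcode on assembly contiguity or gap-bridging potential instead of raw yield.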

Given everything that the ONT platform can do, one could imagine additional control might be useful.  For example, one might pause sequencing for a bit to allow a complex computation to occur that radically reprioritizes what is sequenced followed by a device restart.  Adaptive sequencing has actually been demonstrated with different rules for each pore.  While it's hard to think of a general case for that -- it makes a great demo but how to use? -- perhaps some future device would allow depositing libraries in a spatially restricted manner on the device, in which case different regions of the nanopore sensor might be productively assigned different selection schemes.  So it would be a bit of future-proofing.

Oh, I haven't checked if ONT does this, but for much of this it would be very valuable for after-action analysis to have the sequencing file record whether a sequence was truncated due to being ejected.  

Okay, now for some additional twists.  In particular, ONT is achieving higher and higher rates of duplex sequencing with duplex libraries.  In these libraries, the completion of sequencing a strand is frequently followed by entry into the pore of the complementary strand.  Since the first strand has completed, there is potentially a much richer information base for decision making -- a real possibility for example of knowing the methylation state (which could be particularly interesting to decide on in a metagenomic context), length of fragment, precisely where it maps in a reference, etc.  So the real time DSL should have a simple way to trigger a different acceptance/rejection rule if the next strand is the complementary strand.

For example, if running a metagenomics experiment, perhaps we choose whether to read the complementary strand based on a methylation signature.  Or on ORFs detected and their protein signatures.  Or on percent G+C.   Or perhaps in a genome sequencing experiment we don't bother reading the complementary strand if we can confidently call the haplotype-defining SNPs within it.

BTW, there's a need for complementary tools for adaptive sequencing datasets.  For example, if you wish to compute copy number, then it's important to exclude any sequence after the decision point -- after all, our goal in many cases is to distort the sequence coverage over regions of interest.
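One way such a tool might restrict coverage to the unbiased part of each read is sketched below in Python.  It assumes the number of bases read before the decision (`decision_bp`) is known, and that the alignment is ungapped so read length equals reference span; `unbiased_span` is a hypothetical helper, not an existing tool.

```python
from typing import Tuple


def unbiased_span(ref_start: int, aligned_len: int, strand: str,
                  decision_bp: int) -> Tuple[int, int]:
    """Reference interval (0-based, half-open) covered by the bases
    sequenced before the accept/reject decision was made.

    ref_start is the leftmost reference coordinate of the alignment.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    kept = min(aligned_len, decision_bp)
    if strand == "+":
        return (ref_start, ref_start + kept)
    # A minus-strand read starts at the right-hand end of its alignment.
    ref_end = ref_start + aligned_len
    return (ref_end - kept, ref_end)
```

Counting depth only over these spans gives every fragment, kept or ejected, the same weight in a copy number estimate.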

ONT's proposed "outie" sequencing scheme offers options somewhat like duplex but more so.  Outie chemistry sequences a single strand but can do so repeatedly.  I think of it like saws sharpened to cut in only one direction -- on each "downstroke" from cis-to-trans no sequencing occurs but on the "upstroke" from trans-to-cis the strand is sequenced.  This can be repeated indefinitely, potentially enabling extremely high accuracy.  It also means an adaptive sequencing decision logic would have access to a full sequence, its length, methylation profile and a quality profile -- and the risk that if you don't write your code correctly a fragment will truly keep cycling until the end of the run!
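The runaway-fragment risk suggests any such rule needs a hard stop.  A minimal Python sketch follows, assuming the decision logic is handed a pass count and a running consensus quality after each upstroke; the function name, the Q40 target and the ten-pass cap are all made-up illustrative values.

```python
def keep_cycling(pass_count: int, consensus_q: float,
                 target_q: float = 40.0, max_passes: int = 10) -> bool:
    """Return True to send the strand around again, False to release it.

    The max_passes cap guarantees termination even if the quality
    target is never reached.
    """
    if pass_count >= max_passes:
        return False
    return consensus_q < target_q
```

Checking the cap before anything else means a bug in the quality logic costs at most `max_passes` cycles rather than the rest of the run.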

What should such a DSL be written in?  Can we support more than one?  Python is an obvious choice given its widespread use and the fact that it is the dominant language for Oxford Nanopore's development.  But would it perform well enough in such a real time role?  Probably much of the language would really be in something more performant such as C, C++ or Rust, with Python hooks.  And versions of Python are popping up that ditch the Global Interpreter Lock (GIL) that can thwart multithreading.  But maybe it just makes more sense to use Rust as a high performance, high memory safety (and therefore more predictable) language.

Perhaps this is all a pipe dream -- particularly since I'm in no position to actually execute this grandiose vision.  But I do hope these sorts of ideas are seriously considered, as I do believe that adaptive sequencing is truly a rich field for future programming - and so deserves a well-designed, comprehensive software environment to support that.
