Friday, July 24, 2020

Two Pandemic-Related Programming Problems

I will offer here two bioinformatics programming problems which I think are interesting, useful and should be approachable by an advanced undergraduate.  For a variety of reasons I've been thinking a lot of about skill levels and how to assess them.  One key reason is we have two open slots in our group, so I'm plowing through CVs and engaging in the usual hiring funnel struggle -- how do you winnow CVs to phone screens and then down to interviews? We also thought we might, but now won't, bring on a one year intern.  But I'm also trying to take a look at my own skill set with a critical eye.  Plus I maintain a Quora addiction, and you see there people looking for ways to prove their computational biology chops.
I checked yesterday and the GISAID database is up to 65 thousand SARS-CoV-2 sequences.  Yow!  The metadata for breaking down how those were acquired isn't very good, but there have been a lot of methods.  At one point we hoped to be a major contributor, but for a variety of reasons the method development resources have been shifted to our Concentric by Ginkgo testing scheme.  I do usually try to avoid plugging my current employer, but I've sunk about 1000 hours into this (and the meter is still running) so I need the redemption, plus one lesson from our genome sequencing effort is it is possible to overbuild a capacity viz-a-viz the market demand.  Still, we've generated hundreds of genomes which are now in GISAID and NextStrain, which feels good (though not the bazillions we aimed for).

While I can't put a hard number on it, it is clear that the vast majority of SARS-CoV-2 sequences have been generated by tiled amplicon strategies, with a minority from methods such as shotgun metagenomics.  The most popular tiled amplicon strategy is ARTIC, which has 98 amplicons that are approximately 400 bases in size.  There are other designs out there which have longer amplicons or shorter amplicons, but the general scheme is nearly always the same. The amplicons are divided into two pools, with the amplicons in one pool lapping the ones in the other pool much like two rows of bricks in a wall.  In this way, nearly the entire virus can be sequenced -- there's little bits at both ends of the virus that are missed.  

How you go from amplified material to sequence data depends on the platform you've chosen and then a few other variables.  For Nanopore and ARTIC, the most common approach is to ligate adapters and sequence the amplicons whole.  Particularly for longer amplicon schemes, another Nanopore scheme is to use the rapid kit to tagment the amplicons; this is much simpler with less hands on time than ligation.  Actually, in that danger zone where the personal belief arises that even I could run the protocol end-to-end.  If targeting PacBio, which is much less common but is out there (indeed, if you look on GISAID I think you can find every sequencing technology except Helicos and 454, including some obscure Chinese boxes), typically it means really long amplicons (which means high quality RNA and very good reverse transcription conditions) of 5 or even 10 kilobases.  

It isn't uncommon to target Illumina instruments.  Indeed, Illumina now has an Emergency Use Authorization for a test which primarily detects the virus but can also generate viral genome sequences.  With Illumina in addition to ligating adapters or tagmenting, you have the option of tailing the primers with standard "i5" and "i7" tags so that the amplicons can be barcoded and read in the sequencer directly.  This is very streamlined, though it has the drawback with ARTIC of locking you into the three long chemistries -- 2x250 or 2x300 on MiSeq or 2x250 on NovaSeq SP -- with long instrument dwell times.  There is a commercial kit, from Paragon Genomics, that has tailed amplicons designed for the ubiquitous 2x150 chemistry available for any flowcell -- it is impressive that they get the much higher multiplex depth to work (they also have a magic cleanup reagent to help with this).   Tagmentation is very popular for Illumina since you can use ARTIC with any readlength and it is a broadly used and relatively simple method.  It is the use of tagmentation that drives my two problems.

Ground Rules

I'm going to try to be maximally descriptive and minimally prescriptive with these problems.  I would say that if I were grading them one thing I would look at is the use of standard tools and libraries versus wheel reinvention.  I'm generally flexible on language choice; Python or R would be my first choices, something like Scala or Julia would be good.  I'll confess a dislike of PHP and have a degree of animus towards Perl due to being a recovering Perl programmer.  If you submitted to me this in COBOL, make sure I can see your grinning face as you do it!  And while such a code example might be part of evaluating someone for a position, it is neither necessary nor sufficient.

On the other hand, if packages to do these already exist, just submitting that isn't really a good demonstration.  There must be some work to show!  If such packages don't exist, then I'd be far happier hearing of it being in GitHub than getting it as a private message.

I'd be happy to take feedback in the comments or on Twitter as to whether these really are undergraduate-level problems.  Maybe they're much too hard and require too much background knowledge to be considered that.

Problem 1: Specialized Read Trimming

Problem number one is a specialized read trimming issue.  The typical scheme is to amplify the two pools for one sample and then pool by sample prior to tagmenting; in this way all the reads for a given sample are guaranteed to have the same barcode, plus there's half as many samples to tagment, PCR barcode, normalize (if you do that) and such.

But this presents a problem.  For consensus generation and variant calling you don't want to have any primer-derived or transposon-derived sequence sneak into the calculation.  If you were sequencing the amplicons intact, then it is easy to recognize primer sequences.  Or if you tailed the primers with i5 and i7 to improve coverage at the ends, though in our experience this resulted in overrepresentation of the ends -- probably there should be a cocktail of tailed and untailed primers with the ratio between them worked out.  Anyways, an important case to handled are reads that result from a transposon leap into the primer region; these will have only a fragment of the primer and so if you aren't careful you won't recognize the primer-derived sequence.

The solution is to use the span of a read to essentially map it to the actual amplicon it is derived from.  If it can be definitively assigned to an amplicon, then any primer regions that amplicon spans can be left in the sequence and the primer regions for that amplicon definitely trimmed out.  If it cannot be assigned to an amplicon, then any potential primer derived sequences must be removed.  

Problem 2: Deletion Variant Data Generator

When tagmentation of ARTIC works well, it works crazy well; the coverage can be insane.  But with poor quality or low viral load samples, there can be dropouts of data.  Balancing amplicons in a multiplex PCR is very tricky and while ARTIC has done a remarkable job, there is still significant variability.  For consensus generation, it is useful to have rules as to a minimum coverage for calling and what you report where the sequence coverage drops below that minimum.  Since tagmentation breaks up the fragments, such coverage drops can literally be anywhere.

A very interesting edge case -- quite literally -- is if an insertion or  deletion variant coincides with the boundary between regions with sufficient and insufficient coverage.  Since indels alter the numbering of bases in the consensus versus the number of bases in the reference, correctly reporting the range can be trickly.  But more importantly, knowing whether your code is robust to such variation is tricky.  

So the challenge is to generate such test data.  There are many Illumina read simulators out there, but typically they are focused on shotgun sequencing.  The pattern of coverage in these tagmented amplicon libraries may be very different than shotgun coverage.

So what I'm imagining is something which the user supplies a specification of what the desired coverage profile is and what mutations to generate relative to the reference.  The program would then take an existing BAM alignment and use some combination of downsampling (for coverage) and read munging (to generate a new dataset with the desired properties

One interesting aspect of this problem is the downsampling, because it will likely be drastic -- say 10,000X to 10X.  A useful adjunct to the read generator is some sort of analysis showing that the downsampled data preserves key features -- perhaps including the fragment length distribution -- of the original data.

Are these good exercises?  Bad exercises?  Badly specified exercises? Too hard?  Too easy?  Let me know what you think!

No comments: