Sunday, June 29, 2025

Could It Have Been Found With Short Reads?

Initially, the ESHG program was overwhelming.  With the exception of the official opening and closing sessions, every timeslot had multiple parallel sessions - and sometimes competing corporate sessions as well.  Everything would be recorded and available for playback through November.  That took some pressure off "did I pick the wrong session?", but it's also a double-edged sword - I might be attending ESHG for half a year!  So I decided that my focus would be rare diseases, and if sessions on rare diseases competed, then the one most focused on genomics technology.  And that led to hearing what amounted to a refrain in the questions at the end of each long read talk: could this causative variant have been found by short reads?

It must be said that none of these queries seemed to be launched by paladins of The One True Sequencing Technology.  They were always posed amicably and, I think, genuinely.  This is a crowd dominated by clinicians who wish to solve diagnostic odysseys, ideally on the first try.  So they would like to know: must they transition to long reads?

Judging from talks and posters, short read exomes still dominate the field.  It could be worse: a clinical geneticist from Ukraine I happened to sit next to for lunch one day told me that until recently all of her patients were using panels - and all were self-pay.  Radboud University in the Netherlands has stepped up and is offering sequencing for her patients now, but before that - panels.  Not that this was the sole mention of panels at the conference. 

Whenever "could this have been seen with short reads" was asked, the responses were polite - and generally "can't really say".  Scientists don't like to stake out ground where they don't have data, and in most cases there are not parallel short read datasets to answer the question definitively.  And what short read?  PCR-free?  What insert size?  What coverage?

There's also the difference between "could you have found this initially with short reads?" and "can you find it when you know to look for it?"  The latter slices through all the ambiguity and false positives also present in the dataset - it is much easier to rediscover what you know is there than to find it in the first place.

But there were also some really strong candidates for "no, this isn't being detected by short reads".  One that really stuck with me was a talk on recombination - often in the form of an inversion - between a functional gene and a pseudogene copy of the same gene.  In the most challenging cases, this would truly be undetectable by short read WGS, as it is a balanced rearrangement and the breakpoints may be in regions of identity between the gene and pseudogene which are longer than a short read.  Now, if we expand "short reads" to include RNA-Seq and the mRNA is present in an accessible tissue, then expression of the pseudogene bits could be a tell-tale.  Or perhaps not, if the pseudogene is expressed in unrearranged form.  And this again raises the question: which short reads do you mean when you ask "could it be detected with short reads?"
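
To make the geometry concrete, here's a toy Python sketch of my own - not anything from the talk, and the homology tract length is purely illustrative - of when a breakpoint buried in gene/pseudogene identity is even reachable by a given read type:

```python
# Toy model: a breakpoint inside a homology tract is only anchorable if a
# read (split-read evidence) or an entire fragment (discordant-pair
# evidence) can reach from the breakpoint out into unique flanking sequence.
def breakpoint_detectable(homology_bp, read_bp, fragment_bp=0):
    if read_bp > homology_bp:
        return "split-read detectable"
    if fragment_bp > homology_bp:
        return "maybe detectable via discordant read pairs"
    return "invisible: balanced and fully homologous at this scale"

# Gene/pseudogene identity tracts can run to kilobases;
# the 5 kb used here is purely illustrative.
for name, read_bp, frag_bp in [("short 2x150", 150, 500),
                               ("midi ~900 bp", 900, 0),
                               ("long ~15 kb", 15_000, 0)]:
    print(f"{name:>12}: {breakpoint_detectable(5_000, read_bp, frag_bp)}")
```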

There were other examples where assigning variant causality - or knocking a hypothesis down the list - requires establishing phase between two variants in the same gene.  Are they in cis, or in trans and therefore creating a compound heterozygote?  Depending on the neighboring polymorphisms, it may be difficult or even impossible to haplotype such variants using short read data.
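
For readers who like to see the mechanics: a minimal read-backed phasing sketch using pysam (the file name and coordinates are hypothetical; real pipelines would reach for a tool like WhatsHap).  It simply counts reads carrying both alt alleles (cis evidence) versus exactly one (trans evidence) - which of course only works when individual reads span both sites:

```python
import pysam

def cis_trans_counts(bam_path, chrom, pos1, alt1, pos2, alt2):
    """Count reads supporting cis (both alts together) vs trans (one alt).
    Positions are 0-based, per pysam convention."""
    cis = trans = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for aln in bam.fetch(chrom, pos1, pos2 + 1):
            if aln.query_sequence is None:
                continue
            # map reference positions back to read positions
            ref2query = {r: q for q, r in
                         aln.get_aligned_pairs(matches_only=True)}
            if pos1 not in ref2query or pos2 not in ref2query:
                continue  # this read does not span both variants
            b1 = aln.query_sequence[ref2query[pos1]]
            b2 = aln.query_sequence[ref2query[pos2]]
            if b1 == alt1 and b2 == alt2:
                cis += 1           # both alts on one read
            elif (b1 == alt1) != (b2 == alt2):
                trans += 1         # exactly one alt on this read
            # (a fuller treatment would count ref/ref reads as cis support)
    return cis, trans

# e.g.: cis, trans = cis_trans_counts("proband.bam", "chr7",
#                                     117_559_590, "T", 117_559_640, "A")
```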

Another interesting example concerned a gene which doesn't sequence well with short reads, so the variant was missed.  But "not sequencing well" can vary by kit or process - was the library PCR-free?  Might an exome capture do better than WGS in this region?  So again the question of "could it be found with short reads" becomes murky.
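
One way a lab could probe that murkiness on its own data: compute depth over the troublesome exons in each library type and see which actually covers them.  A quick pysam sketch - the BAM, intervals, and the 15x callability floor are all hypothetical:

```python
import pysam

def mean_depth(bam, chrom, start, end):
    # count_coverage returns four per-base arrays (A, C, G, T)
    acgt = bam.count_coverage(chrom, start, end)
    return sum(map(sum, acgt)) / max(end - start, 1)

# hypothetical exon list for a "difficult" gene
exons = [("chr16", 2_100_000, 2_100_180, "exon1"),
         ("chr16", 2_103_400, 2_103_560, "exon2")]

with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for chrom, start, end, name in exons:
        d = mean_depth(bam, chrom, start, end)
        print(f"{name}\t{d:.1f}x\t{'LOW' if d < 15 else 'ok'}")
```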
 
Other interesting examples involved methylation - here the ability of long read technologies to directly read methylation becomes a huge plus.  In one case, translocation of X onto an autosome caused X-inactivation to spread into the autosomal component, silencing a gene.  In another, a tumor suppressor was inappropriately methylated, creating the first hit in the two-hit path.  Of course, it is possible to read methylation with short reads by preparing special libraries - and I'll touch on some new methods in a later post - so even this isn't quite a pure win for long reads.
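
As a taste of why direct readout is so convenient: long read BAMs carry base modification calls in MM/ML tags, and pysam will hand them to you per read, no conversion chemistry required.  A minimal sketch (file and region are hypothetical; serious analysis would use a dedicated tool like modkit):

```python
import pysam

with pysam.AlignmentFile("long_reads.bam", "rb") as bam:
    for aln in bam.fetch("chrX", 73_820_000, 73_850_000):
        mods = aln.modified_bases or {}
        # keys are (canonical base, strand, mod code), e.g. ('C', 0, 'm')
        for (base, strand, code), calls in mods.items():
            if code != "m":
                continue  # keep only 5mC calls
            # each call is (query position, probability scaled 0-255)
            confident = sum(1 for _, q in calls if q >= 230)  # ~P > 0.9
            print(f"{aln.query_name}\t{confident}/{len(calls)} confident 5mC calls")
```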

And next year in Gothenburg it is likely to get murkier, with two more technologies muddying the waters.

As I noted previously, Roche's SBX (now christened Axelios) will offer large numbers of reads in the 800-1000 basepair range, which I call "midi-read sequencing".  Recently I heard someone say "what some commentators are calling midi reads", so maybe the coinage will catch on.  In the Roche corporate session at ESHG, SBX wizard Mark Kokoris presented new data on midi reads, with tuned library prep and sequence expansion protocols delivering even greater yields.  SBX accuracy has also been pushed slightly higher.  Kokoris showed data demonstrating how midi read libraries can improve haplotyping of variants.  Duplex reads will still give significantly higher accuracy - Roche is now flat out claiming Q40 - but without the haplotyping power.  So how rare disease groups will use SBX - one library of each? just midi? just duplex? - remains to be seen.  But widespread adoption of midi reads for rare disease diagnosis could make the "could this be found with short reads" question far more nuanced.
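
The haplotyping argument is easy to see with a back-of-envelope calculation (mine, not Roche's): for reads of length L tiled uniformly over the genome, the fraction of reads covering one het site that also reach a second het site d bases away is roughly (L - d)/L:

```python
# Fraction of reads covering one het site that also cover a second site
# d bases downstream, for uniformly tiled reads of length L.
def same_read_fraction(read_len, dist):
    return max(read_len - dist, 0) / read_len

for L in (150, 900, 15_000):        # short, midi, long
    for d in (100, 500, 2_000):     # spacing between the two het sites
        print(f"L={L:>6}  d={d:>5}  p={same_read_fraction(L, d):.2f}")
```

At 150 bp, variants 500 bp apart are never co-phased by a single read; at 900 bp, nearly half the covering reads phase them - which is the midi pitch in a nutshell.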

Even more so with Illumina's Constellation technology.  This is a clever approach that performs the input molecule tagmentation on the flowcell, which means that nanowells near each other on the flowcell are often derived from the same long input molecule.  Illumina's DRAGEN hardware sorts all that out into haplotype calls and structural variants.  Illumina had a poster on Constellation at ESHG, and like previous ones it showed impressive haplotype block lengths.  The one catch right now is that it's one NovaSeq 10B lane per library, so the cost per sample could be high.  It's also hard to see how well this will work for challenges such as simple repeat expansions, though there are statistical ways to call "this is expanded" even if the exact length of the array can't be resolved.  There's also strong interest in Constellation from labs that haven't made the jump to long reads and are trying to avoid that transition.  A GenomeWeb piece on ESHG made this clear, with an early access user in Germany describing a high level of satisfaction - particularly with the extremely simple workflow, since no library prep occurs outside the instrument.  Again, "could this have been found with short reads" will be an even more complex question - particularly since there will be very few truly challenging samples that go through multiple technologies.
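
To illustrate the statistical-call idea - a toy of my own devising, not Illumina's actual method - consider that with 150 bp reads, an allele shorter than the read length should produce no reads that are pure repeat motif.  So even a few fully-repetitive "in-repeat" reads, the signal ExpansionHunter-class tools exploit, says the allele exceeds the read length without ever sizing it:

```python
def is_in_repeat_read(seq, motif, max_mismatch_frac=0.1):
    """True if seq looks like a (nearly) pure tandem array of motif."""
    # try every rotation of the motif so the read need not start in phase
    best = min(
        sum(a != b for a, b in zip(
            seq,
            ((motif[i:] + motif[:i]) * (len(seq) // len(motif) + 2))[:len(seq)]))
        for i in range(len(motif))
    )
    return best / len(seq) <= max_mismatch_frac

reads = ["CAGCAGCAGCAGCAGCAG",   # pure repeat
         "AGCAGCAGCAGCAGCAGC",   # pure repeat, out of phase
         "ACGTTGCAGGATTACAGT"]   # ordinary sequence
n_irr = sum(is_in_repeat_read(r, "CAG") for r in reads)
print(n_irr, "in-repeat reads ->",
      "allele likely longer than the read length" if n_irr >= 2
      else "no expansion signal")
```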

As an aside, a point to ponder is why Universal Sequencing Technology's TELL-Seq method hasn't gotten any love - or BGI's single tube Long Fragment Read (stLFR) approach, which is essentially the same scheme but tied to BGI sequencers.  TELL-Seq is a clever bead-based transposition scheme that takes advantage of the tendency of DNA to wrap around beads.  Each TELL-Seq bead carries transposases with the same oligo barcode - a bead barcode.  So library molecules with the same barcode are from the same bead and therefore probably the same input molecule.  You don't get any ordering information, whereas that might be accessible to Constellation.  It is a more conventional library prep, with PCR and such after the tagmentation reaction - whereas Constellation is just "pipette-and-go".  TELL-Seq libraries also cannot be easily run on the same flowcell as standard libraries; the bead barcode is a long one and requires a custom run recipe.  Perhaps the biggest gap is in software - Illumina integrating Constellation processing into their suite is a great help to acceptance.
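
For the curious, the core reconstruction these linked-read schemes rely on is simple enough to sketch in a few lines of pysam (hypothetical BAM; I'm assuming the barcodes land in the conventional BX tag): group alignments by bead barcode, and the span of each barcode's reads approximates the original input molecule - with, as noted, no ordering information within it:

```python
import pysam
from collections import defaultdict

spans = defaultdict(list)  # (bead barcode, chrom) -> [(start, end), ...]
with pysam.AlignmentFile("tellseq.bam", "rb") as bam:
    for aln in bam.fetch("chr1", 1_000_000, 2_000_000):
        if not aln.has_tag("BX") or aln.reference_end is None:
            continue
        key = (aln.get_tag("BX"), aln.reference_name)
        spans[key].append((aln.reference_start, aln.reference_end))

for (bx, chrom), reads in spans.items():
    if len(reads) < 3:
        continue  # too few reads to confidently call a molecule
    start = min(s for s, _ in reads)
    end = max(e for _, e in reads)
    print(f"{bx}\t{chrom}:{start}-{end}\tinferred molecule ~{end - start:,} bp")
```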

With the cost differential between short reads and true long reads shrinking, another question will remain: what are you really optimizing for?  Given that narrowing gap - and with Constellation right now costing far more than long reads, since it requires one dedicated lane per sample - is it really worth the risk of missing a variant, or of extending the time to diagnosis, by leading with short reads?  And what are the right samples to benchmark short reads vs. midi reads vs. long reads vs. Constellation?  These are the challenges this space will continue to grapple with.
