Thursday, May 02, 2019

Poking at Genapsys Preprint

Genapsys is continuing down the path of pre-launch information, most recently releasing a pre-print.  I'm looking at this pre-print critically and unfortunately turning into a bit of Reviewer #3.  Not that anything is fatal, and pre-publication review is a key value of pre-prints. If I were an actual reviewer I'd be writing mostly the same things and covering more vertebrate species than they sequenced (a human exome panel was included, though most samples were bacterial) -- I'd grouse about a missing figure (which I've provided), carp about critical details not provided and beef over a public data deposit that doesn't really line up with the unqualified claim made in the paper. (*)
Earlier this year I looked at Genapsys' patent portfolio, finding a wide array of ideas about how to improve on sequencing-by-synthesis technology.  The Genapsys preprint, which parallels their talk given back at AGBT, does not incorporate many of the wilder ideas.  What we do have is a system that uses flows of individual unmodified nucleotides to create extensions, which are detected by the change in impedance at an electronic sensor.  Unlike the similar 454 and Ion Torrent technologies, this signal is stable, increasing stepwise with each incorporation.  Some beads in the reaction contain not template but deliberately blocked extensions; these are used for calibration.  The various ideas for dealing with loss of phasing aren't mentioned and are presumably absent.  The current flowcell, with some number of sensors (despite searching repeatedly, I can't find the number -- but then again my glasses could be mistaken for Hubble telescope mirror blanks), yields over 1 gigabase of data.
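To make the flow idea concrete, here is a toy sketch of single-nucleotide flows with a signal that accumulates stepwise.  It is my own illustration, not anything from the preprint: the flow order "TACG", the function name and the idea of reporting a simple running count are all assumptions, and real signals are of course noisy and subject to phasing.

```python
def flow_signal(new_strand, flow_order="TACG", n_flows=8):
    """Toy model of unterminated single-nucleotide flows.

    new_strand : the sequence being synthesized, 5' to 3'
    flow_order : assumed cyclic order of nucleotide flows (hypothetical)
    Returns a list of (flowed_base, bases_incorporated, cumulative_total).
    """
    pos = 0      # next position of the strand still to be synthesized
    total = 0    # cumulative incorporations, i.e. the stable stepwise signal
    trace = []
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        # Unterminated nucleotides extend through an entire homopolymer run
        # within a single flow, which is why run lengths must be counted.
        while pos < len(new_strand) and new_strand[pos] == base:
            pos += 1
            run += 1
        total += run
        trace.append((base, run, total))
    return trace


if __name__ == "__main__":
    for base, run, total in flow_signal("TTACGGG", n_flows=4):
        print(base, run, total)
```

The point of the sketch is only that a homopolymer shows up as one larger step rather than several separate ones, so calling it correctly means telling a step of size 8 from a step of size 9.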

Let's start the remonstrations with the data not made public.  The authors describe results for three bacteria: Campylobacter jejuni representing low G+C (mean 30%), Escherichia coli in the middle and Bifidobacterium animalis subsp. lactis representing elevated G+C (mean 61%).  As an aside, this last one hits a pet peeve: that isn't high G+C!  Please, people, there are easy-to-grow organisms such as various Streptomyces or Micrococcus that are over 10 percentage points higher.  And it matters, as I'll point out a bit further.  Anyway, they also sequence a human exome panel and a low input cancer sample.  The paper claims "Sequence data has been deposited into the Sequence Read Archive (SRA) under BioProject accession number PRJNA529876", but only the Escherichia dataset is at this accession.  That's frustrating because it would be useful, as shown in a moment, to have the extreme G+C datasets as well.

On 454 and Ion Torrent chemistries using unterminated nucleotides, homopolymer counting errors dominate the error profile.  I've seen a bit of Twitter traffic anticipating similar for Genapsys.  Unfortunately, the pre-print doesn't go into any depth on this key topic: accuracy/error information is only presented as bulk statistics.  But this is where the public dataset comes into play: with the wonderful tool counterr I analyzed the E. coli data.  Based on this data, it looks like Genapsys is amazingly accurate at counting homopolymers where the true length is 6 or fewer (there's a tiny shoulder of calls that are one short for true length 6 G or C homopolymers), but then starts falling apart.  When the true length is 7 there is a substantial number of length 6 calls, and 8 has a similar profile of calling one short at 7.  But 9 is a mess, with more calls of run 8 than the correct 9, and 10 is all over the place -- but there's very little data for homopolymers of length 10 (indeed none for A or T runs of that length).  But that's where I really wish I had the extreme G+C data (and especially wish I had data from a true extreme of G+C content), as the number of events for the longer homopolymers is quite low -- so it is hard to get a good handle there.  In any case, this is the sort of detailed error analysis that should be in the preprint, not just the bulk error statistics (such as 0.15% error on the E. coli set -- buried in the supplementary material -- with 0.016% substitutions, 0.089% deletions and 0.044% insertions) that are provided.
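For anyone wanting to see what this kind of homopolymer tally looks like, here is a minimal sketch.  It is not counterr's code or API; the function names are mine, the (true length, called length) pairs are assumed to have already been extracted from alignments by some other tool, and the numbers in the example are made up purely to show the shape of the output.

```python
from collections import Counter, defaultdict


def homopolymer_runs(seq, min_len=1):
    """Yield (base, start, length) for each homopolymer run in seq."""
    i = 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i >= min_len:
            yield seq[i], i, j - i
        i = j


def confusion_by_length(pairs):
    """Tally called run lengths for each true run length.

    pairs : iterable of (true_len, called_len), assumed to come from
            read-to-reference alignments (extraction not shown here).
    """
    table = defaultdict(Counter)
    for true_len, called_len in pairs:
        table[true_len][called_len] += 1
    return table


def fraction_correct(table):
    """Fraction of calls matching the true length, per true length."""
    return {t: calls[t] / sum(calls.values()) for t, calls in table.items()}


if __name__ == "__main__":
    # Runs of length >= 3 in a short reference fragment
    print(list(homopolymer_runs("ACGGGGTTTAAAAAAC", min_len=3)))
    # Made-up observations, NOT real Genapsys data
    toy_pairs = [(6, 6), (6, 6), (7, 7), (7, 6), (9, 8), (9, 8), (9, 9)]
    print(fraction_correct(confusion_by_length(toy_pairs)))
```

With real data the per-length event counts matter as much as the fractions, which is exactly the problem with having so few long homopolymers in a mid-G+C genome.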

On experimental details, data is lacking.  If most of the authors are from a manufacturer and this is the first presentation of a system, then saying "Library molecules were clonally amplified onto beads following the manufacturer's recommendations (Genapsys)" is well nigh useless and fails the minimum standard for reporting a scientific procedure.  So we're still in the dark as to the basic methodology of this critical step; there are absolutely no details as to whether this is emulsion PCR or some other scheme.

The use of blocked reference beads is an interesting element of the system, but what is the ratio of experimental to reference beads?  No details.  The timing of flows?  All we get is "the chip was incubated for several seconds" -- is that a constant or is it plus or minus a moment?

Similarly, the approach to calculating Q-scores is not described in a way that could be replicated: "For every base call a set of predictor measures were computed that were used as input into a pre-trained Phred quality table in order to obtain a quality score".  No details on the training -- in this day of fancy machine learning models that is a striking travesty.  And we can find from counterr that their model consistently overestimates Phred scores at the high end; a reported Phred of 45 is really closer to 40.  Another analysis that should be in the paper -- and perhaps if they had done this they'd recalibrate the models.
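To put that 45-versus-40 gap in perspective, the standard Phred definition is Q = -10 * log10(p), where p is the probability the call is wrong.  The sketch below just applies that formula; the helper names are mine, and the empirical calculation assumes you already have error and base counts for a given reported score (as a tool like counterr tabulates).

```python
import math


def phred_to_error(q):
    """Error probability implied by a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def empirical_phred(n_errors, n_bases):
    """Observed quality for a bin of bases sharing one reported score.

    Assumes n_errors > 0; a bin with zero observed errors needs
    smoothing or more data before a score can be assigned.
    """
    return error_to_phred(n_errors / n_bases)


if __name__ == "__main__":
    claimed, observed = 45, 40
    ratio = phred_to_error(observed) / phred_to_error(claimed)
    print(f"Q{claimed} implies p = {phred_to_error(claimed):.2e}")
    print(f"Q{observed} implies p = {phred_to_error(observed):.2e}")
    print(f"Error rate is about {ratio:.1f}x what the score claims")
```

Five Phred units is a factor of 10^0.5, so bases labelled Q45 that behave like Q40 are wrong roughly three times as often as advertised.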

To beat on this horse one more time, the section on variant detection lacks most details as to parameter settings.

Okay, enough complaining.  The pre-print does give us some additional information.  Given the proposed instrument pricing of around $10K (IIRC) and running costs of several hundred dollars per flowcell, this could be an interesting instrument for small labs looking for a low-footprint, low capital cost instrument.  Genapsys is, of course, promising bigger chips than the current one, but time will tell whether they can be more successful at chip scaling than Ion Torrent proved to be.

* - I do note another mammal in the thesaurus entry for complaining, but even if I ignored the sexist overtones of it and used it, my companion -- who is actually a literal example of that term -- is likely to express objection by growling and barking.
