Wednesday, March 21, 2018

A Most Unfortunate Sequencing Error

If you are in the sequencing business, you'd like to get things right.  But sequencing is a form of measurement and measurement has error.  No matter how diligent and committed you are, sometimes the data doesn't break your way.  Mick Watson has a set of posts and a preprint illustrating quality issues in many deposited bacterial genomes.  Some of those are bad luck and some of those are from complacency.  Some errors radically affect biological interpretation and some don't. I'm going to detail here one of the worst cases of bad luck I've seen, where relatively small errors sat undetected for over a decade and triggered some published head scratching over their erroneous implications. So let's look at the rap sheet of this error.

But first, some background -- and I'll struggle to keep it short.  The errors are embedded in the story of a fascinating and important pharmaceutical and I've spent over half a decade of my life thinking about it now and have therefore accumulated a rich trove of scientific detail and trivia -- and I'm notorious for not wanting to keep such lights under my basket.

In 1975 a group from Ayerst Laboratories published a description of a new antifungal compound and the bacterial strain that produced it.  Because the strain had been isolated from a soil sample from the island Rapa Nui (aka Easter Island), the researchers called this new molecule rapamycin.  Two years later another group published that rapamycin had immunosuppressant properties. And that really got things rolling.

How important is rapamycin?  Try 35K hits in PubMed as I write this -- and 42 years works out to fewer than 16K days.  So rapamycin has averaged more than two citations a day, and that first decade didn't contribute many of those.  Rapamycin, along with FK506 and cyclosporine, made organ transplantation successful.  Rapamycin and its analogs ("rapalogs") have also proven useful in some cancer settings and have been explored for metabolic diseases. As a pharmaceutical it is known generically as sirolimus.  Rapamycin acts by inhibiting a large protein kinase which is part of a central signalling pathway related to cellular energy levels, a protein known as mTOR in mammals and TOR in yeast.  What does TOR stand for?  Well, Target of Rapamycin -- it was via this useful compound that one of the conserved core signalling pathways of eukaryotic cells was discovered.  Millions of years before humans, Streptomyces discovered this -- and we only found it by cribbing from them.

In 1996 a group from the University of Cambridge cloned and sequenced the gene cluster responsible for producing rapamycin in the original Streptomycete.  This was originally called Streptomyces hygroscopicus, but that's a huge catch-all bin which covers a lot of phenotypic space, so the modern term for this strain is Streptomyces rapamycinicus.  Because rapamycin was so interesting, natural product groups went hunting for other strains that would make it, and not only found other Streptomyces but also a related but very different genus of Actinoplanes.  How different?  Well, both make spores -- but Actinoplanes spores have flagella and move! And Streptomyces have linear genomes but Actinoplanes have circular ones.  Plus lots more.

Since it was 1996, it's obvious how the cluster was sequenced -- fluorescent Sanger.  At 100kb, it was quite a feat -- it was about the time I was abandoning an early hobby project of keeping an online list of all Genbank entries I called "The 100kb club" (alas, I don't have a copy -- don't know if the rapamycin cluster made it on it). The group at Genome Therapeutics I collaborated with was the only large sequencing group at that time not using fluorescent Sanger -- they used George Church's radioactive multiplex Maxam-Gilbert method.  Fred Blattner's Wisconsin group gave up on radioactive Sanger around then.

Finding the cluster was old school -- remember that at that time only two complete bacterial genomes were available -- the lowly Mycoplasma genitalium at about half a megabase and Haemophilus influenzae at about 2 megabases.  The first Streptomyces genome wouldn't come until early in this century -- and for good reason.  Not only is the GC content averaging around 72% -- with very little dipping much below the mid 60s -- but they genomes are larger. Much larger.  Streptomyces rapamycinicus comes in at ten megabases.  So they found the cluster not by sequencing but by radioactive probing of a clone library, using a fragment from another cluster (erythromycin) thought to have a similar biosynthetic machinery.

According to the M&M, every base was covered three times by reads.  Which makes my story a bit harder to explain, because somehow things still went unfortunately wrong.  There's now a sequence in Genbank of the whole genome that was assembled from 454 data and that agrees (with minor differences) with assemblies built with Illumina and PacBio data.  Those assemblies testify to how lucky the cloners got, as the genome contains multiple clusters that use the "polyketide synthase" type machinery that rapamycin and erythromycin use.  The paper doesn't report any false starts, but it would be shocking if they didn't have some based on all those other PKS clusters.  Indeed, they did publish another PKS cluster from this strain, which may well have popped out in the primary search.

So what went wrong?  Well, errors can always crop up, particularly in a GC-rich genome.  Miscounting a homopolymer is easy, particularly late in Sanger reads where the peaks are tightly spaced.  Or the data may be distorted by compressions, regions where local secondary structure muddles the spacing of peaks.  That should be taken care of by the "three read coverage" policy, except if some regions are all covered by only the tails of three reads.

I've spoken above of the rapamycin gene cluster.  This is how complex compounds are often synthesized in bacteria, fungi and occasionally plants -- all the specialized genes are packed together into a contiguous stretch of the genome.  One of the genes that rapamycin needs is a chorismatase that converts chorismate, used in primary metabolism for aromatic amino acid biosynthesis, into the starter unit for the polyketide machinery.  Chorismatase is a very rare enzyme, found primarily in other rapamycin-like clusters (such as FK506) and a few other natural product classes.  In rapamycin this gene is called RapK, partly by the ordering of the genes but it does make a nice mnemonic (Khorismatase?).

There are a number of discrepancies between the original Sanger sequence and the modern high throughput ones, but most are relatively trivial.  But one is decidedly not, a pair of erroneous regions that each shift the frame in compensatory ways within the RapK gene.  That is always unfortunate, as a single frameshift may be easier to spot as so much of the sequence will go awry.  RapK appears to have been the first chorismatase sequenced -- the FK506 cluster wouldn't appear until the next year -- so it would be hard to spot in any case.  But once FK506 was sequenced the discrepancy between two ORFs for homologous proteins with one frameshifted would be quite apparent.

Here's the double shift. with the original sequence on top and the correct sequence below

P  P  V  F  S    G  H  L  A  L  A  A  G  G  G         R  L  F  V  S  A  T  A  
P  P  V  F  S  R  A  T  W  L  S  P  P  G  A  D  D  G  R  L  F  V  S  A  T  A  

That is nasty.  The first gap is quite easy to picture -- two homopolymer runs miscounted.  The second is a bit more puzzling without the original data, but presumably was due to a compression.  The M&M doesn't state that a 2+1 rule was required -- two reads on one strand and one on the other - which is a defense against compressions as they typically would generate different anomalies in each orientation.

Why is this so awful?  Well, as more chorismatase sequences entered the public databases a pattern emerged -- that the starred arginine in the lower sequence is absolutely conserved, -- with RapK being the exception.  Not only were multiple chorismatases from FK506-type clusters sequenced (called FkbO there), but also the RapK from a producing Actinoplanes strain.  Plus from that other cluster that was published from Streptomyces rapamycinicus, the cuevanes.  So the first error exactly took out a key residue and the second error helped hide this fact.

Indeed, when a crystal structure for FkbO was published, it showed the architecture that relies on that arginine.  The authors noted that the RapK sequence was anomalous: "Arg228 is the only active-site amino acid not present in all FkbO-type chorismatases; in the chorismatase from the rapamycin cluster (RapK), a glycine is present at this position".  The authors are lucky some reviewer didn't ask them to explore this further, as given the available data it is a big puzzle -- but of course the data contained the nasty error.

So what can we conclude from all this?  Trust but verify!  But, that's rarely practical.  Trying to validate every quirky sequence is financially impossible, even if you had access to all the biological material.  In this case, a very careful analysis of the sequence with a frameshift-aware protein-to-DNA aligner with FkbO as a query might have detected the error, but of course databases could easily have even shorter regions flanked by compensating frameshifts.

I do deserve some blame here -- I've known about the error since sometime in 2012.  Unfortunately, at that time the company's interest in rapalogs was a closely guarded secret.  And without an independent sequence, how to demonstrate that the old sequence is wrong?  Eventually we did make public our SMART platform -- Small Molecule Assisted Receptor Targeting -- which in one version uses rapalogs.  That's a fun story for another day, and besides my days are spent in other ways --  discovering useful molecules bioinformatically.  Alas, the acronym for that would be the opposite of SMART -- fitting but not one I'd like to use!  I've meant to get around to writing this up -- and warning UniProt -- since we went public, but in my usual way never quite did so.  Until now!

Sometime in the near future I'll take up this theme in a slightly different form, following up on a previous post that promised an as-yet undelivered sequel


Some pedant said...

*Mycoplasma genitalium

Keith Robison said...

Thanks for catching that! I work with Mycobacterium sequences routinely, so my brain took the easy route instead of the correct one.

Curt Becker said...

As a founder of ABI who had the pleasure of being more than intimately involved with commercializing DNA synthesis, DNA sequencing and RT PCR.... I have to love the article of those who followed academics' processes and damaged scientific progress. Sorry, Keith... Love your work and contributions to death... but analytical systems are better left to the dedicated (nameless vs celebrity scientists) industrialists who develop and commercialize them.
Just my humble opinion....

Unknown said...

Sorry to go off topic, though the new SSGAC Educational Attainment GWAS is very big news. I am not sure whether it is proper form, though might a bit of homework assistance be requested? I am not sure who else to ask! How do you politely ask people, if they might know where phased human genotypes might be available? What I am looking for is a few phased genomes to see how much variation would be present between a normal gamete and an optimized gamete for EA.

This would seem to be an extremely important question as nearish term such selection might be doable. Wonder how many SD of IQ we could likely enhance intelligence given the 3000 SNPs found in the SSGAC study.

Keith Robison said...

J Ir: Try Genome In A Bottle. ( - several human genomes have been phased by multiple methods and you’ll find the data there.

Unknown said...

Public Service Announcement:

All the kids out there working on their biology homework be advised that
the UCSC Genome Browser and its tools are really great for genomics. They
have 2,000 phased full genomes from the 1,000 Genomes Project that can be assessed.

There are some other websites with genomics tools (that I will refrain from specifically identifying) which require users to somehow interface with their API. I am sure there must be a very thick and complicated user manual that one could to figure all that out if one were so inclined.

This suggestion might even be more helpful than the Genome in a Bottle idea given above.