Cas9: Illustrating the Difficulty Of Calling Out Obvious
The recent patent battle over Cas9/CRISPR technology is, in my opinion, illustrative of how messy determining obviousness can be. Now, I haven't reviewed the decision in favor of the Broad's patent, nor have I dug into the various filing dates, which I believe proved pivotal, so I'm not really looking at the battle as it actually played out. But there have been frequent complaints on Twitter and elsewhere that the Broad/Zhang claims to human genome editing were obvious extensions of the U.C./Doudna work on bacterial genome editing. I'm now going to attempt to convince you that it was neither obvious nor non-obvious.
The argument for obviousness is, well, obvious, right? Cells are cells, right??
Well, let's take the example of phage integrases. If you take the integrase from PhiC31, a Streptomyces phage, and express it in human cells along with a DNA construct with the appropriate donor site, then you can integrate that construct into the human genome. While PhiC31 doesn't infect human cells, the integration site has enough slop that in a genome as large as human there can be multiple sites which function. Indeed, this could be seen as confirmation of the theory that DNA binding sites evolve to be just constrained enough to be found specifically in their home genome. In fact, PhiC31 integrase has proven so useful in this role that in PubMed I think there are more abstracts for PhiC31 use in mammals than in Streptomyces (though it is a workhorse there).
Now, try the same thing with lambda phage integrase. It won't work. Why not? Well, it turns out that lambda's integrase forms a complex with host proteins, and only with that complex formed can the integrase reaction proceed. In contrast, PhiC31 integrase functions not only in a mammalian cell but also as a purified protein in vitro. It turns out that integrases with an active-site serine, such as PhiC31 integrase, function autonomously, whereas those relying on an active-site tyrosine, such as lambda's, do not.
So how does this apply to Cas9? Well, there are two general schemes for arming Cas9 with its guide RNA. One is to arm it in vitro and transfect the complex into cells; the other is to express the guide RNA in vivo. Only if that RNA is correctly processed in vivo will the in vivo approach work, and that processing relies on specific RNases. Was it obvious that this processing would occur correctly in mammalian cells?
I noted that PhiC31 integrase will function as a purified protein in vitro. In the electrifying Jinek et al paper, Doudna's group showed Cas9 cleavage in vitro. So that is what really made mammalian editing obvious, right?
Well, maybe. But we now know that so called anti-CRISPR proteins exist. What if somehow mammalian cells had ubiquitous anti-CRISPR activity? Or what if trying to express guide RNAs or introducing guide RNA-Cas9 complexes had triggered antiviral responses, a bane of many early attempts at RNAi?
In the end, Cas9 function has turned out to be relatively straightforward, with none of these bugbears materializing. But was this obvious? What degree of uncertainty is sufficient to make something non-obvious? Is it obvious only after successful experiments, in which case success wasn't obvious beforehand? My phage integrase example is hardly unique: lambda-red recombination doesn't even work in every bacterial species.
What Basecalling Is, And Is Not, Potentially Covered by 9,546,400?
PacBio's basecalling patent has 15 claims. I'm going to copy all of them here because it will make things more clear. Or pretty unclear. As always when I try to read patents, I feel like I'm a Perl programmer trying to debug Prolog. It's a whole special language which isn't on the same plane as what I'm used to, so things which might be clear to someone trained in patent law are just plain nonsensical. So here's the list:
1. A method for sequencing a nucleic acid template comprising:
a) providing a substrate comprising a nanopore in contact with a solution, the solution comprising a template nucleic acid above the nanopore;
b) providing a voltage across the nanopore;
c) measuring a property which has a value that varies for N monomeric units of the template nucleic acid in the pore, wherein the measuring is performed as a function of time, while the template nucleic acid is translocating through the nanopore, wherein N is three or greater; and
d) determining the sequence of the template nucleic acid using the measured property from step (c) by performing a process including comparing the measured property from step (c) to calibration information produced by measuring such property for 4 to the N sequence combinations.
2. The method of claim 1 wherein a property in step (c) comprises current.
3. The method of claim 1 wherein the translocation through the pore is driven by the applied voltage.
4. The method of claim 1 wherein the translocation rate through the pore is enzymatically controlled.
5. The method of claim 3 wherein the translocation through the pore is controlled by a polymerase, a helicase, a translocase, a viral genome packaging motor, or a chromatin remodeling complex.
6. The method of claim 1 wherein N corresponds to n-mers comprising 3-mers, 4-mers or 5-mers.
7. The method of claim 6 wherein N corresponds to n-mers comprising 3-mers.
8. The method of claim 1 wherein the method is carried out on an array of nanopores in the substrate.
9. The method of claim 1 wherein the sequencing comprises peak finding by heuristic decision-tree algorithms, Bayesian networks, hidden Markov models, or conditional random fields.
10. The method of claim 1 wherein the comparing process comprises examining a lookup table for each of the 4 to the N combinations, and keeping only those meeting a threshold value.
11. The method of claim 10 wherein threshold value is within 2 sigma of the expected value.
12. The method of claim 1 wherein some of the values for the 4 to the N sequence combinations are degenerate within the error of the measurement.
13. The method of claim 1 wherein after each single-nucleotide translocation through the nanopore, the possible n-mers for that measurement are looked up, and all the possibilities from the previous measurement that are not consistent with the most recent measurement are thrown away.
14. The method of claim 1 wherein N corresponds to n-mers comprising 4-mers.
15. The method of claim 1 wherein N corresponds to n-mers comprising 5-mers.
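Claim 13, in particular, reads like a simple constraint-propagation decoder: each new measurement constrains which n-mers could be in the pore, and candidates are further pruned to those that overlap (by n-1 bases) some survivor of the previous measurement. A minimal sketch of that idea in Python, with an entirely invented calibration table and current levels, might look like:

```python
from itertools import product

BASES = "ACGT"

def consistent_nmers(level, table, tol):
    """All n-mers whose calibrated level is within tol of the measurement."""
    return {kmer for kmer, mu in table.items() if abs(mu - level) <= tol}

def decode(levels, table, tol):
    """Claim-13-style filtering: after each single-base translocation,
    keep only n-mers that match the new measurement AND extend, by one
    base, some survivor from the previous measurement."""
    survivors = consistent_nmers(levels[0], table, tol)
    path = [set(survivors)]
    for level in levels[1:]:
        prefixes = {s[1:] for s in survivors}  # the overlap constraint
        survivors = {c for c in consistent_nmers(level, table, tol)
                     if c[:-1] in prefixes}
        path.append(set(survivors))
    return path

# Toy 2-mer calibration table with made-up, well-separated levels (pA)
table = {"".join(p): 40.0 + 3.0 * i
         for i, p in enumerate(product(BASES, repeat=2))}

print(decode([43.0, 58.0], table, tol=1.0))  # levels chosen to match "AC" then "CG"
```

With real data the levels would be heavily degenerate (claim 12's point), so each step would retain a set of candidates rather than a single n-mer.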
Note that claim #6 covers 3-mers, 4-mers or 5-mers, but claim #7 is for the method of claim #6 with 3-mers. Huh?? On Twitter, one respondent suggested that perhaps claim #6 covers models that mix N-mer sizes while #7 covers those built specifically around 3-mers. Perhaps this is shades of the recent Oxford comma case.
But in any case, from a basecalling perspective the patent appears to cover training models of several types (lookup tables, decision trees, Bayesian networks, hidden Markov models or conditional random fields) with all possible nucleotide sequences of length N, where the claimed N values are 3, 4 and 5.
I did a quick search for relevant papers published before the April 10, 2009 priority date of patent 9,546,400 and found a whole series of relevant papers with a common author. "Nanopore cheminformatics" from Winters-Hilt and Akeson back in 2004 describes using support vector machines (not mentioned in the patent), EM algorithms and HMMs for analyzing nanopore data. "DNA molecule classification using feature primitives" by Iqbal, Landry and Winters-Hilt (2006) also describes HMMs for classifying nanopore hairpins, as well as tossing out the idea that decision trees are often used in this problem space. The abstract for "Cheminformatics methods for novel nanopore analysis of HIV DNA termini" (Winters-Hilt et al, 2006) specifically describes classification of dinucleotides with HMMs. "Analysis of nanopore detector measurements using Machine-Learning methods, with application to single-molecule kinetic analysis" (Landry & Winters-Hilt, 2007) again mentions HMMs and SVMs. "Duration learning for analysis of nanopore ionic current blockades" (Churbanov, Baribault & Winters-Hilt, 2007), "A novel, fast, HMM-with-Duration implementation - for application with a new, pattern recognition informed, nanopore detector" (Winters-Hilt & Baribault, 2007), "Implementing EM and Viterbi algorithms for Hidden Markov Model in linear memory" (Churbanov & Winters-Hilt, 2008) and "Clustering ionic flow blockade toggles with a mixture of HMMs" (Churbanov & Winters-Hilt, 2008) all cover using HMMs for nanopore signal interpretation, with at least dinucleotides as the training set. Curiously, the Pacific Biosciences patent cites none of these papers.
So once you've established HMMs for nanopore base calling trained on 2-mers, is it truly non-obvious to try training them with longer N-mers? After all, isn't every practitioner of the art always tempted to build a bigger, more complex model when a simple one fails? Some of the above papers hint at issues with computing the models, particularly in real time, but the PacBio patent has nothing that I would see as teaching how to get past such issues should they arise. So what exactly is novel and inventive about the PacBio patent?
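To put rough numbers on that temptation: an n-mer model has 4^N states, and an HMM over n-mer states allows 4 transitions out of each state (one per incoming base), so model size grows steeply with N. A quick back-of-the-envelope sketch:

```python
# States and allowed transitions for an n-mer HMM: 4**n states, each of
# which can transition only to the 4 n-mers that extend it by one base.
sizes = {n: (4 ** n, 4 ** n * 4) for n in range(2, 8)}
for n, (states, transitions) in sizes.items():
    print(f"{n}-mers: {states:6d} states, {transitions:7d} transitions")
```

Going from dinucleotides to 5-mers multiplies the state space 64-fold, which makes the real-time computation worries in those earlier papers easy to believe.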
Thinking in the other direction, suppose I train a model on 6-mers, which PacBio did not claim. Suppose I successfully train that model and discover that the model has effectively zeroed out the last position, giving it no weight. Am I now infringing on PacBio's patent by having a model that is effectively a 5-mer model, even though I intended it to be a 6-mer model? Not obvious to me; perhaps someone with patent experience can chime in.
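That thought experiment is easy to make concrete. Suppose, hypothetically, that a trained 6-mer emission table turned out to ignore its final position: the 4^6 entries would then take at most 4^5 distinct values, making it functionally a 5-mer model (all "current levels" below are invented):

```python
from itertools import product

BASES = "ACGT"

def toy_level(kmer5):
    """Invented current level that depends only on a 5-mer."""
    return float(sum(BASES.index(b) * 4 ** i for i, b in enumerate(kmer5)))

# A nominal 6-mer table whose values ignore the sixth base entirely
table6 = {"".join(k): toy_level("".join(k)[:5])
          for k in product(BASES, repeat=6)}

assert len(table6) == 4 ** 6                 # 4096 entries: a "6-mer" model
assert len(set(table6.values())) == 4 ** 5   # but only 1024 distinct levels
```

Any two 6-mers sharing their first five bases emit identically here, which is exactly the "effectively a 5-mer model" scenario.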
One of the papers above mentions decision trees, but mostly as a contrast to its own method, which the authors say obviates the need for decision trees. Is that simple mention in the open literature enough to torpedo patenting decision trees for basecalling? And what algorithm is more obvious than decision trees? I suppose the answer is lookup tables, but that is also a method named in the patent.
A number of commenters on Twitter and on my piece on GridION have made relevant comments as well -- as well as general complaints about the U.S. patent office. Just to head off one rumor, I will note that PacBio's filing US20100331194, which I think is what established the priority date, long predates Oxford's electrifying 2012 AGBT presentation, and this application mentions N-mer methods and the range of algorithms described above. So it would not appear the patent cribbed from Oxford's presentation.
What is the potential impact of PacBio's patent on Oxford's current platform? First, of course, the issues of prior art and obviousness could well erode the PacBio claims to an ineffectual wisp. But even if they stand, Oxford's current base caller uses recurrent neural networks, which do not appear to be in the patents. The not-quite-released Scrappie basecaller uses "transducer" architectures; unless these resolve to one of the named algorithms ("conditional random fields"?), that is outside the patent. Yes, prior basecallers from ONT used HMMs, but those have gone by the wayside. So is PacBio again charging forward shooting blanks? I'd love to hear the counter-argument, because right now that is the opinion I would be leaning towards. Just trying to throw shade on a competitor with low-probability infringement lawsuits seems like a poor strategy and a waste of money on legal expenses.
[22-MAR-17 11:29 fixed bad URL for one of the patents]