Wednesday, November 16, 2016

HGP Counterfactuals, Part 6: Ax Sharpening Only

At the beginning of this series, I promised two alternative histories on the Human Genome Project.  Yesterday I explored a timeline in which opponents of the HGP successfully kept it from ever being funded.  Today, I'll try to imagine what would have happened if the project had been funded only to develop new sequencing technology.  One warning: as part of this I will show the most reviled plot in genomics, but to make something other than the usual point.

While it does take a leap to explore any counterfactual scenario, the idea that the HGP could have been funded only for technology development is probably less far-fetched than a complete blockage.  Advocates of sequencing technology, such as George Church, argued that the solution to human genome sequencing was not to settle for small advancements in existing technologies, but rather to develop radical leaps.  This approach would also have avoided the trap I discussed in the installment on 1990s sequencing technologies, in which useful sequencing forced genome centers into near-term production goals, ultimately starving "blue sky" technology development.  Holding off the mapping proponents would have been challenging; a tech-dev-only solution would have been politically difficult.  But, if the powers-that-were had only secured funding to support a few dozen R01-type grants rather than mongo sequencing centers, perhaps the wisdom of a technology-first strategy would have won out.  Note the NHGRI graph below on genome budgets; even if kept to about $50M per year, a lot of technology development could have 

Nearly everyone in genomics hates the NHGRI's plot of human genome sequencing costs, mostly because it is so overused.  It is also very misleading, as the fine print explains.  Many costs of sequencing aren't properly included, and it is particularly dangerous to try to extrapolate from this plot to a plot for any other size genome.  The usual "better than Moore's Law" comparison has become very tired as well.  But the plot is data, and useful data, especially combined with a differently scaled plot that goes back further.

This first plot shows the cost of sequencing in the timeframe 1990-2004 given in cost per basepair of finished sequence (from  I'm actually a bit shocked that the halving time is as good as two years.  Note the relatively constant slope; the wiggles and such in it are as likely measurement error as anything.  This is the time of the peak HGP effort, with cost reductions coming from various improvements in the fluorescent sequencers (such as read length) and economies of scale.
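To put that two-year halving time in perspective, here is a minimal sketch of the arithmetic.  The function names are my own for illustration, and the only inputs are the 1990-2004 window and the two-year halving time mentioned above; no actual cost figures from the plot are assumed.

```python
import math


def fold_reduction(years: float, halving_time: float) -> float:
    """Total fold drop in cost after `years` if cost halves every `halving_time` years."""
    return 2.0 ** (years / halving_time)


def halving_time(cost_start: float, cost_end: float, years: float) -> float:
    """Halving time implied by a drop from cost_start to cost_end over `years`."""
    return years * math.log(2) / math.log(cost_start / cost_end)


# 1990 to 2004 at a two-year halving time is seven halvings.
print(fold_reduction(2004 - 1990, 2.0))  # 128.0
```

So a steady two-year halving over that window works out to roughly a 128-fold drop in cost per finished basepair, which is respectable but nothing like what came later.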

Here's the notorious NHGRI plot, plotting a different variable (an estimated cost of sequencing a whole human genome) from 2000 to 2015.  We can see that the slope was relatively constant until about 2007, and then it radically changed.  That corresponds to when polony-type sequencers, particularly from Solexa/Illumina, started taking over.  Now that's real change!  That steeper slope was sustained for about 4 years then tailed off.  We really haven't seen yet if the new slope around 2015 is sustainable (I think that is the Illumina X10's contribution), but with so much activity in the space I would expect that even if we are really in another shoulder, another steep drop is coming.  Also, this plot is really for re-sequencing a genome; the cost of data to support de novo assembly of a human is still in the $10K-$20K ballpark (I think, but I usually get in trouble for making such estimates).

So, what would this world look like, in which only technology development was funded?  Ideally a lot of technologies would be funded at modest levels.  This discourages building large overhead, and also there was no crystal ball.  I called these technologies exotics, but what were they?  Here are the ones I knew about and can remember; please chime in within the comments with ones I have missed.  Many of these would probably have failed even with more investment, but that is a bit the point.  Nobody had a crystal ball, so a pool of possibilities was needed to give good odds that one or a few would perform.

Out of the technologies I described in the sequencing section, two of the multiplex techniques might have really pushed boundaries given more investment.   I also covered some of this ground a while back from a different angle, which was to ask what conditions drove the explosion of "next generation" sequencing technologies in the latter half of this century's first decade.

In the last couple of years I was in the lab, George was funded for an approach that would have pushed the multiplex sequencing technology to possibly hundreds of probings per blot, and also made the probings simultaneous.  Instead of radioactivity or chemiluminescence, labels would have been mass-encoded and read by mass spec. I'm not sure why that technology didn't come to fruition, but it sounded promising.   The effort to switch multiplex sequencing from Maxam-Gilbert to Sanger might well have also borne fruit, if given more time.

I've previously described some of the early work in George's lab on nanopore sequencing; unbeknownst to us at the time David Deamer and Dan Branton were going down similar routes. Over twenty years later, nanopore sequencing is now a reality; could that time have been shortened with more investment?

Several groups independently envisioned sequencing-by-synthesis concepts in the mid-1990s which would yield 454, Polonator, SOLiD, Ion Torrent and Helicos.  Kevin Davies covered these nicely in his $1000 genome book.  George's lab started working on it only after I had effectively left in fall 1996, but some of the seeds of it were already there.  George was already thinking about the problem of PCR on a solid surface, having worked with someone who had developed phosphoramidite monomers which could be covalently linked to polyacrylamide.  By 1999, George's group had a manuscript and I think some of the other groups had patents.  With more secure public funding for development, these technologies might well have come to fruition several years earlier, though they might also have pushed the limits of image analysis 

Another technology that was proposed was to further expand and miniaturize fluorescent Sanger sequencing.  ABI launched a 96-capillary sequencer; why not 384?  Why not 1000?  It is hard to see radical improvements here, other than possibly spreading the capital cost of the instrument across more samples.  Microfluidic Sanger instruments were proposed also, which would offer the promise of reducing reaction volumes and therefore the consumption rate of the expensive fluorescent dideoxy reagents.  Such a device was described in 1995, though the perennial challenge with microfluidics is interfacing them to the macro world.

Electron microscopy has an allure for DNA sequencing; why not just stretch DNA out and image the sequence?  Unfortunately, that allure still hasn't paid off.  In the 1990s, the hot idea was scanning tunneling microscopy, known to have atomic resolution.  Unfortunately, what was originally thought to be promising DNA signals turned out to be patterns in the graphite substrate the DNA was mounted on.  More recently, several companies have been trying to use transmission electron microscopy.  Halcyon Molecular is defunct, but ZS Genetics seems to still be plugging away (well, their website is live if potentially stale).  Could better funding have gotten these off the ground?  That is, of course, unclear, though it is obvious that these "big instrument" companies are a bit of a throwback given the current trend in benchtop or pocket-sized sequencers.

At one of those Hilton Head conferences I remember George introducing me to Ed Southern, who was trying to work out ways to sequence DNA by either building it up or degrading it.  Later on at least two groups would try to develop sequencing based on capturing the bases released by an exonuclease degrading the DNA.  Oxford Nanopore would also attempt this, using nanopores to provide the analytical capability that had stymied the other groups.  For example, Seq Ltd's approach required chilling the stream of nucleotides down to liquid helium temperatures so that the native nucleotides could be distinguished by spectroscopic methods.

Let's see, what else was floating around in the mid-90s for ideas.  Base extension on Affymetrix microarrays.  Pyrosequencing, which would be successfully deployed by 454 in the mid-2000s.  Lynx's sequencing-by-ligation approach with clonal DNA on beads (conceived, I believe, by Sydney Brenner and far ahead of its time).

Ahhhh, the one I almost forgot tonight: a technology that was tried very seriously without much direct impact, but ended up having a huge impact on informatics development.  Sequencing by hybridization went through several academic and corporate phases, but I don't know of a public sequence generated with it of any consequence.  The idea was to array cloned DNAs on a membrane and then probe them for every possible N-mer.  Given the challenge of probing with all oligos of some length, N was typically 8.  The mathematical challenge of reconstructing the sequence from such data was soon transferred over to the problem of working with the short sequences generated from early sequencing-by-synthesis instruments, which could generate much longer accurate subsequences.  These kmer-based (yeah, the letter switched) algorithms still dominate short read assembly.
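For readers who haven't seen the reconstruction problem, here is a toy sketch of one common formulation: treat each kmer as an edge between its prefix and suffix and walk an Eulerian path through the resulting graph (the de Bruijn graph idea).  That choice of formulation, the function names, and the ten-base example sequence are all my own assumptions for illustration, not a description of what any particular SBH group actually ran.

```python
from collections import defaultdict


def kmer_spectrum(seq, k):
    """All overlapping kmers in seq; a stand-in for the probes that hybridized."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def reconstruct(kmers):
    """Rebuild a sequence by walking an Eulerian path over (k-1)-mer nodes.

    Assumes the kmers admit an Eulerian path; repeats longer than k-1
    make that path ambiguous, which this toy does not try to resolve.
    """
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1
    # Start at the node with one more outgoing edge than incoming, if any.
    start = next((n for n, b in balance.items() if b == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:  # Hierholzer's algorithm, iterative form
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


# Sorting throws away the original order, as a hybridization experiment would.
probes = sorted(kmer_spectrum("ATGGCGTGCA", 4))
print(reconstruct(probes))  # ATGGCGTGCA
```

The toy works because no 3-mer repeats in the example sequence.  Real genomes are full of repeats longer than the probe, which is a big part of why both SBH and, later, short read assembly turned out to be so hard.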

And probably a few I've forgotten plus some more ideas that might have sprouted if teams had been funded to think about them.  

So how might this have played out?  In my prior scenario, I came to the conclusion that Craig Venter's path would have been only modestly changed by a lack of an HGP.  His sequencing factories wouldn't have had quite the same growth in efficiency, given the much smaller network effects, but still by around 1998 or 1999 he probably could have talked his way into seeing Celera launched, after first passing through HGS and TIGR.  Could a reasonable competing technology have been developed in that time so that a public, inexpensive genome project could have been launched?

Well, that's about eight years to get a technology working.  As I mentioned, sequencing-by-synthesis in the Church lab apparently went from idea to proof-of-concept in under two years.  Pyrosequencing was first described in 1998; 454 started publishing in 2005.  Those two examples suggest that under a decade would be plausible for generating new sequencing technologies that could radically bend the cost curve.

Of course, should real progress have been made and that technology made publicly available, Venter wouldn't have ignored it.  So perhaps this plan would have saved Celera a sizable pile of money as well.  

Depending on the technology that was developed and its characteristics, the strategies for both making and sequencing physical maps could have changed radically.  For example, suppose a short read technology had emerged in the mid-1990s (about a decade earlier than reality), capable of generating 20-30 basepair reads.  The informatics developed for sequencing-by-hybridization would have again been converted to this new task, though that would take time.  With a sufficiently efficient process, restriction fingerprinting (a technology not on a path to a sequencing technology) could have been replaced with a more high-content approach.  The question of developing a minimum-tiling path of clones across the genome might well have become moot; if clone sequencing is efficient enough one needn't bother.

A less-rosy outcome would have been all-around failure to hit the desired goals.  Perhaps the imaging and computation of the time would have been pushed too far, or perhaps the informatics methods to use these new tools wouldn't have been developed quickly enough.  As I've noted, short read sequences have only proven their mettle because they had a good reference to map them to; de novo assembly of eukaryotes with short reads leads to a horrible fragmented mess.  But short read sequences are very suitable for expression analysis, particularly in unsequenced genomes, so even if a good mix for genome sequencing couldn't be found, EST sequencing would have been revolutionized (and renamed even sooner to RNA-Seq!).  This was the path Lynx was going down, reading very short tags from cDNAs.

So a technology-development strategy might have been high risk, potentially not moving the price of a human genome down very much.  Or it could have led to a radical price drop, enabling a lot more biology to be done for a lot less.  That's the problem; counterfactual crystal balls are as non-existent as real-world ones.

In the final installment in this series, I'll reflect on the lessons I learned going through this.  I tried hard not to have strong pre-conceptions going in, and I believe it paid off in that I gained some new insights on where the genome project could have gone differently to overall advantage.  Of course, I do come with a lot of prejudices and opinions, so someone else going through this exercise would certainly favor different paths through an alternate history, probably leading to different outcomes.
