Tuesday, August 31, 2010

Worse Could Be Better

My eldest brother was in town recently on business & in our many discussions reminded me of the thought-provoking essay "The Rise of 'Worse is Better'". It is on a train of thought similar to Clayton Christensen's books -- sometimes really elegant technologies are undermined by ones which are initially far less elegant. In the "WiB" case, the more elegant system is too good for its own good, and never gets off the ground. In Christensen's "disruptive technology" scenarios, the initially inferior technology serves utterly new markets that are priced out by the more elegant approaches, but then slowly but surely nibbles its way toward replacing the dominant one. A key conceptual requirement is to evaluate the new technology on the dimensions of the new markets, not the existing ones.

I'd argue that anyone trying to develop new sequencing technologies would be well advised to ponder these notions, even if they ultimately reject them. The newer and more different the technology, the longer they should ponder. For it is my argument that there are indeed markets to be served other than $1K high quality human genomes, and some of those offer opportunities. Even existing players should think about this, as there may be interesting trade-offs that might go after totally new markets.

For example, I have an RNA-Seq experiment off at a vendor. In the quoting process, it became pretty clear that about 50% of my costs are going to the sequencing run and the other 50% of costs to library preparation (of course, within both of those are buried various other costs such as facilities & equipment as well as profit, but those aren't broken out). As I've mentioned before, the costs of the sequencing are plummeting but library construction is not on such a steep trend.
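To make the trend concrete, here is a minimal sketch of how the split shifts if sequencing keeps getting cheaper while library preparation stays flat. The 50/50 starting point is from the quote above; the fold-drop values and the function name `library_share` are purely illustrative assumptions, not vendor figures.

```python
def library_share(seq_fraction, fold_drop):
    """Fraction of total cost due to library prep after sequencing
    gets `fold_drop` times cheaper and library prep stays constant.

    seq_fraction: sequencing's share of today's cost (0.5 in the quote above).
    fold_drop:    hypothetical factor by which sequencing cost falls.
    """
    lib = 1.0 - seq_fraction          # library prep, held constant
    seq = seq_fraction / fold_drop    # sequencing after the price drop
    return lib / (lib + seq)


# Hypothetical fold drops, chosen only to show the shape of the curve.
for fold in (1, 2, 10, 100):
    print(f"{fold:>4}x cheaper sequencing -> library prep is "
          f"{library_share(0.5, fold):.0%} of the total")
```

Under these assumptions library preparation quickly becomes nearly the whole bill, which is why a technology that removes it looks attractive.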

So, what if you had a technology that could do away with library construction? Helicos simplified it greatly, but for cDNA still required reverse transcription with some sort of oligo library (oligo-dT, random primers or a carefully picked cocktail to discourage rRNA from getting in). What if you could either get rid of that step, read the sequence during reverse transcription or not even reverse transcribe at all? A fertile imagination could suggest a PacBio-like system with reverse transcriptase immobilized instead of DNA polymerase. Some of the nanopore systems theoretically could read the original RNA directly.

Now, if the cost came down a lot I'd be willing to give up a lot of accuracy. Maybe you couldn't read mutations out or allele-specific transcription, but suppose expression profiles could be had for tens of dollars a sample rather than hundreds? That might be a big market.

Another play might be to trade read length or quality of an existing platform for more reads. For example, Ion Torrent is projected to initially offer ~1M reads of modal length 150 for $500 a pop. For expression profiling, that's not ideal -- you really want many more reads but don't need them so long. Suppose Ion Torrent's next quadrupling of features came at a cost of shorter reads and lower accuracy. For the sequencing market that would be disastrous -- but for expression profiling that might be getting in the ballpark. Perhaps a chip with 16X the features of the initial one -- but with only 35bp reads -- could help drive adoption of the platform by supplanting microarrays for many profiling experiments.
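A rough comparison of the two chips, as a sketch only. The read counts and lengths come from the paragraph above; pricing the hypothetical 16X chip at the same $500 per run is my assumption for illustration, since no price for such a chip exists.

```python
def chip_summary(reads, read_len_bp, run_cost_usd):
    """Total bases and cost per million reads for one run."""
    return {
        "total_bases": reads * read_len_bp,
        "usd_per_million_reads": run_cost_usd / (reads / 1_000_000),
    }


# Initial chip as projected: ~1M reads of ~150bp for $500.
initial = chip_summary(reads=1_000_000, read_len_bp=150, run_cost_usd=500)

# Hypothetical 16X chip with 35bp reads; the $500 run cost is an assumption.
profiling = chip_summary(reads=16_000_000, read_len_bp=35, run_cost_usd=500)

for name, chip in (("initial", initial), ("16X / 35bp", profiling)):
    print(f"{name:>10}: {chip['total_bases'] / 1e6:.0f} Mb per run, "
          f"${chip['usd_per_million_reads']:.2f} per million reads")
```

For tag counting the figure that matters is cost per read rather than cost per base, and on that axis the short-read chip comes out 16-fold ahead under the same-price assumption.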

One last wild idea. The PacBio system has been demonstrated in a fascinating mode they call "strobe sequencing". The gist is that the read length on PacBio is largely limited by photodamage to the polymerase, so letting the polymerase run for a while in the dark enables spacing reads apart by distances known to some statistical limits. There's been noise about this going at least 20Kb and perhaps much longer. How long? Again, if you're trapped in "how many bases can I generate for cost X", then giving up a lot of features for such long strobe runs might not make sense. But suppose you really could get 1/100th the number of reads (300) -- but strobed out over 100Kb (with a 150bp island every 10Kb). I.e. get 5X the fragment size by giving up about 99% of the sequence data. 100 such runs would be around $10K -- but would give a 30,000 fragment physical map with markers spaced about every 10Kb (and in runs of 100Kb). For a mammalian genome of about 3Gb, 30,000 fragments of 100Kb span roughly 3Gb, so before any loss due to unmappable islands that would be about a 1X coverage physical map -- not shabby at all!
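A back-of-envelope check of the strobe scenario, as a sketch. The read counts, span and island spacing are the figures from the paragraph above; the ~3Gb genome size is my assumption for a typical mammal, and "span coverage" here means total fragment span divided by genome size.

```python
READS_PER_RUN = 300            # strobed fragments per run (from the text)
RUNS = 100                     # number of runs (from the text)
FRAGMENT_SPAN_BP = 100_000     # each fragment strobed out over 100Kb
ISLAND_BP = 150                # sequenced island length
ISLAND_SPACING_BP = 10_000     # one island every 10Kb
GENOME_BP = 3_000_000_000      # ASSUMPTION: ~3Gb mammalian genome


def physical_map_stats(reads_per_run, runs, span_bp, island_bp,
                       spacing_bp, genome_bp):
    """Summarise a strobe-based physical map, ignoring unmappable islands."""
    fragments = reads_per_run * runs
    islands_per_fragment = span_bp // spacing_bp
    sequenced_bp = fragments * islands_per_fragment * island_bp
    return {
        "fragments": fragments,
        "islands_per_fragment": islands_per_fragment,
        # Physical (span) coverage: how many times the fragments tile the genome.
        "span_coverage": fragments * span_bp / genome_bp,
        # Sequence coverage: only the islands are actually read.
        "sequence_coverage": sequenced_bp / genome_bp,
    }


stats = physical_map_stats(READS_PER_RUN, RUNS, FRAGMENT_SPAN_BP,
                           ISLAND_BP, ISLAND_SPACING_BP, GENOME_BP)
for key, value in stats.items():
    print(f"{key}: {value}")
```

With these inputs the fragments span the genome about once while the islands themselves cover only a percent or two of the bases, which is the trade the scenario is proposing. Scaling `RUNS` or `READS_PER_RUN` scales the span coverage linearly.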

Now, I won't claim anyone is going to make a mint off this -- but with serious proposals to sequence 10K vertebrate genomes, such high-throughput physical mapping could be really useful and not a tiny business.


Anonymous said...

Using the PacBio strobed reads for physical mapping is an interesting idea. I think that there are some computational problems with it though, unless PacBio can get much higher accuracy on single reads than they currently get. It doesn't do much good to have spaced reads if the error rate is so high that they can't be mapped to contigs assembled from higher-quality data.

nexgensequencing said...

One thing we seem to be overlooking also is that the polymerase need not be holding on to the DNA for multiple 10 Kb runs. In fact I doubt the polymerase can be processive for more than 1 Kb on a regular basis. So for strobe sequencing we are looking at possibly 100-200 base sequences (in a circle) being looked at multiple times. Another huge disadvantage with the strobe method is the inability to initiate the strobing at specific positions (at least for now; it is easily fixable). In effect you might only be doing a 1x coverage of the whole sequence rather than 5x coverage of a particular section of the sequence, which in my opinion is more useful given error rates.

Keith Robison said...

There appears to have been a slight confusion between the circular consensus mode on PacBio and strobe sequencing with very long fragments. The PacBio Science paper puts the processivity of their polymerase at over 70Kb and PacBio has claimed they can go 100Kb.

I'd agree that what you get from each strobe is unpredictable & may be of low quality -- so perhaps a lot of the reads in my scenario would not be (fully) usable and perhaps fewer, longer islands would be the best strategy.

Jackal said...

I think that you have put your finger on a major cost consideration here which is frequently overlooked. The steep increase in sequencing output associated with the NGS "race to the top" can be used for sequencing large genomes (including heterogeneous cancer genomes) in one run. Alternatively, the output can be used to sequence pooled smaller targets in a highly multiplexed fashion. If one wishes to deconvolute the reads, a molecular barcode needs to be incorporated into each sequencing library. Many future research and diagnostic NGS applications will require multiplexing, and unless much cheaper library preparation methods are developed, we will be unable to fully benefit from the inexpensive ultra high output NGS.