Thursday, May 30, 2019

ONT's Horrid Data Storage Folly

In the coverage of ONT's announcements at London Calling, I specifically left out one of Clive's Skunk Works projects.  This would be a device to use nanopores to write digital information.  I'll give him points for creativity, but the reason I held this for here is I truly and sincerely hate the concept and didn't want my vitriol to distract within the other piece.  But here I will let loose.
DNA has attracted attention from both academics and tech giants as a possible medium for storing digital information.  Since we can both read (sequence) and write (synthesize) DNA, and DNA is inherently a digital medium, the possibility should be obvious.  Most schemes use some sort of encoding with error correction to convert digital information into sequences which are then synthesized with flanking common sequences to enable PCR amplification and perhaps retrieval of subsets of information from within a larger pool.

Some basic characteristics of such a scheme using current technology -- and proof-of-concept has been demonstrated by multiple groups -- can be sketched out.  Writing is slow and expensive, requiring phosphoramidite chemistry.  Reading is slow and expensive, requiring sequencing.  Most groups have used Illumina, but some have used Nanopore and many have tried to design their encoding to be as platform-agnostic as possible.  But copying is really fast and cheap, via PCR.  Power-off storage could potentially last for hundreds of thousands of years, if the DNA is dried appropriately.  And data densities are extreme.  Plus, the medium is universally readable: all you need to do is write down a relatively simple encoding scheme and any future group could read the data.  Chisel the encoding scheme into some nearby rocks and you're set!  So it could make sense for archival storage of large amounts of rarely-accessed information, but certainly not as a replacement for any current media for something you need on short intervals.

One attractive aspect is the idea that the tech companies might apply their outsize funds towards the cost of reading and writing, enabling us biologists to free-ride on valuable improvements.  For example, enzymatic synthesis appears to be getting close to reality, with a publication, press releases and preprints from multiple groups (e.g. this and this).  However, we should not expect a guarantee that we will benefit:  two preprints, one from George Church's lab and another from Ron Davis's lab, are imprecise about how many bases are added each cycle.  With the right encoding scheme, having one T or two Ts in the sequence might be irrelevant.  But for synthetic biology, it will almost certainly be a disaster -- and this issue would occur on every cycle.  I'll confess a certain reflexive negativity to any data storage scheme that doesn't enable synthetic biology.

Clive hates having two big, expensive devices (synthesizer and sequencer) and the ugly chemistry of phosphoramidites.  Those are fair complaints.  But his solution is quite awful in concept, not considering many important points.

Clive proposes that when driving DNA through a pore it should be possible to both read it and make marks on it.  The nature of these marks was not described, but methylation is an obvious possibility.  Or perhaps deaminating cytosines to uracil.  If this could be triggered with light as Clive suggested, which is plausible, then any strand going through the system could be marked up.   

Note that this is marking, not writing!  Dogs mark; people write.  Clive talked of using natural DNA as the medium.  He mentioned complex amoeba DNA as a candidate, but never mentioned amplicons.  Perhaps he thought that point was obvious, but I worry he didn't.  One needs flanking sequences for copying, though he never mentioned the idea of copying -- he was proposing writing personal photographs to DNA and then throwing the cartridge on a shelf.  The written DNA would be on the trans side of the membrane.

In reading DNA on nanopore, only a tiny fraction of the DNA ever goes through the pore.  So only a tiny amount of DNA will be written, meaning the written DNA will be very dilute on the trans side.  So somehow the later recapture will need to be very efficient.  Unbelievably efficient.  And unlike cryogenic body freezing, I doubt the requisite consumer gullibility exists to assume that this magical recovery will be developed during the time your data is sitting in storage.  But then again, to quote Mencken, nobody ever went broke underestimating the intelligence of the American people.

Clive repeatedly mentioned he hadn't run any numbers.  That shows, as the numbers show this concept to be a non-starter.

MinION caught the enthusiasm of many early on because while it was enormously worse than anything existing for some metrics -- cost per basepair, accuracy -- it had other aspects in which it could beat some if not all competing systems:  cost of ownership, time to data, modification detection, read length, portability, robustness in field situations, etc.

Let's look at the competition for a data storage solution.  Let's base our analysis around Flongle and say that Flongle can generate 2 gigabases of data (not quite demonstrated) and suppose that ONT figures out a way to mark the DNA so that one base can be one bit.  We'll convert everybody else into gigabits as well -- an earlier draft using gigabytes got confusing (working with giganybbles didn't help either).  And let's just remember, but park for now, that on this device these are raw gigabits, not actual user gigabits.  Also, we are just sketching things, so we won't worry about the difference between binary giga and decimal giga or anything being very precise.   It really won't matter, as you'll see.

For around the cost of a Flongle flowcell I can buy a USB spinning disk drive which holds 4 terabytes, or to convert to our preferred units, 32 terabits.  So even in the best of circumstances, ONT is behind by a factor of 16,000!  But that's comparing raw to usable -- probably nobody will be so confident as to trust data to a single DNA strand.  Clive touted that the device should be able to read and verify at the same time -- but writing probably won't be perfect, so again some redundancy is required.  How much is unknown -- but is 10-fold unreasonable?  Now we're at 160 thousand fold inferiority.  So this isn't solved by simply doubling the pore translocation speed and doubling the density of pores -- that buys you only two doublings when we need seventeen!  And that's just to be competitive with off-the-shelf proven technology.
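For anyone who wants to check my arithmetic, here is a rough sketch of the capacity comparison in Python.  The variable names are mine, the 10-fold redundancy is only my guess from above, and I'm using decimal giga since we agreed not to fuss over binary versus decimal.

```python
import math

# Figures taken from the post; everything is in gigabits.
FLONGLE_GIGABITS = 2      # 2 gigabases at an optimistic 1 bit per base
DISK_TERABYTES = 4        # the USB spinning disk
REDUNDANCY = 10           # a guess at the redundancy needed, not a measured value

# 4 terabytes -> 32 terabits -> 32,000 gigabits (decimal units)
disk_gigabits = DISK_TERABYTES * 8 * 1000

raw_ratio = disk_gigabits / FLONGLE_GIGABITS      # disk vs raw Flongle capacity
usable_ratio = raw_ratio * REDUNDANCY             # disk vs usable Flongle capacity
doublings_needed = math.log2(usable_ratio)        # doublings required to catch up

print(f"raw: {raw_ratio:,.0f}x  usable: {usable_ratio:,.0f}x  "
      f"doublings: {doublings_needed:.1f}")
```

Swap in your own redundancy guess if you think 10-fold is too harsh; even at 1-fold the gap is still about fourteen doublings.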

Well, what about some other parameter of interest?  If current sequencing speeds of 450 bases per second per pore can be converted to writing speeds, then a Flongle-like device is going to write 2 gigabits in a number of hours.  Versus the hard drive in a matter of minutes.
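A back-of-envelope version of that write time, again as a sketch.  The channel count is my own assumption (I'm using 126, the figure I recall for Flongle, which isn't stated anywhere above), and this pretends every channel writes flat out for the whole run, which is generous.

```python
# Figures from the post, plus one labeled assumption.
PORE_SPEED_BASES_PER_SEC = 450     # current sequencing speed per pore
BITS_PER_BASE = 1                  # the optimistic marking density used above
TARGET_BITS = 2_000_000_000        # 2 gigabits
ASSUMED_CHANNELS = 126             # ASSUMPTION: channel count, not given in the post

# Aggregate write rate if every channel is busy all the time.
bits_per_second = PORE_SPEED_BASES_PER_SEC * BITS_PER_BASE * ASSUMED_CHANNELS

write_seconds = TARGET_BITS / bits_per_second
write_hours = write_seconds / 3600

print(f"{write_hours:.1f} hours to write {TARGET_BITS:,} bits")
```

Under those assumptions it comes out a bit under ten hours, and real pore occupancy would only make it longer.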

What about copying speed?  By marking DNA, Clive's concept discards any sort of fast copying by PCR that is a key feature of most DNA data storage schemes.  So no win there.

Compactness? Perhaps the ONT cartridge will be smaller than the drive -- but to compare evenly we need to buy thousands of ONT cartridges to hold the data of one spinning disk.

Or we could switch comparators.  For the same sort of price we can buy an SD card holding only 512 gigabytes -- 4 terabits.  So only about two thousand fold more data capacity than ONT's proposed device.

I'm happy to entertain that the ONT device will have some axis of superiority -- but what would it be?  Any suggestions? It doesn't help that Clive is pitching this as a personal data solution.  Most DNA storage schemes are necessarily, given the large capital cost, aimed at storing big public records for many generations.  That might warrant big, expensive solutions.  But my iCloud photos aren't exactly something I'd pay a serious premium to store for millennia.

Now, in fairness, if ONT really could generate even 1 bit of stored information per 4 bases of data, they'd probably be far ahead of any of the other DNA storage schemes in terms of cost.  Phosphoramidite chemistry is at best 5 cents a base or so, which means 2 gigabits is going to cost hundreds of millions of dollars.  Which isn't an argument for ONT to plunge in -- it's a sign that storing information in DNA is still an academic problem which is a long way from being cost-feasible.  And ONT's proposal throws away the fast-and-cheap copying aspect that makes the systems interesting.
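The synthesis cost is easy to sketch too.  The function name is mine and the two densities are just the ones mentioned in this post (1 bit per base, and 1 bit per 4 bases); the 5 cents a base is my rough best-case figure, not a quote from any vendor.

```python
# Rough best-case synthesis price from the post, kept in cents to avoid
# floating-point fuzz.
CENTS_PER_BASE = 5
TARGET_BITS = 2_000_000_000   # 2 gigabits


def synthesis_cost_dollars(bits, bases_per_bit):
    """Dollars to synthesize enough bases to hold `bits` at the given density."""
    return bits * bases_per_bit * CENTS_PER_BASE / 100


# 1 base per bit is the optimistic case; 4 bases per bit is the modest one.
for bases_per_bit in (1, 4):
    cost = synthesis_cost_dollars(TARGET_BITS, bases_per_bit)
    print(f"{bases_per_bit} base(s) per bit: ${cost:,.0f}")
```

That is $100 million at one base per bit and $400 million at four, before any redundancy.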

I've been generous in giving benefit of the doubt that such a device is going to be straightforward.  As noted, modifying DNA reliably as it flies past at 450 bases a second should hardly be assumed.  In addition, given ONT's routine derision aimed at optical systems, it's a bit humorous that they are now proposing using optics.  How much will these precision optics cost? 

I'm particularly frustrated that this utterly unbaked idea is the first surfacing of a DNA writing concept from Oxford Nanopore.  A few years ago Clive mentioned that Oxford had launched a subsidiary to explore DNA writing.  A general assumption is that it would be enzymatic synthesis using the VolTRAX platform.  That's not a slam dunk; I suspect it can be done but whether the economics would work is questionable.  But that would at least be plausible and could create an unusual device with some niche applications, such as synthesizing DNA in field settings -- maybe you need a PCR primer to assess abundance of a newly discovered bacterium.  Or perhaps it would be a personal synthesizer for rank-and-file bench biologists with small synthesis needs.  

Clive's team at Oxford has generated some amazing technology, and I'd be remiss to not note that criticisms such as this were aimed at the concept of nanopore sequencing.  Indeed, each year at London Calling Clive or CEO Gordon Sanghera or both review the "you'll never do that" items.  But those were technical criticisms with technical assumptions that ONT busted through.  This has both huge technical problems and even more monstrous economic problems, going into a market that may not exist.

Hard technical problems have been what ONT has solved repeatedly, though often taking much longer than initially thought.  They have all sorts of interesting problems in the sequencing space for which there are clear markets.  Wasting capital and human resources on an ill-thought-out plan that lacks a clear economic justification would be absolutely foolish.  As follies go, this unnamed idea of a DNA writing appliance is a potentially disastrous one.  The only value I see for ONT in this is as an object lesson in what happens when ideas aren't subjected to critical internal scrutiny before being revealed to the world.  Bad ideas are distractions that ONT can scarcely afford, particularly when their long read competition will probably soon be well funded and backed by a massive sales force.


Clive G. Brown said...


Keith youre making a lot of assumptions about the architecture of the device based only on things already launched that have been designed for another purpose. I’m actually thinking about high density FET arrays with millions of channels writing say ~ 5-50 bits per channel per second. Which is plenty for personal archiving. As for the Trans side thing, again, you’re making assumptions based on current design, not future design (*sigh* everybody does this). Just for example, the droplet based systems allow trans recovery. It is also possible to unzip and rezip dsDNA keeping everything on the cis side with certain architectures, it is also possible to read-until file header information from cis side molecules and so seek sectors. There’s even a crude fseek function per molecule with unzip-rezip.

Yes i don’t care about copying. No I’m not storing, maybe even not working, in aqueous, nor drying.... and so on.

Keeping hating. Its motivational ;).

C. J. Olsen said...

This reminds me of Lantz's piece, "Why the Future of Data Storage is (Still) Magnetic Tape"

Anonymous said...

Only Clive would stop by and consider this a "hate" post when it offers him valuable criticism. Vintage Clive.