Showing posts with label DNA sequencing.

Wednesday, February 23, 2011

Has Ion Torrent Taken A 318-Sized Lead over MiSeq?

About a week ago, Ion Torrent's President Greg Fergus and Head of Marketing Manesh Jain were kind enough to engage me in nearly an hour of discussion about the Ion Torrent platform. One agreement prior to our discussion was that I would withhold this piece until their announcement today of the 318 chip for the system (I also volunteered to let them see a draft in advance to ensure I had not misrepresented anything).

A key theme on their side was a feeling that the wrong questions are being asked in the analysis of PGM versus MiSeq -- and an eagerness to shift the discussion. They wished to emphasize a number of points, and after the discussion I can see the validity of many of them.

Monday, June 07, 2010

What's Gnu in Sequencing?

The latest company to make a splashy debut is GnuBio, a startup out of Harvard which gave a presentation at the recent Personal Genomics conference here in Boston. Today's Globe Business section had a piece, and Bio-IT World and Technology Review covered it as well. Last week Mass High Tech & In Sequence (subscription required) each covered it a bit too.

GnuBio has some grand plans, which are really in two areas. For me the more interesting one is the instrument. The claim they are making, with an attestation of plausibility from George Church (who is on their SAB, as is the case with about half of the sequencing instrument companies), is that a 30X human genome will cost $30 in reagents on a $50K machine (library construction costs omitted, as is unfortunately routine in this business). The key technology, from what I've heard, is the microfluidic manipulation of sequencing reactions in picoliter droplets. This is similar to RainDance, which has commercialized technology out of the same group. The description I heard from someone who attended the conference is that GnuBio is planning to perform cyclic sequencing by synthesis within the droplets; this will allow minuscule reagent consumption and therefore low costs.

It's audacious & if they really can change out reactions within the picoliter droplets, it is technically quite a feat. From my imagination springs a vision of droplets running a racetrack, alternately receiving reagents and being optically scanned both for which base came next and for an optical barcode on each droplet. I haven't seen this description, but I think it fits within what I have heard.

Atop those claims comes another one: despite having not yet read a base with the system, by year end two partners will have beta systems. It would be amazing to get proof-of-concept sequencing by then, let alone an instrument shippable to a beta customer (this also assumes serious funding, which apparently they haven't yet secured). Furthermore, it would be stunning to get reads long enough to do any useful human genome sequencing even after the machine starts generating reads, let alone enough for 30X coverage.

The Technology Review article (a magazine I once read regularly and had significant respect for) is depressingly full of sloppy journalism and a failure to understand the topic. One paragraph contains two doozies:
Because the droplets are so small, they require much smaller volumes of the chemicals used in the sequencing reaction than do current technologies. These reagents comprise the major cost of sequencing, and most estimates of the cost to sequence a human genome with a particular technology are calculated using the cost of the chemicals. Based solely on reagents, Weitz estimates that they will be able to sequence a human genome 30 times for $30. (Because sequencing is prone to errors, scientist must sequence a number of times to generate an accurate read.)

The first problem here is that yes, the reagents are currently the dominant cost. But if library construction costs are somewhere in the $200-500 range, then once reagents drop far below that it's a bit dishonest to tout (and poor journalism to repeat) a $30/human genome figure. Now, perhaps they have a library prep trick up their sleeve or perhaps they can somehow go with a Helicos-style "look Ma, no library construction" scheme. Since they have apparently not settled on a chemistry (which will almost certainly impose technology licensing costs -- or mean developing a brand new chemistry -- or getting the Polonator chemistry, which is touted as license-free), anything is possible -- but I'd generally bet this will be a clonal sequencing scheme requiring in-droplet PCR. The second whopper there is the claim that the 30X coverage is needed for error detection. It certainly doesn't hurt, but even with perfect reads you still need to oversample just to have good odds of seeing both alleles in a diploid genome.
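To put a number on that last point, here's a quick back-of-the-envelope calculation of my own (not GnuBio's or the article's): even with error-free reads, each read overlapping a heterozygous site samples one of the two alleles at random, so you need real depth before you can be confident of having seen both.

# Probability of observing both alleles of a heterozygous site, assuming
# error-free reads, Poisson-distributed coverage, and each read drawing
# one of the two alleles with probability 1/2. My own illustration only.
from math import exp

def p_both_alleles(mean_depth, max_depth=120):
    total = 0.0
    p_depth = exp(-mean_depth)          # P(depth = 0)
    for d in range(1, max_depth + 1):
        p_depth *= mean_depth / d       # Poisson recurrence: P(d) from P(d-1)
        p_both = 1.0 - 2.0 * 0.5 ** d   # chance of missing allele A or allele B
        total += p_depth * p_both
    return total

for cov in (2, 5, 10, 20, 30):
    print(f"{cov:>2}X mean coverage: P(see both alleles at a site) = {p_both_alleles(cov):.4f}")

Even perfect chemistry leaves the low-coverage numbers ugly, which is the real reason for deep oversampling of a diploid genome.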

Just a little later in the story is the claim "The current cost to sequence a human genome is just a few thousand dollars, though companies that perform the service charge $20,000 to $48,000", which confuses what one company (Complete Genomics) may have achieved with what all companies can achieve.

The other half of the business plan I find even less appealing. They are planning to offer anyone a deal: pay your own way or let us do it, but if we do it we get full use of the data after some time period. The idea is that by building a huge database of annotated sequence samples, a business around biomarker discovery can be built. This business plan has of course been tried multiple times (Incyte, GeneLogic, etc.) and hasn't really worked in the past.

Personally, I think whoever is buying into this plan is deluding themselves in a huge way. First, while some of the articles seem confident this scheme won't violate the consent agreements on samples, it's a huge step from letting one institution work with a sample to letting a huge consortium get full access to potentially re-identifiable data. Second, without good annotation the sequence is utterly worthless for biomarker discovery; even with great annotation, randomly collected data is going to be challenging to convert into something useful. Plus, any large scale distribution of such data will butt up against the widely accepted provision that subjects (or their heirs) can withdraw consent at any time.

The dream gets (in my opinion) just daffier beyond that -- subjects will be able to join a social network which will notify them when their samples are used for studies. Yes, that might appeal to a few donors, but will it really push someone from not donating to donating? It's going to be expensive to set up & potentially a privacy leakage mechanism. In any case, it's very hard to see how it is going to bring in more cash.

My personal advice to the company is many-fold. First, ditch all those crazy plans around forming a biomarker discovery effort; focus on building a good technology (and probably selling it to an established player). Second, focus on RNA-Seq as your initial application -- it is far less demanding in terms of read length & will allow you to start selling instruments (or at least generating data) much sooner, giving you credibility. Of course, without some huge drops, the cost of library construction will dwarf that $30 in reagents, perhaps by a factor of 10. A clever solution there using the same picodroplet technology will be needed to really get the cost of a genome to low levels -- and could be cross-sold to the other platforms (and again, perhaps provide a revenue stream while you work out the bugs in the sequencing scheme).

Finally, if you really can do an RNA-Seq run for $30 in total operating costs, could you drop an instrument by my shop?

Monday, December 28, 2009

Length matters!

I was looking through part of my collection of papers using Illumina sequencing and discovered an unpleasant surprise: more than one does not seem to state the read length used in the experiment. While to some this may seem trivial, I have a couple of interests here. First, it's useful for estimating what can be done with the technology, and second, since read lengths have been increasing, it is an interesting guesstimate of when an experiment was done. Of course, there are lots of reasons to carefully pick read length -- the shorter the length, the sooner the instrument can be turned over to another experiment. Indeed, a recent paper estimates that for RNA-Seq, IF you know all the transcript isoforms and are only interested in measuring transcript levels, then 20-25 nucleotides is quite sufficient (they didn't, for example, discuss the ideal length for mutation/SNP discovery). Of course, that's a whopping "IF", particularly for the sorts of things I'm interested in.

Now in some cases you can back-estimate the read length using the given statistics on numbers of mapped reads and total mapped nucleotides, though I'm not even sure these numbers are reliably showing up in papers. I'm sure to some authors & reviewers they are tedious numbers of little use, but I disagree. Actually, I'd love to see each paper (in the supplementary materials) show their error statistics by read position, because this is something I think would be interesting to see the evolution of. Plus, any lab not routinely monitoring this plot is foolish -- not only would a change show important quality control information, but it also serves as an important reminder to consider the quality in how you are using the data. It's particularly surprising that the manufacturers do not have such plots prominently displayed on their website, though of course those would be suspected of being cherry-picked. One I did see from a platform supplier had a horribly chosen (or perhaps deviously chosen) scale for the Y-axis, so that the interesting information was so compressed as to be nearly useless.

I should have a chance in the very near future to take a dose of my own prescription. On writing this, it occurs to me that I am unaware of widely-available software to generate the position-specific mismatch data for such plots. I guess I just gave myself an action item!
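Something along these lines would do for a first pass -- a sketch only, assuming the pysam library and an aligned BAM whose records carry MD tags (most aligners write them); "sample.bam" is just a placeholder filename.

import pysam                      # assumes pysam is installed
from collections import defaultdict

mismatches = defaultdict(int)     # per-cycle mismatch counts
totals = defaultdict(int)         # per-cycle aligned-base counts

with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped or read.is_secondary:
            continue
        # with_seq=True pulls reference bases from the MD tag; mismatched
        # positions come back as lowercase letters
        for qpos, rpos, refbase in read.get_aligned_pairs(with_seq=True):
            if qpos is None or rpos is None or refbase is None:
                continue          # skip insertions, deletions and clipped bases
            # convert query position to sequencing cycle for reverse-strand reads
            cycle = qpos if not read.is_reverse else read.query_length - 1 - qpos
            totals[cycle] += 1
            if refbase.islower():
                mismatches[cycle] += 1

for cycle in sorted(totals):
    print(cycle + 1, round(mismatches[cycle] / totals[cycle], 4))

The per-cycle rates printed at the end are exactly the numbers I'd want to see plotted in every supplement.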

Sunday, May 31, 2009

Teasing small insertion/deletion events from next-gen data

My interest in next-generation sequencing is well on its way to shifting from hobby to work-central, which is exciting. So I'm now really paying attention to the literature on the subject.

One of the interesting uses for next-generation sequencing is identifying insertion or deletion alleles (indels) in genomes, particularly the human genome. Of course, the best way to do this is to do a lot of sequencing, compare the sequence reads against a reference genome, and identify specific insertions or deletions in the reads. However, this is generally going to require a full genome run & a certain amount of luck, especially in a diploid organism as you might not sample both alleles enough to see a heterozygous indel. A cancer genome might be even worse: these often have many more than two copies of the DNA at a given position and potentially there could be more than two different versions. In any case, full genome runs are in the ballpark of $50K, so if you really want to look at a lot of genomes a more efficient strategy is needed.

The most common approach is to sequence both ends of a DNA molecule and then compare the predicted distance between those ends with the distance on the reference genome. If you know the distribution of lengths that the sequence library has, then you can spot cases where the length on the reference is very different. In effect, you've lengthened (but made less precise) your ruler for measuring indels, and so you need many fewer measurements to find them.

One aside: in a recent Cancer Genomics webinar I watched a distinction was made between "mate pairs" and "paired ends" -- except now I forget which they assigned to which label (and am too lazy/time strapped to watch the webinar right now). In short, one is the case of sequencing both ends of a standardly prepared next-generation library, and the other involves snipping the middle out of a very large fragment to create the next-gen sequencing target. Here I was prepared to go pedantic and I'm caught napping!

Of course, that is if you know the distribution of DNA insert sizes. While you might have an estimate from the way the library is prepared, an obvious extension would be to infer the library's distribution from the actual data. An even more clever approach would be to use this distribution to pick out candidates in which the paired end sequences lie well within the distribution, but are consistently shifted relative to that distribution.
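A minimal sketch of that idea -- a crude z-score filter of my own, nowhere near the mixture-of-distributions modelling in the paper discussed next, and with arbitrary thresholds and input format:

# Estimate the library's insert-size distribution from all the pairs, then
# flag regions whose spanning pairs are consistently shifted. A simple
# z-score filter, not MoDIL's mixture-of-distributions model.
from statistics import mean, stdev

def candidate_indels(spans, min_pairs=5, z_cutoff=3.0):
    """spans: list of (region_id, observed_insert_size) for mapped read pairs."""
    all_sizes = [size for _, size in spans]
    mu, sigma = mean(all_sizes), stdev(all_sizes)       # library distribution

    by_region = {}
    for region, size in spans:
        by_region.setdefault(region, []).append(size)

    hits = []
    for region, sizes in by_region.items():
        if len(sizes) < min_pairs:
            continue
        # the standard error of the mean shrinks as more pairs agree
        z = (mean(sizes) - mu) / (sigma / len(sizes) ** 0.5)
        if abs(z) >= z_cutoff:
            # a positive shift suggests a deletion, a negative one an insertion
            hits.append((region, round(mean(sizes) - mu, 1), round(z, 1)))
    return hits

The appeal of pooling many pairs over a region is that the shift you can detect shrinks with the square root of the number of pairs, so modest indels become visible even when they are small relative to the library's spread.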

A paper fresh out of Nature Methods (subscription required & no abstract) incorporates precisely these ideas into a program called MoDIL. The program also explicitly models heterozygosity, allowing it to find heterozygous indels.

In a performance analysis on actual human shotgun sequence, the MoDIL paper claims 95+% sensitivity for detecting indels of >=20bp. For the library used, this means detecting a 10% length difference (insert size mean: 208; standard deviation: 13). The supplementary materials also look at the ability to detect heterozygous deletions of various sizes as a function of genome coverage (the actual sequencing data used had 120X clone coverage, meaning the average nucleotide in the genome would be found in 120 DNA fragments in the sequencing run). Dropping the coverage by a factor of 3 would be expected to still pick up most indels of >=40bp.

Lee, S., Hormozdiari, F., Alkan, C., & Brudno, M. (2009). MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nature Methods. DOI: 10.1038/nmeth.f.256


Wednesday, April 15, 2009

Sequencing's getting so cheap...

Here's a decidedly odd gedanken experiment which illustrates what next-gen sequencing is doing to the cost.

A common way of deriving the complete sequence of a large clone is shotgun sequencing -- the clone is fragmented randomly into lots of little fragments. With conventional (Sanger) sequencing these fragments are cloned, clones are picked and each clone sequenced. By using a universal primer (or more likely primer pair; one read from each end), a lot of data can be generated cheaply.

If you search online for DNA sequencing, a common advertised cost is $3.50 per Sanger read. This probably doesn't include clone picking or library construction, but we'll ignore that. Read lengths vary, but to keep the math simple let's say we average 500 nucleotide reads, which from my experience is not unreasonable, though very good operations will routinely get longer reads.

So, at that price and read length it's $7.00 per kilobase of raw data. For shotgunning, collecting 10X-20X coverage is quite common and likely to give a reasonable final assembly, though higher is always better. At 10X coverage, that means for each 1Kb of original clone we'll spend $70.00.

Suppose we have an old cosmid -- which is about 50Kb of DNA including the vector. Shotgun sequencing it with Sanger chemistry, if building & picking the library were free, would run around $5,250 for 15X coverage. Pretty cheap, right?

Except, for a measly $4700 you can have next gen sequencing of it (and that actually includes library construction costs). 680Mb of next gen sequencing -- or over 13,000X coverage of the cosmid. Indeed, if you left the E.coli host DNA in you'd still have well in excess of 100X coverage of E.coli plus your cosmid. So if you had multiple cosmids, you could actually get them sequenced for the same price, assuming you can distinguish them at the end (or they simply assemble into separate contigs anyway)!
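For anyone who wants to check my math, here is the back-of-the-envelope version, using only the prices and yields quoted above:

# Back-of-the-envelope version of the arithmetic above (prices as quoted).
cosmid_kb = 50                 # cosmid plus vector, ~50 Kb
sanger_read_cost = 3.50        # advertised price per Sanger read
sanger_read_len = 500          # assumed average read length, nucleotides
coverage = 15

reads_needed = coverage * cosmid_kb * 1000 / sanger_read_len
print(f"Sanger shotgun: {reads_needed:.0f} reads = ${reads_needed * sanger_read_cost:,.0f}")

nextgen_cost = 4700            # quoted price, library construction included
nextgen_yield_mb = 680
print(f"Next-gen run: ${nextgen_cost} for {nextgen_yield_mb * 1000 / cosmid_kb:,.0f}X coverage of the cosmid")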

Sequencing so cheap you can theoretically afford 99% contamination! Yikes!

Of course, it's unlikely you'd really want to be so profligate. Rather than resequence E.coli, you could pack a lot of inserts in. But it does underline why Sanger sequencing is quickly being relegated to a few niches (for example, screening clones in synthetic biology projects) & why the price of used capillary sequencers is reputed to be going south of $30K.

Friday, September 28, 2007

A little follow-up

Sometimes soon after writing something I see an item related to the post, but I've been lousy about doing anything with it. But this week, particularly with a desire to recognize the generosity of friends & strangers, I will.

  • Following my mention of the Cambridgeport chickens, the Boston Globe mentioned some more free-range biotechophilic poultry: there is a wild turkey living near Kendall Square in the vicinity of Biogen Idec. I've seen one gobbler down near North Station, but never there -- which is mildly irritating since I used to walk through there a lot. I'm also reminded of the flock of enormous feral white geese that hang out down by my old haunt of 640 Memorial Drive, though for some reason they don't stir much affection from me (though they have some very passionate defenders every time the parks department suggests thinning the flock).

  • One of our summer interns stopped by for a visit & with a grin announced she had a present for me. I was quite mystified -- and then startled & thrilled to see the nanopore paper in her hands. The paper is more review & overview & planning/dreams than data, but it's hard not to get a little caught up in the enthusiasm. Imagine getting 200 nt/s from nanopores packed in at nearly micron spacing! The article itself doesn't expand much on the process of transforming the sequence into a different defined sequence (with each nucleotide translated into words of multiple nucleotides), but does explore a bit more why this would be useful -- to generate more widely spaced signals -- in a sense, the nanopores just read too quickly. However, it does mention the company (Lingvitae AS) with the technology, and they have some slick animations. The details are a bit sketchy, but with a mention of TypeIIS restriction enzymes (those that cleave outside their recognition site), ligation and the fact that the new words are added at the opposite end, one can make some guesses about how it works -- probably involving circularization. It does sound like you aren't going to get the super-long reads once dreamed about for nanopores, as there isn't talk of long sequences being transliterated into even longer ones, but if you really could get the throughput & have nanopores grabbing new DNA molecules after they finished with old ones, it is possible to imagine getting really amazing sequencing depth.
  • A reader was kind enough to use a comment to point out another article on the new Genome Corp, one showing a bit more conscious connection to the original. The article is still spotty on details, but does have two tidbits. First is a strong hint that electrophoretic separation is still in play here -- but presumably on a very micro scale -- perhaps on-chip? Second, Ulmer wants to set up a very highly optimized DNA factory, not a company selling machines or kits. Pondering different genome sequencing business models is at least a post in itself, but since I currently work in a highly industrialized DNA factory it does hold some resonance.

Thursday, September 27, 2007

What exactly is Sanger sequencing?

Today's GenomeWeb contained an item on yet another genome sequencing startup, Genome Corp (which was the name proposed for the first genome sequencing company). Genome Corp is being started by Kevin Ulmer, who has been involved in a number of prior companies (for a quite effusive description, see the full press release).

Ulmer is an interesting guy. I heard him speak at a commercial conference once & he had the chutzpah to put up the famous Science 'and then a miracle occurs' cartoon with reference to his competition, and then launch into a description of his own blue sky technology. If I remember correctly, it involved capturing nucleotides chewed off by exonuclease & then cooling them to the liquid helium range. Not that it can't be done, but it wasn't actually a high school science fair project either.

The technology is described as "Massively Parallel Sanger Sequencing", with the comment that Sanger is responsible for 99+% of the DNA sequences deposited in GenBank. I hadn't actually thought about it before, but I probably annotated somewhere north of 50% of the bases generated by what would have been the runner-up method a few years back, Maxam-Gilbert, due to some genome sequencing projects run while I was a graduate student. Multiplex Maxam-Gilbert sequencing actually knocked off a few bacterial genomes, but those were corporate projects which were kept proprietary (by Genome Therapeutics, whose corporate successor is called Oscient) -- and those were annotated by a version of my software. Never thought to toot that horn before!

It also reminded me that somewhere I saw one of the other next-generation technologies (I think it was Solexa's) described as Sanger sequencing. Which leads to the title question: what is the essence of Sanger sequencing? I generally think of it as electrophoretic resolution of dideoxy-terminated fragments, but if you think a bit it's obvious that Sanger's unique contribution was the terminators; Maxam-Gilbert used the same electrophoretic separation. So, by that measure, Solexa's method is a Sanger method. On the other hand, ABI's SOLiD isn't (ligase, no terminators) nor is 454's (no terminators). 454 could be accommodated by stretching the definition to using a polymerase and unbalanced nucleotide mixtures to sequence DNA, but that seems a real stretch.

The press release didn't really give much away, and a patent search on freepatents didn't find anything quickly (though it did find another scheme of Ulmer's using aptamers, a periodic idee fixe of mine). There have been publications describing miniaturized, microfluidic Sanger sequencing schemes retaining size separation as a component (e.g. this one in PNAS [free!]), so perhaps it's in that category.


The funding announced is from a public (or quasi-public) fund supporting new technology in Rhode Island. It's not really within commuting range for me (not that I'm looking for a change), but it is nice to see more such companies in the neighborhood. There's at least one other Rhode Island-based next-next generation sequencing startup I've seen, so perhaps the smallest state will yield the biggest genomes!

Tuesday, September 25, 2007

A First Commercial Nanopore Foray?

Today's GenomeWeb carried the news that Sequenom has licensed a bit of nanopore technology with the intent of developing a DNA sequencer with it. The press release teases us with the possibility of sub-kilodollar human genomes.

Nanopores are an approach which has been around for at least a decade-and-a-half -- a postdoc was working on it when I showed up in the Church lab in 1992. The general concept is to observe single nucleic acid molecules traversing through a pore. It's a great concept, but has proven difficult to turn into reality. I'm unaware of a true proof-of-concept publication showing significant sequence reads using nanopores, though I won't claim to have really dug in the literature. Even such an experiment would represent a small step but not an imminent technology -- the first polony sequencing paper was in 1999 and only in the last few years has that approach really been made to work.

Which is one reason I'm a bit apprehensive as to who bought the technology. Sequenom has done interesting things and has a great name (I had independently thought of it before the company formed; if only I had thought to cybersquat!). But, they have had a rough time in the marketplace, and were even threatened with NASDAQ delisting a bit over a year ago. Their stock has climbed from that trough, but they're hardly flush: only $33M in the bank and still burning cash at a furious rate. Can Sequenom really invest what it will take to bring nanopores to an operational state, or will nanopores be stuck with a weak dance partner which steps on its toes? I hope they pull it off, but it's hard to be optimistic.

It would also be nice to learn more about the technology. I found the most recent publication of the group, but it is (alas!) in a non-open access journal (Clinical Chemistry, though oddly Entrez claims it is). I might spring the $15 to read it, but that's not exactly a good habit to get into. The most enticing bit is that the current version apparently relies on generating cleverly-labeled DNA polymers that somehow transfer the original sequence information ("Designed DNA polymers") and then detecting the sequence as the labels are activated by passage through the nanopore. It sounds clever, but moves away from the original vision of really, really long read lengths obtained by reading DNA directly through the nanopore. The question then becomes how accurate is that conversion process and what sorts of artifacts does it generate?

Saturday, September 15, 2007

Scoping out DNA

One of this week's GenomeWeb items mentioned an extension of the research agreement between ZS Genetics and the University of New Hampshire. I've heard their head honcho speak a couple of times, and ZS Genetics should be interesting to watch. They propose to (nearly) directly sequence DNA using electron microscopy. Because DNA, like most organic materials, isn't very opaque to electrons, they have a proprietary labeling scheme to mark the DNA in a nucleotide-specific manner. Electron microscopy is essentially monochromatic, but if I remember correctly the concept is to grey-scale code the various nucleotides.

One attraction of this sort of scheme is a vision of very, very long read lengths -- the ZS Genetics talks mention 20Kb or so. Such long reads have all sorts of enticing applications, from reading through very complex repeat structures to directly reading out long haplotypes.

The devil, of course, is actually doing this. The data I've seen so far suggests that this approach is definitely in the next-next gen category, along with various other imaging schemes, microcantilevers & nanopores. ZS recognizes that they have a ways to go & propose near-term applications in single-cell gene expression analysis. The danger is that they end up stalled there or worse.

Wednesday, August 08, 2007

Too good to be true?

A recent GenomeWeb item stated (digested from a press release) that GATC Biotech in Germany is one of the first customers for ABI's SOLiD sequencing-by-ligation instrument. This machine will complement the Roche 454 FLX and Illumina/Solexa 1G which GATC already has in house, meaning that GATC will have all three launched next-generation sequencing instruments.

The eyebrow-raiser in the press release is
the SOLiD™ System is expected to be installed in early autumn this year and will boost the company's current sequencing capacity from 130 gigabases to 250 gigabases a year.

Nearly doubling capacity with one SOLiD instrument in a shop that already has a 1G and an FLX? If that number is really the impact of the SOLiD, then ABI is taking a huge lead in total reads. Of course, actual performance may vary from projections. Even if that is the joint contribution of the 3 next-gen sequencers, it would underscore what an advance they are -- especially considering how much up-front sample preparation & management work can be jettisoned in comparison to feeding a conventional sequencer.

(Disclosure: my company may be in the market for such services, and I would probably be one of the decision makers in such a decision)

Thursday, June 07, 2007

Illumina-ting DNA-protein interactions

The new Science (sorry, you'll need a subscription beyond the abstracts) has a bunch of genomics papers, but the one closest to my heart is a paper from Stanford and Cal Tech using the Illumina (ex-Solexa) sequencing platform to perform human genome-wide mapping of the binding sites for a particular DNA-binding protein.

One particular interesting angle on this paper is actually witnessing the beginning of the end of another technique, ChIP Chip. Virtually all of the work in this field relies on using antibodies against a DNA-binding protein which has been chemically cross-linked to nearby DNA in a reversible way. This process, chromatin immunoprecipitation or ChIP, was married with DNA chips containing potential regulatory regions to create ChIP on Chip, or ChIP Chip.

It is a powerful technique, but with a few limitations. First, you can only see binding to what you put on a chip, and it isn't practical to put more than a sampling of the genome on a chip. So, if you fail to put the right pieces down, you might miss some interesting stuff. This interacts in a bad way with a second consideration: how big to shear the DNA to. A key step I left out in the ChIP description above is the mechanical shearing of the DNA into small fragments. Only those fragments bound to your protein of interest should be precipitated by the antibody. The smaller your sheared fragment size, the better your resolution -- but also the greater risk that you will successfully precipitate DNA that doesn't bind to any of your probes.

A stepping stone away from ChIP Chip is to clone the fragments and sequence them, and several papers have done this (e.g. this one). The new paper ditches cloning entirely and simply sequences the precipitated DNA using the Illumina system.

With sequencing, your ability to map sites will now be determined by the ability to uniquely identify sequence fragments and again the size distribution of your shattered DNA. Illumina has short read lengths, but the handicap imposed by this is often greatly overestimated. Computational analyses have shown that many short reads are still unique in the genome, and assemblers capable of dealing with whole-genome shotgun of complex genomes with short reads are starting to show up. One paper I stumbled on while finding references for this post includes Pavel Pevzner as an author, and I always find myself much wiser after reading a Pevzner paper (his paper on the Eulerian path method is exquisitely written).

In this paper, reads of 25 nt were achieved, and about half of those were uniquely mappable to the genome, allowing for up to 2 mismatches vs. the reference sequence. Tossing 50% of your data is frustrating, but with 2-5 million reads in the experiment, you can tolerate some loss. These uniquely mapped sequences were then aligned to each other to identify sites marked by multiple reads. 5X enrichment of a site vs. a control run was required to call a positive.
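Stripped to its essentials, that analysis amounts to something like the sketch below -- bin the uniquely mapped read starts, compare ChIP to a depth-scaled control, and demand 5-fold enrichment. The window size, pseudocount and read-count floor are my own arbitrary choices, not the paper's.

# Bin uniquely mapped read starts into fixed windows and call a window
# positive when the ChIP sample is enriched at least 5-fold over the
# (depth-scaled) control.
from collections import Counter

def call_peaks(chip_starts, control_starts, window=100, min_fold=5.0, min_reads=10):
    """chip_starts / control_starts: iterables of (chromosome, start_position)."""
    chip = Counter((chrom, pos // window) for chrom, pos in chip_starts)
    ctrl = Counter((chrom, pos // window) for chrom, pos in control_starts)
    # scale the control to the same total read count as the ChIP sample
    scale = sum(chip.values()) / max(sum(ctrl.values()), 1)

    peaks = []
    for key, n_chip in chip.items():
        expected = ctrl.get(key, 0) * scale + 1.0        # pseudocount avoids /0
        if n_chip >= min_reads and n_chip / expected >= min_fold:
            chrom, bin_index = key
            peaks.append((chrom, bin_index * window, (bin_index + 1) * window,
                          round(n_chip / expected, 1)))
    return peaks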

One nice bit of this study is that they chose a very well studied DNA-binding protein. Many developers of new techniques rush for the glory of untrodden paths, but going after something unknown strongly constrains your ability to actually benchmark the new technique. Because the factor they went after (NRSF) is well characterized, they could also compare their results to relatively well-validated computational methods. For 94% of their sites, the called peak from their results was within 50nt of the computationally defined site. They also achieved an impressive 87% sensitivity (ability to detect true sites) and 98% specificity (ability to exclude false sites) when benchmarked against well-characterized true positives and known non-binding DNA sites. A particularly interesting claim is that this survey is probably comprehensive and has located all of the NRSF/REST sites in the genome, at least in the cell line studied. This is attributable to the spectacular sequencing depth of the new platforms.

Of course, this is one study with one target and one antibody in one cell line. Good antibodies for ChIP experiments are a challenge -- finding good antibodies in general remains a challenge. Other targeted DNA-binding proteins might not behave so well. On the other hand, improvements in next generation sequencing technologies will enable more data to be collected. With paired-end reads from the fragments, perhaps a significant amount of the discarded 50% of the data could be salvaged as uniquely mappable. Or, just go to even greater depths. Presumably some clever computational algorithms will be developed to tease out sites which are hiding in the repetitive portions of the genome.

It is easy to imagine that in the next few years this approach will be used to map virtually all of the binding sites for a few dozen transcription factors of great interest. Ideally, this will happen in parallel in both human and other model systems. For example, it should be fascinating to compare the binding site repertoire of Drosophila p53 vs. human p53. Another fascinating study would be to take some transcription factors suggested to play a role in development and scan them in multiple mammalian genomes, yielding a picture of how transcription factor binding has changed with different body plans. Perhaps such a study would reveal the key transcription factor changes which separate our development from those of the non-human primates. The future is bound to produce interesting results.

Monday, June 04, 2007

SOLiD-ifying the next generation of sequencing

ABI announced today that it has started delivering its SOLiD next generation sequencing instruments to early access customers and will take orders from other customers (anyone want to spot me $600K?). SOLiD uses the ligation sequencing scheme developed by George Church and colleagues.

Like most of the current crop of next generation sequencers (that is, those which might see action in the next couple of years), SOLiD utilizes the clonal amplification of DNA on beads.

One interesting twist of the SOLiD system is that every nucleotide is read twice. This should guarantee very high accuracy. Every DNA molecule on a given bead should have exactly the same sequence, but by having such redundancy one can reduce the amount of DNA on each bead -- meaning the beads can be very small.

Bio-IT World has a writeup on next generation sequencing that focuses on SOLiD (free!). They actually cover the wet side a surprising amount for an IT-focused mag, and even have photos of the development instrument. An interesting issue that the article brings up is that each SOLiD run is expected to generate one terabyte of image data. The SOLiD sequencer will come with a 10X dual core Linux cluster sporting 15 terabytes of storage. This is a major cost component of the instrument -- though it is worth noting that the IT side will be on the same spectacular performance/cost curve as the rest of the computer industry -- it's pointed out that 5 years ago such a cluster would have been one of the 500 most powerful supercomputers in the world; in a handful of years I'll probably be requisitioning a laptop with similar power.

That still is a lot of data per run; in contrast the top-line 454 FLX generates only 13 gigabytes of images per run -- so there still is an opportunity to develop a 454 trace viewer that runs on a video iPod! A side-effect of this deluge of image data is that ABI is expecting that users will not routinely archive their raw images, but instead let ABI's software digest them to reads and only save the reads. That's an audacious plan -- with the other sequencers, and with fluorescent sequencing before them, archiving was pretty standard: at Millennium we had huge amounts of space devoted to raw traces & NCBI and the EBI have enormous trace archives also. The general reason for archiving the traces is that better software might show up later to read the traces better. SOLiD customers will be faced with either ditching that opportunity or paying through the nose for tape backup of each run.

Since a lot of the same labs are early access customers for the various instruments, one can hope that some head-to-head competitions will ensue to look at cost, accuracy and real throughput. ABI is claiming SOLiD will generate over 1 Gigabase per run, and Illumina/Solexa named their sequencer for similar output (the '1G'), whereas Roche/454 is quoted more in the 0.4-0.5 Gb/run range. Further evolutionary advances of all the platforms are to be expected. For SOLiD, that will mean packing the beads tighter and minimizing non-productive beads (those with either zero or more than one DNA species). In the Church paper, an interesting performance metric was introduced: bits of sequence read per bit of image generated -- in the paper it was 1/10000 -- and the goal of a 1:1 ratio was proposed.

In any case, the density of data achievable is spectacular -- one of my favorite figures of all time is Figure 3B in the Church paper, which uses sequencing data to determine the helical pitch of DNA! The ABI press release mentions using SOLiD to identify nucleosome positioning motifs in C.elegans, and I recently saw an abstract which used 454 to hammer on HIV integration sites to work out their subtle biases. Ultra-deep, ultra-accurate sequencing will enable all sorts of novel biological assays. One can imagine simultaneously screening whole populations for SNPs or going very deep within a tumor genome for variants. Time to pull up a chair, grab a favorite beverage, and watch the fireworks!

Tuesday, December 19, 2006

Next-Gen Sequencing Blips

Two items on next generation sequencing that caught my eye.

First, another company has thrown its hat in the next generation ring: Intelligent Bio-Systems. As detailed in GenomeWeb, it's located somewhere here in the Boston area & is licensing technology from Columbia.

The Columbia group last week published a proof-of-concept paper in PNAS (open access option, so free for all!). The technology involves using reversible terminators -- the labeled terminator blocks further extension, but then can be converted into a non-terminator. Such a concept has been around a long time (I'm pretty sure I heard people floating it in the early-90's) & apparently is close to what Solexa is working on, though Solexa (soon to be Illumina) hasn't published their tech. One proposed advantage is that reversible terminators shouldn't have problems with homopolymers (e.g. CCCCCC) whereas methods such as pyrosequencing may -- and the paper contains a figure showing the contrast in traces from pyrosequencing and their method. The company is also claiming they can have a much faster cycle time than other methods. It will be interesting to see if this holds out.

Given the very short reads of many of these technologies, everyone knows they won't work on repeats, right? It's nice to see someone choosing to ignore the conventional wisdom. Granger Sutton, who spearheaded TIGR's & then Celera's assembly efforts, has a paper in Bioinformatics describing an assembler using suffix trees which attempts to assemble the repeats anyway while assuming no errors -- but with a high degree of oversampling that may not be a bad assumption. They report significant success:

We ran the algorithm on simulated error-free 25-mers from the bacteriophage PhiX174 (Sanger, et al., 1978), coronavirus SARS TOR2 (Marra, et al., 2003), bacteria Haemophilus influenzae (Fleischmann, et al., 1995) genomes and on 40 million 25-mers from the whole-genome shotgun (WGS) sequence data from the Sargasso sea metagenomics project (Venter, et al., 2004). Our results indicate that SSAKE could be used for complete assembly of sequencing targets that are 30 kbp in length (eg. viral targets) and to cluster millions of identical short sequences from a complex microbial community.

Wednesday, November 22, 2006

What's a good 1Gbase to sequence?



My newest Nature arrived & has on the front a card touting Roche Applied Science's 1Gbase grant program. Submit an entry by December 8th (1000 words), and you (if you reside in the U.S. or Canada) might be able to get 1Gbase of free sequencing on a 454 machine. This can be run on various numbers of samples (see the description). They are guaranteeing 200bp per read. The system runs 200K reads per plate and the grant is for 10 plates -- 2M reads -- but 2M x 200 bases = 400Mb -- so somewhere either I can't do math or their materials aren't quite right. The 200bp/read is a minimum, so apparently their average is quite a bit higher (or again, I forgot a factor somewhere). Hmm, paired end sequencing is available but not required, so that isn't the obvious factor of 2.
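For what it's worth, here is the arithmetic that puzzles me, laid out using only the numbers from their materials:

# The arithmetic above, using only the numbers in the grant materials.
reads_per_plate = 200_000
plates = 10
guaranteed_read_bp = 200

total_reads = reads_per_plate * plates
print("guaranteed yield:", total_reads * guaranteed_read_bp / 1e6, "Mb")    # 400.0 Mb

# To actually deliver 1 Gbase from 2M reads, the average read would need to be:
print("average read needed for 1 Gbase:", 1e9 / total_reads, "bp")          # 500.0 bp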

So what would you do with that firepower? I'm a bit embarrassed that I'm having a hard time thinking of good uses. For better-or-worse, I was extended at Millennium until the end-of-year, so any brainstorms around cancer genomics can't be surfaced here. There are a few science-fair like ideas (yikes! will kids soon be sequencing genomes as science fair projects?), such as running metagenomics on slices out of a Winogradsky column. When I was an undergrad, our autoclaved solutions of MgCl2 always turned a pale green -- my professor said this was routine & due to some alga that could live in that minimal world. Should that be sequenced? Metagenomics of my septic system? What is the most interesting genetics system that isn't yet undergoing a full genome scan?

Well, submission is free -- so please submit! Such an opportunity shouldn't be passed up lightly.

Thursday, November 09, 2006

Betty Crocker Genomics



It is one thing to eagerly follow new technologies and muse about their differences; it is quite another to be in the position of playing the game with real money. In the genome gold rush years it was decided we needed more computing power to deal with searching the torrent of DNA sequence data, and so we started looking at the three then-extant providers of specialized sequence analysis computers. But how to pick which one, with each costing as much as a house?

So, I designed a bake-off: a careful evaluation of the three machines. Since I was busy with other projects, I attempted to define strict protocols which each company would follow with their own instrument. The results would be delivered to me within a set timeframe along with pre-specified summary information. Based on this, I would decide which machines met the minimal standard and which was the best value.

Designing good rules for a bake-off is a difficult task. You really need to understand your problem, as you want the rules to ensure that you get the comparative data you need in all the areas you need it, with no ambiguity. You also want to avoid wasting time drafting or evaluating criteria that aren't important to your mission. Of key importance is to not unfairly prejudice the competition against any particular entry or technology -- every rule must address the business goal, the whole business goal, and nothing but the business goal.

Our bake-off was a success, and we did purchase a specialized computer which sounded like a jet engine when it ran (fans! it ran hot!) -- but it was tucked away where nobody would routinely hear it. The machine worked well until we no longer needed it, and then we retired it -- and not long after the manufacturer retired from the scene, presumably because most of their customers had followed the same path as us.

I'm thinking about this now because a prize competition has been announced for DNA sequencing, the Archon X Prize. This is the same organization which successfully spurred the private development of a sub-orbital space vehicle, SpaceShip One. For the Genome X Prize, the basic goal is to sequence 100 diploid human genomes in 10 days for $1 million.

A recent GenomeWeb article described some of the early thoughts about the rules for this grand, public bake-off. The challenges in simply defining the rules are immense, and one can reasonably ask how they will shape the technologies which are used.

First off, what exactly does it mean to sequence 100 human genomes in 10 days for $1 million? Do you have to actually assemble the data in that time frame, or is that just the time to generate raw reads & filter them for variations? Can I run the sequencer in a developing country, where the labor & real estate costs are low? Does the capital cost of the machine count in the calculation? What happens to the small gaps in the current genome? To mistakes in the current assembly? To structural polymorphisms? Are all errors weighted equally, and what level is tolerable? Does every single repeat need to be sequenced correctly?

The precise laying down of rules will significantly affect which technologies will have a good chance. Requiring that repeats be finished completely, for example, would tend to favor long read lengths. On the other hand, very high basepair accuracy standards might favor other technologies. Cost calculation methods can be subject to dispute (e.g. this letter from George Church's group).

One can also ask the question as to whether fully sequencing 100 genomes is the correct goal. For example, one might argue that sequencing all of the coding regions from a normal human cell will get most of the information at lower cost. Perhaps the goal should be to sequence the complete transcriptomes from 1000 individuals. Perhaps the metagenomics of human tumors is what we really should be shooting for -- with appropriate goals for extreme sensitivity.

Despite all these issues, one can only applaud the attempt. After all, Consumer Reports does not review genomics technologies! With luck, the Genome X Prize will spur a new round of investment in genomics technologies and new companies and applications. Which reminds me, if anyone has Virgin Galactic tickets they don't plan to use, I'd be happy to take them off your hands...