Thursday, February 08, 2007

New fangled DNA sequencing

In one of his Week of Science posts, PZ Myers over at Evolgen discusses how shotgun genome sequencing works (there is also a great primer on sequencing over at Discovering Biology in a Digital World). He explicitly covers 'traditional' fluorescent Sanger sequencing & avoids (in his words) "some new fangled techniques". I'll take the bait & give a sketch of the leading class of new fangled methods.

For a quick review, there are three general steps to Sanger sequencing (and the Maxam-Gilbert method that was developed around the same time, but has faded away).

  1. Prepare a sample containing many copies of the same DNA molecule. These will usually be molecules cloned in an E.coli plasmid or phage, but could be from PCR using defined primers.
  2. Generate from that pool a set of DNA fragments which all start at the same place and end on a specific nucleotide; this can be done either in 4 separate pools (one per letter) or in a single pool with a different color tag denoting which letter each fragment ends in.
  3. Sort the fragments by size using electrophoresis. The temporal or spatial sequence of these fragments (really the same thing, but your detection strategy probably uses one or the other) gives you the sequence of the nucleotides.
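
The three steps above can be mimicked in a few lines. This is only a toy sketch: the function names and the example sequence are made up for illustration, and it treats each fragment as nothing more than a length plus the letter it ends in.

```python
import random

def sanger_ladder(synthesized):
    """One fragment per position: (fragment length, base it ends on)."""
    return [(i + 1, base) for i, base in enumerate(synthesized)]

def read_ladder(fragments):
    """Sort fragments by size, as electrophoresis does, then read the end labels."""
    return "".join(base for _, base in sorted(fragments))

fragments = sanger_ladder("GATTACA")
random.shuffle(fragments)      # in the tube the fragments are all mixed together
print(read_ladder(fragments))  # GATTACA
```

Because every fragment has a different length, sorting by size alone is enough to recover the order of the letters.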

Each of these steps has serious issues affecting throughput & cost:

  1. Cloning in bacteria is labor intensive, whether that labor is humans or robots. It takes time & space to grow all those bacteria, and more time, effort & materials to get the DNA out and ready for sequencing.
  2. This step actually isn't so bad, except it implies the next step...
  3. Electrophoresis requires space, even if you miniaturize it; when you image your sequencing ladder, you aren't getting many basepairs per bit of image. DNA also doesn't always migrate predictably; hairpins and other secondary structures can distort things.

A host of companies are working on new approaches. Most next generation sequencing technologies fit the general description below, a class of methods known as sequencing by synthesis. The deviations from this description, and some of the variations noted along the way, provide the opportunities for the many players. Only one commercial instrument, developed by 454 Corporation & marketed by Roche Molecular Systems, has published major data (as far as I know; corrections welcome). George Church's lab has published data with their kit (do-it-yourself types can build one of their own with their directions), which a commercial entity (now part of ABI) is attempting to package up and improve.

  1. Fragment your DNA into lots of little pieces & put standard sequences (linkers) onto the ends of the sequences.
  2. Isolate individual DNA molecules, but without going into bacteria. Instead:

  3. Mix dilute DNA with beads, primers (which recognize the linkers) & PCR mix. By mixing these with oil in the right way, you can turn each bead into its own little reaction chamber -- a layer of buffer (with all the goodies) over the bead and encapsulated by the oil. Many beads will have attracted no DNA molecules, and some more than one. Both of these will be filtered out later.
  4. You can now amplify these individual DNAs without the products contaminating each other. One strand of the amplified DNA is stuck to the beads.
  5. Prepare a population of beads which each originated with a single DNA molecule
  6. Strip off the oil, used up PCR buffer, and the strand not fixed to the bead. Each bead now contains a population of DNA molecules all originating from a single starting molecule.

  7. Pack the beads into a new reaction chamber which is also an imaging device. You now have 400K to millions of beads on a slide.
  8. Anneal a short oligo primer to the sequences -- probably a primer binding
  9. Interrogate the DNA one position at a time to find out what nucleotide is present.
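
The remark in step 3 that many beads get no DNA and some get more than one can be given rough numbers. The sketch below assumes (my assumption, not something any vendor has stated) that molecules land on beads independently, so the count per bead follows a Poisson distribution; the mean loadings tried are hypothetical.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k molecules on a bead under a Poisson model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def bead_fractions(mean):
    """Fractions of beads with zero, exactly one, and more than one molecule."""
    empty = poisson_pmf(0, mean)
    single = poisson_pmf(1, mean)
    return empty, single, 1.0 - empty - single

# Hypothetical average numbers of DNA molecules per bead
for mean in (0.1, 0.5, 1.0):
    empty, single, multiple = bead_fractions(mean)
    print(f"mean={mean}: empty={empty:.2f} single={single:.2f} multiple={multiple:.2f}")
```

Under this model, diluting the DNA cuts down the beads with mixed templates but leaves more beads empty, which is why both kinds have to be filtered out.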

The details, and deviations, are what set the methods apart. For example, some methods require four steps to interrogate a position -- once for each nucleotide. In effect, the system 'asks' each molecule 'is the next nucleotide an A?' In other schemes, the question is in effect 'which is the next nucleotide?' -- all four are read out simultaneously. For many of the schemes which need to ask four times per nucleotide, the different strands may not be read synchronously. For example, if the first query is 'A' and the second 'C', then all sequences starting with 'A' are read on the first query -- and any which started 'AC' will have a C read on the second query.
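
A small simulation makes the asynchrony concrete. It is a sketch under stated assumptions: the query order A, C, G, T and the example strands are made up, and I assume a query keeps extending through a run of the same letter (true for chemistries without terminators, not for all schemes).

```python
def flow_reads(sequence, flow_order="ACGT", n_flows=8):
    """Query one nucleotide at a time; return (base asked, bases added) per query."""
    pos = 0
    signals = []
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        count = 0
        # Extend for as long as the next position matches the base being asked
        while pos < len(sequence) and sequence[pos] == base:
            count += 1
            pos += 1
        signals.append((base, count))
    return signals

def bases_read(sequence, n_flows):
    """Total number of positions read after the given number of queries."""
    return sum(count for _, count in flow_reads(sequence, n_flows=n_flows))

for strand in ("ACGT", "TTGA"):
    print(strand, flow_reads(strand, n_flows=4), bases_read(strand, 4))
```

After the same four queries the first strand has been read four positions deep and the second only two, so the reads drift out of step with each other.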

The enzyme running the interrogation is either DNA polymerase or DNA ligase. Most schemes use polymerase, but others use ligase with a set of special degenerate oligo pools (collections of short DNAs in which some positions have fixed sequence and others can be any of the four nucleotides).
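
To make "degenerate oligo pool" concrete, here is a minimal sketch that expands a pattern into every oligo it stands for. The pattern "NNA" is a made-up example (N meaning any of the four nucleotides), not the design of any actual ligation chemistry.

```python
from itertools import product

def expand_pool(pattern):
    """List every oligo matching a pattern in which N means A, C, G or T."""
    choices = ["ACGT" if ch == "N" else ch for ch in pattern]
    return ["".join(combo) for combo in product(*choices)]

pool = expand_pool("NNA")
print(len(pool))   # 16
print(pool[:4])    # ['AAA', 'ACA', 'AGA', 'ATA']
```

Each degenerate position multiplies the pool size by four, so two such positions plus one fixed letter give 16 distinct oligos.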

Detection schemes vary. Here are some of them:

  • 454 uses pyrosequencing, which takes advantage of the fact that when polymerase adds a nucleotide to DNA, it kicks off a pyrophosphate. A clever coupled series of enzymatic reactions can cause the pyrophosphate to trigger a light flash. Since pyrophosphate looks the same no matter which nucleotide it comes from, pyrosequencing is inherently a four-queries-per-position method.
  • Many methods use fluorescently labeled nucleotides, with the fluorescence read by a high-powered microscope. You then need to get rid of that fluorescence, or it will contaminate the next read. One option is to blast it with more light to photobleach (destroy) the label. For ligation based sequencing, the labeled oligo can be stripped off with heat. One group uses clever nucleotides where the fluorescent group can be chemically cleaved.
  • Some groups are using some very clever optical strategies to watch single polymerases working on single DNAs; these methods claim to be able to sequence single native DNAs with no PCR.

Some other things to look for in descriptions of sequencing by synthesis schemes:

  1. What's the read length? Short reads can be very useful, but for de novo sequencing of genomes long is much better.
  2. How many reads per run? For applications such as counting mRNAs, this may be much more important than read length. Read length x #reads = total bases per run, which can be astounding. 454 is claiming in their ads 400Mb per run, and Solexa (now Illumina) is shooting for 1Gb per run. Since Solexa reads are about 1/16th the length (roughly 25 vs 400), that means Solexa is packing a lot more beads.
  3. Can it reliably read runs of the same letter (homopolymers)? Some methods have trouble with these; others do quite well.
  4. What is the accuracy of individual reads? For some applications, such as looking for mutations in cancer genomes, the detection sensitivity is directly determined by the read accuracy. Other applications are not as sensitive, but it is still an important parameter.
  5. Can the method generate paired end reads (one read from each end of the same molecule)? This is handy for many applications (such as de novo sequencing), and essential for others.
  6. Run time. How long is a cycle & how many cycles per run? For some applications, it may be better to get another sample going than to run out the full read length (Incyte used this approach to great effect in generating their EST libraries).
  7. Cost to buy? -- well, if you want to own one.
  8. Cost to operate? -- well, if you want to do more than just brag you own one
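
The arithmetic behind item 2 is easy to check. This sketch just divides the quoted bases per run by the rough read lengths given above; the function name is mine, and the figures are the claimed or targeted ones from the text, not measurements.

```python
def reads_per_run(bases_per_run, read_length):
    """Read length x #reads = total bases per run, solved for #reads."""
    return bases_per_run // read_length

# Figures quoted in the list above
reads_454 = reads_per_run(400_000_000, 400)      # 400Mb per run, ~400 base reads
reads_solexa = reads_per_run(1_000_000_000, 25)  # 1Gb per run, ~25 base reads

print(reads_454)                  # 1000000
print(reads_solexa)               # 40000000
print(reads_solexa // reads_454)  # 40
```

So on these numbers a Solexa run would need about 40 times as many reads as a 454 run.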

One interesting trend is a gradual escalation of the benchmark for showing that your next generation technology is for real. While many groups first publish a paper showing the basic process, real validation comes with sequencing something big. 454 & Church sequenced bacterial genomes at first, but Solexa reported (at a meeting, not yet in print) using a few runs of their sequencer to sequence a human X chromosome.

Another way to look at it is this: Celera & the public project spent multiple years generating shotgun sequence drafts for the human (and later mouse, dog, opossum, chimp ... and now horse) genomes. To get respectable 12X coverage of a 4Gb genome, you need 48Gb of data -- or about 2 months of running a Solexa machine or 4 months on the 454 (if 1 run/day -- I'm not sure of the run times). YOW!
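
Here is that back-of-the-envelope calculation spelled out. It uses the same rough figures as the paragraph above (12X coverage, a 4Gb genome, the quoted per-run outputs) and the same unverified guess of one run per day; the function name is just for illustration.

```python
def runs_needed(coverage, genome_bases, bases_per_run):
    """Number of runs needed to reach the target coverage, rounded up."""
    total_bases = coverage * genome_bases
    return -(-total_bases // bases_per_run)  # ceiling division on integers

GENOME = 4_000_000_000  # 4Gb, as used above
COVERAGE = 12

for name, per_run in (("Solexa", 1_000_000_000), ("454", 400_000_000)):
    runs = runs_needed(COVERAGE, GENOME, per_run)
    # Assuming 1 run per day, runs equals days; 30 days taken as a month
    print(f"{name}: {runs} runs, roughly {runs / 30:.1f} months")
```

That works out to 48 runs on the Solexa machine and 120 on the 454, which is where the rough 2 month and 4 month figures come from.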

Of course, there are other technologies competing in the next generation space as well. Most are still sketchy but intriguing (I heard one briefly presented this evening). The current round of sequencing-by-synthesis technologies is expected to bring the price of a human genome down to a few hundred thousand dollars, so reaching the oft-quoted goal of $1K genomes will take either a lot of technological evolution -- or another technological revolution.


RPM said...

Thanks for the plug, but I'm not PZ Myers :)

I'm hoping that 454 and similar technologies will revolutionize de novo sequencing in eukaryotes, but the issue of paired end reads will cause problems. Also, the short read lengths will be troublesome for repetitive eukaryotic genomes.

I think 454 could be useful for resequencing in eukaryotes, but the accuracy of 454 reads may interfere with the purpose of resequencing projects (accurate detection of polymorphism). For example, 454 runs into problems with mononucleotide repeats.

PZ Myers said...

We are all PZ Myers over at ScienceBlogs.

Keith Robison said...

Sigh. The risk of working on a blog with too little sleep and too much deadline pressure.