Tuesday, March 08, 2011
Even back when I was finishing up as a graduate student, when only a few bacterial genomes had been published, one would periodically hear open speculation as to when the top journals would quit accepting genome sequencing papers. The thought was that once the novelty wore off, a genome would need to be increasingly difficult or have some very odd biology to keep getting into Science or Nature.
Happily, that still hasn't happened, and genome sequencing papers still show up across the whole range of journals. I don't claim to scan every one, but I do try to poke around in a lot of the eukaryotic papers (I long ago gave up on bacterial genomes; happily, they have become essentially uncountable). Two recent genomes in major journals, Daphnia (water flea) in Science and Pongo (orangutan, not dalmatian!) in Nature, show that the limit has not yet been reached. These papers share another thread: both genomes were sequenced using fluorescent capillary Sanger sequencing.
Sanger, of course, was the backbone of genome projects until only very recently. Even in the last few years, only a few large genomes have been initially published using second generation technologies.
Friday, December 10, 2010
Is Pacific Biosciences Really Polaroid Genomics?
The New England Journal of Medicine this week carries a paper from Harvard and Pacific Biosciences detailing the sequence of the Vibrio cholerae strain responsible for the outbreak of cholera in Haiti. The paper and supplementary materials (which contain detailed methods and some ginormous tables) are free for now. There's also a nice piece in BioIT World giving a lot of backstory. Not a few other media outlets have carried it as well, but those are the accounts I've read.
All in all, the project took about a month from an initial phone call from Harvard to Pacific Biosciences until the publication in NEJM. Yow!! Actual sequence generation took place 2 days after PacBio received the DNA. And this covered sequencing two isolates (which turned out to be essentially identical) of the Haitian bug plus three reference strains. While full sequence generation took longer, useful data emerged three hours after the samples went on the sequencer (though there are apparently around 10 wall-clock hours of prep before that point). With the right software and sufficient computational horsepower, one really could imagine standing around the sequencer and watching a genome develop before your eyes (just don't shake the machine!).
Between this and the data on PacBio's DevNet site (you'll need to register to get a password), in theory one could find the answers to all the nagging questions about the performance specs. In practice, this dataset appears to be available only as the assembled sequence, with summary statistics for certain aspects of the raw data. For example, dropped bases are apparently concentrated at C's and G's, so these were discounted.
Read lengths were 1,100 +/- 170 bp, which is quite good -- and this is after filtering out lower quality data -- and 5% of the reads were monsters bigger than 2,800 bases. It is interesting that they did not use the circular consensus method, which was previously published in a methods paper (which I covered earlier) and yields higher qualities but shorter fragments. It would be particularly useful to know whether the circular consensus approach effectively deals with the C/G dropout issue.
One small focus of the paper, especially in the supplement, is using depth of sequence coverage to infer copy number variation. There is a nice plot in Supplementary Figure 2 illustrating how the copy number varies with distance from the origin of replication. If you haven't looked at bacterial replication before, most bacteria have a single circular chromosome and initiate synthesis at one point (the 0 minute point in E. coli). In very rapidly dividing bacteria, the cell may not even wait for one round of synthesis to complete before firing off another round, but in any case in any dividing population there will be more DNA near the origin than near the terminus of replication. Presumably one could estimate the growth kinetics from the slope of the copy number from ori to ter!
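To make that last idea concrete, here's a minimal sketch (my own back-of-the-envelope, not anything from the paper) of how one might pull a growth proxy out of windowed coverage: fit log2(depth) against distance from the origin and extrapolate the ori/ter ratio. The window size, genome length and the assumption that depth decays exponentially from ori to ter are all placeholders of mine.

```python
import numpy as np

def ori_ter_ratio(depths, positions, ori, genome_len):
    """Estimate the ori/ter coverage ratio from windowed depths.

    depths     : mean coverage per window (e.g. 500 nt windows, as in the supplement)
    positions  : window midpoints in bp
    ori        : coordinate of the replication origin in bp
    genome_len : length of the (circular) chromosome in bp

    Assumes depth decays roughly exponentially with distance from ori, so
    log2(depth) is fit linearly against that distance; extrapolating to ter
    (half a genome away) gives the ratio -- a crude proxy for how actively
    the culture was replicating.
    """
    positions = np.asarray(positions)
    d = np.abs((positions - ori) % genome_len)
    d = np.minimum(d, genome_len - d)          # distance the short way around the circle
    slope, _ = np.polyfit(d, np.log2(depths), 1)
    return 2.0 ** (-slope * genome_len / 2)

# Toy example: simulate coverage falling from ~60x at ori to ~30x at ter
rng = np.random.default_rng(0)
pos = np.arange(0, 4_000_000, 500)
dist = np.minimum(pos, 4_000_000 - pos)
obs = rng.poisson(60 * 2.0 ** (-dist / 2_000_000))
print(round(ori_ter_ratio(obs, pos, ori=0, genome_len=4_000_000), 2))  # ~2.0
```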
After subtracting out this replication effect, most of the copy number fits a Poisson model quite nicely (Supplementary Figure 3). However, there is still some variation. Much of this is around ribosomal RNA operons, which are challenging to assemble correctly since they appear as arrays of long, nearly (or completely) perfect repeats. There's even a table of the sequencing depth for each strain at 500 nucleotide intervals! Furthermore, Supplementary Figure 4 shows the depth of coverage (uncorrected for the replication polarity effect) at 6X, 12X, 30X and 60X, illustrating how many of the trends are already noticeable in the 6X data.
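And a companion sketch of the sort of depth test the Poisson comparison implies (again my toy version, not the authors' pipeline): after correcting each window for the ori-to-ter gradient, flag windows whose read counts are improbable under a Poisson with the corrected expectation.

```python
import numpy as np
from scipy.stats import poisson

def flag_cnv_windows(observed_counts, expected_counts, alpha=1e-3):
    """Two-sided Poisson test per window for copy number changes.

    observed_counts : reads falling in each window (e.g. 500 nt windows)
    expected_counts : per-window expectation after correcting for the
                      replication gradient (see the sketch above)
    Returns a boolean array marking windows with surprisingly high or low
    depth, such as the rRNA operon regions mentioned in the supplement.
    """
    obs = np.asarray(observed_counts)
    exp = np.asarray(expected_counts, dtype=float)
    p_high = poisson.sf(obs - 1, exp)   # P(X >= observed)
    p_low = poisson.cdf(obs, exp)       # P(X <= observed)
    return np.minimum(p_high, p_low) < alpha / 2
```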
What biology came out of this? A number of genetic elements were identified in the Haitian strains which are consistent with it being a very bad actor and with it being a variant of a nasty Asian strain.
All in all, this neatly demonstrates how PacBio could be the backbone of a very rapid biosurveillance network. It is surprising that in this day and age the CDC (as detailed in the BioIT article) even bothered with a pulsed field study; even on other platforms the turnaround for a complete sequence wouldn't be much longer than for the gel study, and the results are so much richer. Other technologies might work too, but the very long read lengths and fast turnaround offered here should be very appealing, even if the cost of the instrument (much closer to $1M than to my budget!) isn't. A few instruments around the world serving other customers but giving priority to such samples could form an important tripwire for new infections, whether they be acts of nature or of evil persons. It is important to note, though, that this involved a known, culturable bug and that the DNA was derived from pure cultures, not straight environmental isolates.
On a personal note, I am quite itchy to try out one of these beasts. As a result, I'm making sure we stash some DNA generated by several projects so that we could use them as test samples. We know something about the sequence of these samples and how they performed with their intended platforms, so they would be ideal test items. None of my applications are nearly as exciting as this work, but they are the workaday sorts of things which could be the building blocks of a major flow of business. Anyone with a PacBio interested in collaborating is welcome to leave me a private comment (I won't moderate it through), and of course my employer would pay reasonable costs for such an exercise. Or, I certainly wouldn't stamp "return to sender" on the crate if an instrument showed up on the loading dock! I don't see PacBio clearing the stage of all competitors, but I do see it both opening new markets and throwing some serious elbows against certain competing technologies.
Sunday, October 31, 2010
Plenty of Genomes are Still Fair Game for Sequencing

I've been grossly neglecting this space for an entire month with only the usual excuses -- big work projects, a lot of reading, etc. None good enough. Worst of all, as usual, it's not that I haven't composed possible entries in my head -- they just never get past my fingertips.
Tonight is the night most associated with pumpkins, and an earlier highlight of the season was attending the Topsfield Fair, where the pictured specimen was on display. Amazing as it is, it fell nearly 15 pounds shy of the world record. If you want to try growing your own, the variety which has dominated the winners can be purchased every year. Nature isn't everything, though; champion pumpkin growing requires a lot of specialized culture, ranging from allowing only a single fruit to set to injecting nutrients just upstream of that fruit.
Recently, GenomeWeb noted some other blogs discussing whether there are any truly remarkable genome sequencing projects left, which I've been pondering: what makes for a really interesting species to sequence? Now, both of the bloggers mentioned were clearly not fond of either "K" genome project -- the 1,000 humans or the 10,000 vertebrates. There were also some potshots taken at the "delicious or cute" genomes concept. One suggested that no interesting metazoa ("animals") are left.
So, what does make an interesting genome? Well, I can think of several broad categories. I'll try to throw out possible examples of each, though to be honest I wouldn't be surprised if some of these genomes are sequenced or nearly so -- it's very hard to keep track of complete genomes these days!
First, and I think this would resonate with those two critical pieces, would be genomes with interesting histories -- genomes that might tell us stories purely about DNA. This was the bent of the pieces I refer to. In particular, they were thinking of many of the unicellular eukaryotes which are the result of multiple endosymbiont acquisition / genome fusion events. But I would definitely throw into this category a particular animal: the bdelloid rotifers, which have gone without recombination for a seeming eternity. Of course, to really understand that genome, you'd need to also sequence one of the less chaste rotifers.
Another hugely interesting class of genomes would be those that shed light on development and its evolution (evo-devo). In particular, there are a lot of arthropod genomes yet unsequenced -- from what I've noted, it appears that most sequenced arthropods are either disease vectors, agricultural pests or otherwise economically important (plus, of course, the model Drosophila). Even so, I'd guess there are not many more than a dozen complete arthropod genomes so far -- quite a paucity considering the wealth of insects alone. And, if I'm not mistaken, mostly insects and an arachnid or two have gone fully through the sequencer -- where are all the others? By the way, I'd be happy to help with sample prep for the Homarus americanus genome!
Another huge space of genomes worth exploring are those where we are likely to find unusual biochemistry going on. Now, a lot of those genomes are bacterial or fungal, but there are also an awful lot of advanced plants with interesting and useful biochemical syntheses.
All that said, I find it odd that some don't see the import and utility of sequencing many, many humans and a lot of vertebrates also. It is important to remember that a lot of funding is from the public, and the public considers many of these other pursuits less important than making medical advances. It is easy for those of us in the biology community to see the longer threads connecting these projects to human health or just the importance of pursuing curiosity, but that doesn't always sell well in public.
An optimistic view is that all the frustrated sequencers should hunker down and patiently wait; data generation for new genomes is getting cheaper by the minute, with short reads to fill out the sequence and ultra-long reads to replace physical mapping. A more conservative view holds that bioinformatics & data storage will soon dominate the equation, which might still make it hard to get lots of worthy genomes sequenced.
Personally, I can't stroll a country fair without wanting to sequence just about everything I see on display -- the chickens that look like Philadelphia Mummers, the two-yard-long squash, the bizarrely shaped tomatoes -- and of course, the three-quarter-ton-plus pumpkins.
Tuesday, September 21, 2010
Review: The $1000 Genome
Kevin Davies' "The $1000 Genome" deserves to be widely read. Readers of this space will not be surprised that there are a few changes I might have imposed had I been its editor, but on the whole it presents a careful and I think entertaining view of the past and possible future of personal genomics.
The book is intended for a far wider audience than geeky genomics bloggers, so the emphasis is not on the science. Rather, it is on some of the key movers-and-shakers in the field and some of the companies which have been dominating this space, ranging from the first personal genetic mapping companies (23andMe, Navigenics, Pathway Genomics and deCODEme) to the instrument makers (such as Solexa/Illumina, Helicos, Pacific Biosciences, ABI and Oxford Nanopore) to those working on various aspects of human genome sequencing services (such as Knome and Complete Genomics). Various ups and downs of these companies -- and the debates they have engendered -- are covered, as well as the possible impacts on society. Along the way, we see a few glimpses of Davies exploring his own genome and some of the biological history which he seeks to illuminate through these expeditions.
It is not a trivial task to try to explain this field to an educated lay public, but I think in general Davies does a good job. The overviews of the technologies are limited but give the gist of things. Anyone writing in this space is faced with the dilemma of trying to explain too much and losing the main thread or failing to explain and preventing the reader from finding it. Mostly I think he has succeeded in threading this needle, perhaps because only rarely did I feel he had missed. One example I did note was in explaining PacBio's technology; hardly anyone in science will know what a zeptoliter is, let alone someone outside of it. On the other hand, what analogy or refactoring of that term could remove it from the edges of science fiction? Not an easy challenge!
For better or worse, once I've decided I generally like a book like this my next thoughts are what could be removed and what could be added. I really could find little to remove. But, there are a few things I wish were either expanded or had made it in altogether.
It would be dreary to enumerate every company which has ever thrown its hat into the DNA sequencing ring. It is valuable that Davies covers a few of the abject failures, such as Manteia (which did yield some key technology to Illumina when sold for its assets) and US Genomics. There is scant coverage, other than by mention, of most of the companies which have made only nascent attempts to enter the arena. However, the one story I really did miss was anything about the Polonator. It's not that I really think this system will conquer the others (though perhaps I hope it will hold its own); it just represents a very different tack in corporate strategy that would have been interesting to contrast with the other players.
Davies has been in the thick of the field as editor of Bio IT World, so this is no stitching together of secondary sources. I also appreciated that he includes both the ups and the downs for these companies, emphasizing that this has not been easy for any of them. But that added to my surprise at several incidents which were left out (believe me, many were left in that I had never heard before). Davies describes how Helicos delivered an instrument to the CRO Expression Analysis, but not that it was very publicly returned for failing to perform to spec. Nor is Helicos' failed attempt to sell itself mentioned. An interesting anecdote on Complete Genomics is how a wildfire nearly disrupted one of their first human genome runs; left out is the near-death experience of that company when it was forced to either lay off or defer salaries for nearly all of its staff. The section on Complete's founder Rade Drmanac mentions Hyseq, but not the company (or was it two?) which he ran between Hyseq and Complete to try to commercialize sequencing-by-hybridization. This would have added to the portrait of determination -- and of the travails of the corporate arena. I was also surprised that the short profile of Sydney Brenner as a personal genomics skeptic didn't include the fact that he invented the technology behind Lynx, which was another early attempt at non-electrophoretic sequencing. Some would see that as irony.
Another area I would like to have seen expanded was the exploration of groups such as Patients Like Me, which are windows on how much people are willing to chance disclosing sensitive medical information. One section explores the fact that several prominent persons interested in this field became so when their children were diagnosed with rare recessive disorders, leading them to ponder whether they would have made the same marriage had they known in advance of this danger. I was surprised that little of the existing experience in this area was explored; I believe the Ashkenazi population has dealt with this in screening for Tay-Sachs and other horrific disorders which are prevalent there.
The book is stunningly up-to-date for something published at the beginning of September; some incidents as late as June are reported. Despite this, I found little evidence of haste. I'm still trying to figure out what a "nature capitalist" is, but that's the only case I spotted of a likely mis-wording.
Davies briefly explores possible uses of these sequencing technologies beyond our germline sequences, but only very briefly. Personally, I think that cancer genomics will have a more immediate and perhaps greater overall impact on human medicine, and I wish it had gotten a bit more in-depth treatment.
Davies is an expatriate Brit, living not very far from me. The sections on the possible impact of widespread genome sequencing on medicine are written almost entirely from a U.S. perspective, with our hybrid public-private healthcare system. I suspect European readers would hunger for more discussion of how personal genomics might be handled within their socialized medical systems and their different histories of handling the ethical issues (Germany, I believe, has pretty much banned personal genomics services). On this side of the pond, he does a nice job of showing how different state agencies have charged into the breach left, until recently, by the FDA.
Okay, too many quibbles. Well, maybe one last one -- it would have been nice to see more on some of the academic bioinformaticians who have created such wonderful and amazing open-source tools as Bowtie and BWA.
As I mentioned above, Davies injects a good amount of himself into all this. I've encountered books (indeed, one recently on moon walkers) in which this becomes a tedious over-exposure to the author's ego. This is not such a book. The personal bits either link pieces of the story or make them more approachable. We find out that he has already attained a greater age than his father did (his father died of testicular cancer, one of the few cancers in which overwhelming progress has since been made), leading to questions he hopes his genome can answer. Hence his trying out of pretty much all of the array-based personal genetic services. But he does not address one question that the book raised in my mind: will the royalties from this project fund a complete Davies genome?
Wednesday, June 23, 2010
PacBio oiBcaP PacBio oiBcaP
PacBio has a paper in Nucleic Acids Research giving a few more details on sample prep and some limited sequencing quality data for their platform.
The gist (which was largely in the previous Science paper) is that the templates are prepared by ligating a hairpin structure onto each end. In this paper they focused on a PCR product, with the primers designed with type IIS (BsaI) restriction sites flanking the insert. Digestion yields sticky ends (and with BsaI, these can be designed to be non-symmetric to discourage concatemerization) which enable ligating the adapters. Of course, for applying this on a large scale you do run into the problem of avoiding the restriction sites. There are other approaches, such as uracil-laden primer segments which can be specifically destroyed by the DUT and UNG enzymes -- the only catch being that many of the most accurate polymerases are poisoned by uracil. A couple of other schemes have either been proven or can be sketched quickly.
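As a toy illustration of the type IIS trick (my own sketch, with made-up primer tails rather than anything from the paper): because BsaI cuts outside its GGTCTC recognition site, the 4 nt overhang it leaves is whatever the primer designer put there, so the two ends of an amplicon can carry different, non-complementary overhangs that ligate to hairpin adapters but not to each other.

```python
# BsaI recognizes GGTCTC and cuts 1 nt downstream on the top strand and
# 5 nt downstream on the bottom strand, leaving a 4 nt 5' overhang whose
# sequence is entirely up to the primer designer.
BSAI_SITE = "GGTCTC"

def bsai_overhang(top_strand: str) -> str:
    """Return the 4 nt overhang generated at the first BsaI site (top-strand orientation)."""
    i = top_strand.index(BSAI_SITE)
    cut = i + len(BSAI_SITE) + 1       # skip the single spacer base after the site
    return top_strand[cut:cut + 4]

# Hypothetical primer tails (in a real design the reverse primer carries its
# site in the opposite orientation); 'NNN...' stands for target-specific sequence.
forward_tail = "GGTCTCT" + "ACCT" + "NNN..."
reverse_tail = "GGTCTCT" + "TGGA" + "NNN..."
print(bsai_overhang(forward_tail), bsai_overhang(reverse_tail))   # ACCT TGGA
```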
In any case, once these molecules are formed they have the nice property of being topologically circular and having the insert sequence represented twice (once in reverse complement) within that circle, with the two pieces separated by the known linker sequence. So if the polymerase lasts to go round-and-round, each pass gives another crack at the sequence. For short templates this gives lots of rounds, and even on a longer (1 Kb) PCR product they successfully had three bites at the apple. Feed these into a probabilistic aligner and each pass improves the quality values. Running the same template many times enabled calibrating the quality values (which are Phred-scaled), showing them to be a reasonable estimator of the likelihood of miscalling the consensus.
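As a crude illustration of why the passes help (a naive model of my own, not PacBio's quality-aware probabilistic consensus, which does considerably better): even a simple majority vote over independent passes drives the error rate down quickly. The ~15% single-pass error rate below is a commonly quoted ballpark for early single-molecule reads, not a number from this paper.

```python
from math import comb

def majority_vote_error(per_pass_error: float, passes: int) -> float:
    """Chance a simple majority vote over independent passes miscalls one base.

    Worst-case naive model: every pass errs independently with the same
    probability and all erroneous calls agree with each other; ties count
    as errors to stay conservative.
    """
    return sum(
        comb(passes, k) * per_pass_error**k * (1 - per_pass_error) ** (passes - k)
        for k in range((passes + 1) // 2, passes + 1)
    )

for n in (1, 3, 5, 7):
    print(f"{n} passes: error ~{majority_vote_error(0.15, n):.3g}")
```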
Given that eventually their polymerase dies on a template, they point out that the length of a fragment is highly predictive of the number of passes which can be observed, which in turn implies the sequencing quality which will be obtained for a template. This makes for an interesting variation on the usual "what size library?" question in sequencing. For sequencing a bacterial genome, it may make sense to go very deep with short fragments to get high quality pieces, which can then be scaffolded using much longer individual reads without the circular consensus effect, as well as the spaced "strobe" reads to provide very long range scaffolding. Given a bacterial genome and the claims made of capacity per SMRT cell, it would seem that by making two libraries (one short for consensus sequencing, one long for the others) and burning perhaps 4-5 $100 SMRT cells, one could get a very good bacterial genome draft. For targeted sequencing by PCR, the approach is quite attractive. Amplicon length becomes an important variable for final quality. Appropriate matching of the upstream multiplexed-PCR technology (such as RainDance or Fluidigm, or just lots of conventional PCRs) to PacBio's capacity will be necessary -- and hard numbers on throughput per SMRT cell are desperately needed!
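The library-sizing tradeoff is easy to sketch numerically (all the numbers here are placeholders of my own; the polymerase read budget is simply chosen to be roughly consistent with the ~3 passes on a 1 Kb product mentioned above): shorter inserts buy more passes, and hence more consensus accuracy, at the cost of span.

```python
def expected_passes(insert_bp: int, adapter_bp: int = 50, read_budget_bp: int = 3200) -> float:
    """Rough count of how many times the insert gets traversed on a hairpin-closed template.

    Each look at the insert costs insert_bp + adapter_bp of synthesis
    (alternating sense and antisense strands around the circle). adapter_bp
    and read_budget_bp are made-up placeholders, not published PacBio specs.
    """
    return read_budget_bp / (insert_bp + adapter_bp)

for insert in (250, 500, 1000, 2000):
    print(f"{insert:>5} bp insert -> ~{expected_passes(insert):.1f} passes")
```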
One side thought: is there a risk with strobe sequencing of accidentally going off the end and reading back along the same insert -- but not realizing it since the flip turn in the hairpin was done in the dark? It would suggest that libraries for strobe sequencing need to be very stringently sized.
One curiosity of their plot of this phenomenon is that the values appear to be asymptotically approaching Phred 40 -- an error rate of 1 in 10,000. Is this really where things top out? That's a good quality -- but for some applications possibly not good enough.
They go on to apply this to measuring the allele frequency of a SNP in defined mixtures. Given several thousand reads and the circular consensus, they succeeded -- even when the minor allele was present at a frequency of 2.5%. This isn't as far as some groups have pushed second generation sequencing for rare allele detection (such as finding a cancer mutation in a background of normal DNA), but it is certainly in the range of many schemes used in this context (such as real-time PCR). If my inference of a maximum Phred score is correct, it would tend to limit the sensitivity for rare mutation detection to somewhere in the parts-per-thousand range; the record is around this level (on the order of 0.1%, reported by Yauch et al.).
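To put that ceiling in rough numbers (my own arithmetic, under generous assumptions about error independence, not an analysis from the paper): with a per-base consensus error around 1e-4, a minor allele at 2.5% towers over the noise, 0.1% is still comfortably detectable at a few thousand reads, and much below that the signal starts to sink into the error background.

```python
from math import sqrt

def detection_zscore(minor_freq: float, per_base_error: float, depth: int) -> float:
    """Crude z-score for a minor allele standing out above sequencing error.

    Assumes errors are independent across reads and that roughly a third of
    miscalls produce the particular alternate base being tested -- generous
    assumptions; real error profiles are messier.
    """
    background = per_base_error / 3
    signal = depth * minor_freq
    noise = depth * background
    return (signal - noise) / sqrt(max(noise, 1e-12))

for freq in (0.025, 0.005, 0.001, 0.0002):
    print(f"minor allele {freq:.2%}: z ~ {detection_zscore(freq, 1e-4, depth=5000):.0f}")
```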
My major concern with this paper is the lack of a breadth of substrates. Are there template properties which give trouble to the system? The obvious candidates are extremes of composition, secondary structures and simple repeats. Will their polymerase power through 80+% GC? Complex hairpins? Will stuttering be observed on long simple repeats? Can long mononucleotide runs be measured accurately?
In the end, the key uses for the first-release PacBio system will depend on where these problems lie and on those throughput numbers. This paper is a useful bit of information, but much more is needed to determine when PacBio is either cost-effective for an application (vs. other sequencing or non-sequencing competitors) or when its performance advantages in that application (such as turnaround time) push it to the fore. Personally, I think it is a mistake on PacBio's part not to be literally flooding the world with data -- or at least the bioinformatics world. Only when they start releasing data to the broad developer community will there be the critical tweaking of existing tools and development of new tools to really take advantage of this platform and also cope with whatever weaknesses it possesses.

Travers, K., Chin, C., Rank, D., Eid, J., & Turner, S. (2010). A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Research. DOI: 10.1093/nar/gkq543
Thursday, April 29, 2010
Application of Second Generation Sequencing to Cancer Genomics
A review by me titled "Application of Second Generation Sequencing to Cancer Genomics" is now available on the Advance Access section of Briefings in Bioinformatics. You'll need a subscription to read it.
I got a little obsessive about making the paper comprehensive. While it does focus on using second generation sequencing for mutation, rearrangement and copy number aberration detection (explicitly ruling RNA-Seq and epigenomics out of scope), it attempts to touch on every paper in the field up to March 1st. To my chagrin, I discovered just after submitting the final revision that I had omitted one paper. I was able to slide it into the final proof, but not without making a small error. There's one other paper I might have mentioned that actually used whole genome amplification upstream of second generation sequencing on a human sample, though it's not a very good paper: the sequencing coverage is horrid and it wasn't about cancer. In any case, it won't shock me completely -- though it will shock me a lot -- if someone can find a paper in that timeframe that I missed. So don't gloat too much if you find one -- but please post here if you do!
Of course, any constructive criticism is welcome. There are bits I would be tempted to rewrite if I went through the exercise again and the part on predicting the functional implications of mutations could easily be blown out into a review of its own. I don't have time to commit to that, but if anyone wants to draft one I'd help shepherd it at Briefings. I'm actually on the Editorial Board there and this review erases my long-term guilt over being on the masthead for a number of years without actually contributing anything.
As I state in the intro, in a field such as this a printed review is doomed to become incomplete very quickly. I'm actually a bit surprised that there has been only one major cancer genomics paper between my cutoff and the preprint emerging -- the breast cancer quartet paper from Wash U. I fully expect many more papers to appear before the physical issue shows up (probably in the fall), and certainly a year from now much should have happened. But it is useful to mark off the state of a field at a certain time. In some fields it is common to publish annual or semi-annual reviews which cover all the major events since the last review; perhaps I should start logging papers with that sort of concept in mind.
One last note: now I can read "the competition". Seriously, another review on the subject, by Elaine Mardis and Rick Wilson, came out around the time I had my first crude set of paragraphs (it would be a stretch to grant it the title of draft). At that time, I had two small targeted projects in process, and they had already published two leukemia genome sequences. It was tempting to read it, but I feared I would be overly influenced by it, or worse, would become paranoid about plagiarizing bits, so I decided not to read it until my review published.
Thursday, January 21, 2010
A plethora of MRSA sequences
The Sanger Institute's paper in Science describing the sequencing of multiple MRSA (methicillin-resistant Staphylococcus aureus) genomes is very nifty and demonstrates a whole new potential market for next-generation sequencing: the tracking of infections in support of better control measures.
MRSA is a serious health issue; a relative of a friend of mine is battling it right now. MRSA is commonly acquired in health care facilities. Further spread can be combated by rigorous attention to disinfection and sanitation measures. A key question when MRSA shows up is: where did it come from? How does it spread across a hospital, a city, a country or the world?
The gist of the methodology is to grow isolates overnight in the appropriate medium and extract the DNA. Each isolate is then converted into an Illumina library, with multiplex tags to identify it. The reference MRSA strain was also thrown in as a control. Using the GAII as they did, they packed 12 libraries onto one run -- over 60 isolates were sequenced for the whole study. With increasing cluster density and the new HiSeq instrument, one could imagine 2-5 fold (or perhaps greater) packing being practical; i.e. the entire study might fit on one run.
The library prep method sounds potentially automatable -- shearing on the Covaris instrument, cleanup using a 96 well plate system, end repair, removal of small (<150 nt) fragments with size exclusion beads, A-tailing, another 150 nt filtering by beads, adapter ligation, another 150 nt filtering, PCR to introduce the multiplexing tags, another filtering for <150 nt, quantitation and then pooling. Sequencing was "only" 36 nt single end, with an average of 80 Mb. Alignment to the reference genome was by ssaha (a somewhat curious choice, but perhaps now they'd use BWA) and SNP calling with ssaha_pileup; non-mapping reads were assembled with Velvet and did identify novel mobile element insertions. According to GenomeWeb, the estimated cost was about $320 per sample. That's probably just a reagents cost, but it gives a ballpark figure.
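Some quick arithmetic on what that buys (my numbers; the ~2.9 Mb S. aureus genome size is a textbook figure, and I'm assuming the 80 Mb average yield is per barcoded isolate): roughly 28x coverage each, with the barcode count acting as the knob that trades coverage per isolate against isolates per run.

```python
GENOME_MB = 2.9              # approximate S. aureus chromosome size (textbook value)
YIELD_PER_ISOLATE_MB = 80    # average yield, assumed to be per multiplexed isolate
ISOLATES_PER_POOL = 12       # libraries pooled per run in the study

coverage = YIELD_PER_ISOLATE_MB / GENOME_MB
print(f"~{coverage:.0f}x coverage per isolate")          # ~28x

# If a newer instrument delivered k-fold more data for the same run, the same
# per-isolate depth would support k-fold more barcodes (ignoring cluster density
# limits and index cross-talk -- assumptions, not measurements).
for k in (2, 5):
    print(f"{k}-fold yield -> ~{ISOLATES_PER_POOL * k} isolates per pool at the same depth")
```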
Existing typing methods either look at SNPs or specific sequence repeats, and while these often work, they sometimes give conflicting information and other times lack the power to resolve closely related isolates. Having high resolution is important for teasing apart the history of an outbreak -- correlating patient isolates with samples obtained from the environment (hospital floors and the like).
Phylogenetic analysis using SNPs in the "core genome" showed a strong pattern of geographical clustering -- but with some key exceptions, suggesting intercontinental leaps of the bug.
Could such an approach become routine for infection monitoring? A fully-loaded cost might be closer to $20K per experiment or higher. With appropriate budgeting, this can be balanced against the cost of treating an expanding number of patients and providing expensive support (not to mention the human misery involved). Full genome sequencing might also not always be necessary; targeted sequencing could potentially allow packing even more samples onto each run. Targeted sequencing by PCR might also enable eliding the culturing step. Alternatively, cheaper (and faster; this is still a multi-day Illumina run) sequencers might be used. And, of course, this can easily be expanded to other infectious diseases with important public health implications. For those that are expensive or slow to grow, PCR would be particularly appropriate.
It is also worth noting that it has been only about 15 years since the first bacterial genome was sequenced. Now the thought of doing hundreds a week is not at all daunting. Resequencing a known bug is clearly less of a bioinformatic challenge, but still, how far we've come!

Simon R. Harris, Edward J. Feil, Matthew T. G. Holden, Michael A. Quail, Emma K. Nickerson, Narisara Chantratita, Susana Gardete, Ana Tavares, Nick Day, Jodi A. Lindsay, Jonathan D. Edgeworth, Hermínia de Lencastre, Julian Parkhill, Sharon J. Peacock, & Stephen D. Bentley (2010). Evolution of MRSA During Hospital Transmission and Intercontinental Spread. Science, 327 (5964), 469-474. DOI: 10.1126/science.1182395
Thursday, December 17, 2009
A Doublet of Solid Tumor Genomes
Nature this week published two papers describing the complete sequencing of a cancer cell line -- small cell lung cancer (SCLC) line NCI-H209 and melanoma line COLO-829 -- each along with a "normal" cell line from the same individual. I'll confess a certain degree of disappointment at first, as these papers are not rich in the information of greatest interest to me, but they have grown on me. Plus, it's rather churlish to complain when I have nothing comparable to offer myself.
Both papers have a good deal of similar structure, perhaps because their author lists overlap substantially, including the same first author. However, technically they are quite different. The melanoma sequencing used the Illumina GAII, generating 2x75 paired end reads supplemented with 2x50 paired end reads from 3-4 Kb inserts, whereas the SCLC paper used 2x25 mate pair SOLiD libraries with inserts between 400 and 3,000 bp.
The papers have estimates of the false positive and false negative rates for the detection of various mutations, in comparison to Sanger data. For single base pair substitutions on the Illumina platform in the melanoma sample, 88% of previously known variants were found and 97% of a sample of 470 newly found variants were confirmed by Sanger. However, for small insertions/deletions (indels) there was both less data and much less success. Only one small deletion was previously known, a 2 base deletion which is key to the biology. This was not found by the automated alignment and analysis, though reads containing this indel could be found in the data. A sample of 182 small indels was checked by Sanger and only 36% were confirmed. For large rearrangements, 75% of those tested were confirmed by PCR.
The statistics for the SOLiD data in SCLC were comparable. 76% of previously known single nucleotide variants were found and 97% of newly found variants were confirmed by Sanger. Two small indels were previously known and neither was found; conversely, only 25% of predicted indels were confirmed by Sanger. 100% of large rearrangements tested by PCR validated. So overall, both platforms do well for detecting rearrangements and substitutions and are very weak for small indels.
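Putting those validation figures into the usual sensitivity/precision terms (using only the percentages quoted above):

```python
# Rough sensitivity (fraction of previously known variants recovered) and
# precision (fraction of novel calls confirmed by Sanger) for substitutions,
# taken straight from the figures summarized above.

platforms = {
    # platform: (known variants found, novel calls confirmed)
    "Illumina GAII (melanoma)": (0.88, 0.97),
    "SOLiD (SCLC)":             (0.76, 0.97),
}

for name, (sensitivity, precision) in platforms.items():
    print(f"{name}: sensitivity ~{sensitivity:.0%}, precision ~{precision:.0%}")
```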
The overall mutation hauls were large, after filtering out variants found in the normal cell line: 22,910 substitutions for the SCLC line and 33,345 in the melanoma line. Both of these samples reflect serious environmental abuse; melanomas often arise from sun exposure and the particular cancer morphology the SCLC line is derived from is characteristic of smokers (the smoking history of the patient was unknown). Both lines showed mutation spectra in agreement with what is previously known about these environmental insults. 92% of C>T single substitutions occurred at the second base of a pyrimidine dimer (CC or TC sequences). CC>TT double substitutions were also skewed in this manner. CpG dinucleotides are also known to be hotspots and showed elevated mutation frequencies. Transcription-coupled repair repairs the transcribed strand more efficiently than the non-transcribed strand, and in concordance with this, in transcribed regions there was nearly a 2:1 bias of C>T changes on the non-transcribed strand. However, the authors state (but I still haven't quite figured out the logic) that transcription-coupled repair can account for only 1/3 of the bias and suggest that another mechanism, previously suspected but not characterized, is at work. One final consequence of transcription-coupled repair is that the more expressed a gene is in COLO-829, the lower its mutational burden. A bias of mutations towards the 3' end of transcribed regions was also observed, perhaps because 5' ends are transcribed at higher levels (due to abortive transcription). A transcribed-strand bias was also seen in G>T mutations, which may reflect oxidative damage.
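For the curious, here's roughly how that sort of dipyrimidine tally works -- a toy sketch with an invented variant format and toy data, not the papers' actual pipeline:

```python
# Minimal sketch: fraction of C>T substitutions whose 5' neighbor is a
# pyrimidine (i.e. the mutated C is the second base of a CC or TC context),
# the UV hallmark discussed above. The variant tuples are invented toy data.

PYRIMIDINES = {"C", "T"}

def dipyrimidine_fraction(variants):
    """variants: iterable of (five_prime_base, ref, alt) on the mutated strand."""
    c_to_t = [v for v in variants if v[1] == "C" and v[2] == "T"]
    if not c_to_t:
        return 0.0
    hits = sum(1 for five_prime, _, _ in c_to_t if five_prime in PYRIMIDINES)
    return hits / len(c_to_t)

toy_variants = [("C", "C", "T"), ("T", "C", "T"), ("A", "C", "T"), ("G", "G", "T")]
print(f"{dipyrimidine_fraction(toy_variants):.0%} of C>T calls in dipyrimidine context")
# 67% of C>T calls in dipyrimidine context
```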
An additional angle on mutations in the COLO-829 melanoma line is offered by the observation of copy-neutral loss of heterozygosity (LOH) in some regions. In other words, one copy of a chromosome was lost but then replaced by a duplicate of the remaining copy. This analysis is enabled by having the sequence of the normal DNA to identify germline heterozygosity. Interestingly, in these regions heterozygous mutations outnumber homozygous ones, marking that these substitutions occurred after the reduplication event. 82% of C>T mutations in these regions show the hallmarks of being late mutations, suggesting they occurred after the reduplication, perhaps after the melanoma metastasized and was therefore removed from ultraviolet exposure.
In a similar manner, there is a rich amount of information in the SCLC mutational data. I'll skip over a bunch to hit the evidence for a novel transcription-coupled repair pathway that operates on both strands. The key point is that highly expressed genes had lower mutation rates on both strands than less expressed genes. A>G mutations showed a bias for the transcribed strand whereas G>A mutations occurred equally on each strand.
Now, I'll confess I don't generally get excited about looking at mutation spectra. A lot of this has been published before, though these papers offer a particularly rich and low-bias look. What I'm most interested in are recurrent mutations and rearrangements that may be driving the cancer, particularly if they suggest therapeutic interventions. The melanoma line contained two missense mutations in the gene SPDEF, which has been associated with multiple solid tumors. A truncating stop mutation was found by sequencing SPDEF in 48 additional tumors. A missense change was found in a metalloprotease (MMP28) which has previously been observed to be mutated in melanoma. Another missense mutation was found in a gene which may play a role in ultraviolet repair (though it has been implicated in other processes), suggesting a tumor suppressor role. The sequencing results confirmed two out of three known driver mutations in COLO-829: the V600E activating mutation in the kinase BRAF and deletion of the tumor suppressor PTEN. As noted above, the known 2 bp deletion in CDKN2A was not found through the automated process.
The SCLC sample has a few candidates for interestingly mutated genes. A fusion gene in which one partner (CREBBP) has been seen in leukemia gene fusions was found. An intragenic tandem duplication within the chromatin remodelling gene CHD7 was found which should generate an in-frame duplication of exons. Another SCLC cell line (NCI-H2171) was previously known to have a fusion gene involving CHD7. Screening of 63 other SCLC cell lines identified another (LU-135) with internal exon copy number alterations. LU-135 was further explored by mate pair sequencing with a 3-4Kb library, which identified a breakpoint involving CHD7. Expression analysis showed high expression levels of CHD7 in both LU-135 and NCI-H2171 and generally higher expression of CHD7 in SCLC lines than in non-small cell lung cancer lines and other tumor cell lines. An interesting twist is that the fusion partner in NCI-H2171 and LU-135 is a non-coding RNA gene called PVT1 -- which is thought to be a transcriptional target of the oncogene MYC. MYC is amplified in both these cell lines, suggesting multiple biological mechanisms resulting in high expression of CHD7. It would seem reasonable to expect some high profile functional studies of CHD7 in the not too distant future.
For functional point mutations, the natural place to look is at coding regions and splice junctions, as here we have the strongest models for ranking the likelihood that a mutation will have a biological effect. In the SCLC paper an effort was made to push this a bit further and look for mutations that might affect transcription factor binding sites. One candidate was found but not further explored.
In general, this last point underlines what I believe will be different about subsequent papers. Looking mostly at a single cancer sample, one is limited in what can be inferred. The mutational spectrum work is something which a single tumor can illustrate in detail, and such in-depth analyses will probably be significant parts of the first tumor sequencing paper for each tumor type, particularly other types with strong environmental or genetic mutational components. But, in terms of learning what makes cancers tick and how we can interfere with that, the real need is to find recurrent targets of mutation. Various cancer genome centers have been promising a few hundred tumors sequenced over the next year. Already at the recent ASH meeting (which I did not attend), there were over a half dozen presentations or posters on whole genome or exome sequencing of leukemias, lymphomas and myelomas -- the first ripples of the tsunami to come. But, the raw cost of targeted sequencing remains at most a tenth of the cost of an entire genome. The complete set of mutations found in either one of these papers could have been packed onto a single oligo-based capture scheme, and certainly a high-priority subset could be amplified by PCR without breaking the bank on oligos. I would expect that in the near future tumor sequencing papers will check their mutations and rearrangements on validation panels of at least 50 and preferably hundreds of samples (though assembling such sample collections is definitely not trivial). This will allow estimation of the population frequency of those mutations which recur at the level of 5-10% or more. With luck, some of those will suggest pharmacologic interventions which can be tested for their ability to improve patients' lives.
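As a rough illustration of why panels of 50-plus samples are the right scale, here's my own back-of-the-envelope on the chance of catching a recurrent mutation at least once, for a few assumed population frequencies:

```python
# Probability of observing a mutation at least once in a panel of n tumors,
# assuming it recurs independently at population frequency p.
# Panel sizes and frequencies are illustrative assumptions.

def prob_at_least_one(p, n):
    return 1 - (1 - p) ** n

for p in (0.05, 0.10):
    for n in (50, 100, 200):
        print(f"freq {p:.0%}, panel {n:3d}: "
              f"P(seen >= once) = {prob_at_least_one(p, n):.2f}")
# At 5% frequency a 50-sample panel already gives ~0.92 odds of seeing it once.
```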

Pleasance, E., Stephens, P., O’Meara, S., McBride, D., Meynert, A., Jones, D., Lin, M., Beare, D., Lau, K., Greenman, C., Varela, I., Nik-Zainal, S., Davies, H., Ordoñez, G., Mudie, L., Latimer, C., Edkins, S., Stebbings, L., Chen, L., Jia, M., Leroy, C., Marshall, J., Menzies, A., Butler, A., Teague, J., Mangion, J., Sun, Y., McLaughlin, S., Peckham, H., Tsung, E., Costa, G., Lee, C., Minna, J., Gazdar, A., Birney, E., Rhodes, M., McKernan, K., Stratton, M., Futreal, P., & Campbell, P. (2009). A small-cell lung cancer genome with complex signatures of tobacco exposure Nature DOI: 10.1038/nature08629
Pleasance, E., Cheetham, R., Stephens, P., McBride, D., Humphray, S., Greenman, C., Varela, I., Lin, M., Ordóñez, G., Bignell, G., Ye, K., Alipaz, J., Bauer, M., Beare, D., Butler, A., Carter, R., Chen, L., Cox, A., Edkins, S., Kokko-Gonzales, P., Gormley, N., Grocock, R., Haudenschild, C., Hims, M., James, T., Jia, M., Kingsbury, Z., Leroy, C., Marshall, J., Menzies, A., Mudie, L., Ning, Z., Royce, T., Schulz-Trieglaff, O., Spiridou, A., Stebbings, L., Szajkowski, L., Teague, J., Williamson, D., Chin, L., Ross, M., Campbell, P., Bentley, D., Futreal, P., & Stratton, M. (2009). A comprehensive catalogue of somatic mutations from a human cancer genome Nature DOI: 10.1038/nature08658
Monday, December 14, 2009
Panda Genome Published!
Today's big genomics news is the advance publication in Nature of the giant panda (aka panda bear) genome sequence. I'll be fighting someone (TNG) for my copy of Nature!
Pandas are the first bear (and alas, there is already someone making the mistaken claim otherwise in the Nature online comments) and only the second member of Carnivora (after the dog) with a draft genome sequence. Little in the genome sequence suggests that they have abandoned meat for a nearly all-plant diet, other than an apparent knockout of the taste receptor for glutamate, a key component of the taste of meat. So if you prepare bamboo for the pandas, don't bother with any MSG! But pandas do not appear to have acquired enzymes for attacking their bamboo, suggesting that their gut microflora do a lot of the work. So a panda gut microbiome project is clearly on the horizon. The sequence also greatly advances panda genetics: only 13 panda genes were previously sequenced.
The assembly is notable for being composed entirely of Solexa data using a mixture of library insert lengths. One issue touched on here (and which I've seen commented on elsewhere) is that the longer mate pair libraries have serious chimaera issues and were not trusted to simply be fed into the assembly program, but were carefully added in a stepwise fashion (stepping up in library length) during later stages of assembly. It will be interesting to see what the Pacific Biosciences instrument can do in this regard -- instead of trying to edit out the middle of large inserts by enzymatic and/or physical means, PacBio apparently has a "dark fill" procedure of pulsing unlabeled nucleotides. This leads to islands of sequence separated by signal gaps of known time, which can be used to estimate distance. Presumably such an approach will not suffer from chimaeras, though the raw base error rate may be higher.
I'm quite confused by their Table 1, which shows the progress of their assembly as different data were added in. The confusing part is that it shows progressive improvement in the N50 and N90 numbers with each step -- and then much worse numbers for the final assembly. The final N50 is 40Kb, which is substantially shorter than dog (close to 100Kb) but longer than platypus (13Kb). It strikes me that a useful additional statistic (or actually set of statistics) for a mammalian genome would be to calculate what fraction of core mammalian genes (which would have to be defined) are contained on a single contig (or for what fraction will you find at least 50% of the coding region in one contig).
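For concreteness, a minimal sketch of both statistics -- standard N50 from contig lengths, plus the proposed fraction of genes wholly contained in one contig (the gene-to-contig mapping is an invented input format, not anything from the paper):

```python
# N50: the contig length at which half of the total assembly span is in
# contigs of that length or longer.
def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

# Proposed metric: fraction of core genes whose coding region lies entirely
# on one contig. 'gene_to_contigs' maps each gene to the set of contigs its
# coding exons hit -- an assumed, illustrative input format.
def single_contig_fraction(gene_to_contigs):
    genes = list(gene_to_contigs)
    on_one = sum(1 for g in genes if len(gene_to_contigs[g]) == 1)
    return on_one / len(genes) if genes else 0.0

print(n50([100, 80, 60, 40, 20]))                                # 80
print(single_contig_fraction({"A": {"c1"}, "B": {"c2", "c3"}}))  # 0.5
```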
While the greatest threat to the panda's continued existence in the wild is habitat destruction, it is heartening to find out that pandas have a high degree of genetic variability -- almost twice the heterozygosity of people. So there is apparently a lot of genetic diversity packed into the small panda population (around 1600 individuals, based on DNA sampling of scat).
BTW, no that is not the subject panda (Jingjing, who was the mascot for the Beijing Olympics) but rather my shot from our pilgrimage last summer to the San Diego Zoo. I think that is Gao Gao, but I'm not good about noting such things.
(update: forgot to put the Research Blogging bit in the post)

Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B., Bai, Y., Zhang, Z., Zhang, Y., Wang, W., Li, J., Wei, F., Li, H., Jian, M., Li, J., Zhang, Z., Nielsen, R., Li, D., Gu, W., Yang, Z., Xuan, Z., Ryder, O., Leung, F., Zhou, Y., Cao, J., Sun, X., Fu, Y., Fang, X., Guo, X., Wang, B., Hou, R., Shen, F., Mu, B., Ni, P., Lin, R., Qian, W., Wang, G., Yu, C., Nie, W., Wang, J., Wu, Z., Liang, H., Min, J., Wu, Q., Cheng, S., Ruan, J., Wang, M., Shi, Z., Wen, M., Liu, B., Ren, X., Zheng, H., Dong, D., Cook, K., Shan, G., Zhang, H., Kosiol, C., Xie, X., Lu, Z., Zheng, H., Li, Y., Steiner, C., Lam, T., Lin, S., Zhang, Q., Li, G., Tian, J., Gong, T., Liu, H., Zhang, D., Fang, L., Ye, C., Zhang, J., Hu, W., Xu, A., Ren, Y., Zhang, G., Bruford, M., Li, Q., Ma, L., Guo, Y., An, N., Hu, Y., Zheng, Y., Shi, Y., Li, Z., Liu, Q., Chen, Y., Zhao, J., Qu, N., Zhao, S., Tian, F., Wang, X., Wang, H., Xu, L., Liu, X., Vinar, T., Wang, Y., Lam, T., Yiu, S., Liu, S., Zhang, H., Li, D., Huang, Y., Wang, X., Yang, G., Jiang, Z., Wang, J., Qin, N., Li, L., Li, J., Bolund, L., Kristiansen, K., Wong, G., Olson, M., Zhang, X., Li, S., Yang, H., Wang, J., & Wang, J. (2009). The sequence and de novo assembly of the giant panda genome Nature DOI: 10.1038/nature08696
Sunday, November 22, 2009
Targeted Sequencing Bags a Diagnosis
A nice complement to the one paper (Ng et al) I detailed last week is a paper that actually came out just beforehand (Choi et al). Whereas the Ng paper used whole exome targeted sequencing to find the mutation for a previously unexplained rare genetic disease, the Choi et al paper used a similar scheme (though with a different choice of targeting platform) to find a known mutation in a patient, thereby diagnosing the patient.
The patient in question has a tightly interlocked pedigree (Figure 2), with two different consanguineous marriages shown. Put another way, this person could trace 3 paths back to one set of great-great-grandparents. Hence, they had quite a bit of DNA which was identical-by-descent, which meant that in these regions any low-frequency variant call could be safely ignored as noise. A separate scan with a SNP chip was used to identify such regions independently of the sequencing.
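A toy sketch of that filtering logic -- inside identical-by-descent blocks, treat low allele-fraction calls as noise; the intervals, threshold and call format are all invented for illustration:

```python
# Filter variant calls inside identical-by-descent (IBD) regions:
# within such regions the patient should be homozygous, so calls supported
# by only a small fraction of reads are likely noise. All thresholds and
# the input format are illustrative assumptions, not the study's pipeline.

def in_ibd(chrom, pos, ibd_regions):
    return any(c == chrom and start <= pos <= end
               for c, start, end in ibd_regions)

def filter_calls(calls, ibd_regions, min_alt_fraction=0.85):
    kept = []
    for chrom, pos, alt_fraction in calls:
        if in_ibd(chrom, pos, ibd_regions) and alt_fraction < min_alt_fraction:
            continue  # low-fraction call inside an IBD block: treat as noise
        kept.append((chrom, pos, alt_fraction))
    return kept

ibd = [("chr7", 1_000_000, 5_000_000)]
calls = [("chr7", 2_000_000, 0.48), ("chr7", 3_000_000, 0.96), ("chr2", 500, 0.52)]
print(filter_calls(calls, ibd))  # drops the 0.48 call inside the IBD block
```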
The patient was a 5 month old male, born prematurely at 30 weeks and with "failure to thrive and dehydration". Two spontaneous abortions and the death of another premature sibling at day 4 also characterized this family; a litany of miserable suffering. Due to imbalances in the standard blood chemistry (I wish the reviewers had insisted on further explanation for those of us who don't frequent that world), a kidney defect was suspected but other causes (such as infection) were not excluded.
The exome capture was this time on the Nimblegen platform, followed by Illumina sequencing. This is not radically different from the Ng paper, which used Agilent capture and Illumina sequencing. At the moment Nimblegen & Agilent appear to be the only practical options for whole exome-scale capture, though there are many capture schemes published and quite a few available commercially. Lots of variants were found. One that immediately grabbed attention was a novel missense mutation which was homozygous and in a known chloride transporter, SLC26A3. This missense mutation (D652N) targets a position which is almost utterly conserved across the protein family, and makes a significant change in side chain (acidic to polar uncharged). Most importantly, SLC26A3 has already been shown to cause "congenital chloride-losing diarrhea" (CLD) when mutated at other positions. Clinical follow-up confirmed that fluid loss was through the intestines and not the kidneys.
One of the genetic diseases of the kidney that had been considered was Bartter syndrome, which the more precise blood chemistry did not match. Given that one patient had been suspected of Bartter but instead had CLD, the group screened 39 more patients with Bartter but lacking mutations in 4 different genes linked to this syndrome. 5 of these patients had homozygous mutations in SLC26A3, 2 of which were novel. 190 control chromosomes were also sequenced; none had mutations. 3 of these patients had further follow-up & confirmation of water loss through the gastrointestinal tract.
This study again illustrates the utility of targeted sequencing for clinical diagnosis of difficult cases. While a whole exome scan is currently in the neighborhood of $20K, more focused searches could be run far cheaper. The challenge will be in designing economical panels well, allowing the most important genes to be scanned at low cost. Presumably one could go through OMIM and find all diseases & syndromes which alter electrolyte levels and their known causative gene(s). Such panels might be doable for perhaps as low as $1-5K per sample; too expensive for routine newborn screening but far better than an endless stream of tests. Of course, such panels would miss novel genes or really odd presentations, so follow-up of negative results with whole exome sequencing might be required. With newer sequencing platforms available, the costs for this may plummet to a few hundred dollars per test, which is probably on par with what the current screening of newborns for inborn errors runs. One impediment to commercial development in this field may well be the rapid evolution of platforms; companies may be hesitant to bet on a technology that will not last.
Of course, to some degree the distinction between the two papers is artificial. The Ng et al paper actually, as I noted, did diagnose some of their patients with known genetic disease. Similarly, the patients in this study who are now negative for known Bartter syndrome genes and for CLD would be candidates for whole exome sequencing. In the end, what matters is to make the right diagnosis for each patient so that the best treatment or supportive care can be selected.

Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloğlu A, Ozen S, Sanjad S, Nelson-Williams C, Farhi A, Mane S, & Lifton RP (2009). Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proceedings of the National Academy of Sciences of the United States of America, 106 (45), 19096-101 PMID: 19861545
Thursday, November 19, 2009
Three Blows Against the Tyranny of Expensive Experiments
Second generation sequencing is great, but one of its major issues so far is that the cost of one experiment is quite steep. Just looking at reagents, going from a ready-to-run library to sequence data is somewhere in the neighborhood of $10K-25K on 454, Illumina, Helicos or SOLiD (I'm willing to take corrections on these values, though they are based on reasonable intelligence). While in theory you can split this cost over multiple experiments by barcoding, that can be very tricky to arrange. Perhaps if core labs would start offering '1 lane of Illumina - Buy It Now!' on eBay the problem could be solved, but finding a spare lane isn't easy.
This issue manifests itself in other ways. If you are developing new protocols anywhere along the pipeline, your final assay is pretty expensive, making it challenging to iterate cheaply. I've heard rumors that even some of the instrument makers feel inhibited in process development. It can also make folks a bit gun shy; Amanda heard firsthand tonight from someone lamenting a project stymied under such circumstances. Even for routine operations, the methods of QC are pretty inexact, insofar as they don't really test whether the library is any good, just whether some bulk property (size, PCRability, quantity) is within a spec. This huge atomic cost is also a huge barrier to utilization in a clinical setting; does the clinician really want to wait some indefinite amount of time until enough patient samples are queued to make the cost per sample reasonable?
Recently, I've become aware of three hopeful developments on this front. The first is the Polonator, which according to Kevin McCarthy has a consumable cost of only about $500 per run (post library construction). $500 isn't nothing to risk on a crazy idea, but it sure beats $10K. There aren't many Polonators around, but for method development in areas such as targeted capture it would seem like a great choice.
Today, another shoe fell. Roche has announced a smaller version of the 454 system, the GS Junior. While the instrument cost wasn't announced, it will supposedly generate 1/10th as much data (35+Mb from 100K reads with 400 Q20 bases) for the same cost per basepair, suggesting that the reagent cost for a run will be in the neighborhood of $2.5K. Worse than what I described above, but rather intriguing. This is a system that may have a good chance to start making clinical inroads; $2.5K is a bit steep for a diagnostic but not ridiculous -- or you simply need to multiplex fewer samples to get the cost per sample decent. The machine is going to boast 400+bp reads, playing to the current comparative strength of the 454 chemistry. While I doubt anyone would buy such a machine solely as an upfront QC for SOLiD or Illumina, with some clever custom primer design one probably could make libraries usable on 454 plus one other platform.
It's an especially auspicious time for Roche to launch their baby 454, as Pacific Biosciences released some specs through GenomeWeb's In Sequence, and from what I've been able to scrounge up (I can't quite talk myself into asking for a subscription) this is going to put some real pressure across the market, but particularly on 454. The key specs I can find are a per run cost of $100, which will get you approximately 25K-30K reads of 1.5Kb each -- or around 45Mb of data. It may also be possible to generate 2X the data for nearly the same cost; apparently the reagents packed with one cell are really good for two runs in series. Each cell takes 10-15 minutes to run (at least in some workflows) and the instrument can be loaded up with 96 of them to be handled serially. This is a similar ballpark to what the GS Junior is being announced with, though with fewer reads but longer read lengths. I haven't been able to find any error rate estimates or the instrument cost. I'll assume, just because it is new and single molecule, that the error rate will give Roche some breathing room.
But in general, PacBio looks set to really grab the market where long reads, even noisy ones, are valuable. One obvious use case is transcriptome sequencing to find alternative splice forms. Another would be to provide 1.5Kb scaffolds for genome assembly; what I've found also suggests PacBio will offer a 'strobe sequencing' mode which is akin to Helicos' dark filling technology, which is a means to get widely spaced sequence islands. This might provide scaffolding information in much larger fragments. 10Kb? 20Kb? And again, though you probably wouldn't buy the machine just for this, at $100/run it looks like a great way to QC samples going into other systems. Imagine checking a library after initial construction, then after performing hybridization selection and then after another round of selection! After all, the initial PacBio instrument won't be great for really deep sequencing. It appears it would be $5K-10K to get approximately 1X coverage of a mammalian genome -- but likely with a high error rate.
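Running the ballpark figures above through a quick sketch (the numbers are this post's approximations, not official pricing) shows where that $5K-10K estimate for 1X mammalian coverage comes from:

```python
# Cost per Mb and cost for ~1X coverage of a ~3 Gb mammalian genome,
# using the approximate per-run yields and costs quoted above.
# These are the post's ballpark figures, not vendor specifications.

platforms = {
    # name: (cost per run in USD, yield per run in Mb)
    "GS Junior": (2_500, 35),
    "PacBio":    (100, 45),
}

HUMAN_GENOME_MB = 3_000  # ~3 Gb

for name, (run_cost, yield_mb) in platforms.items():
    cost_per_mb = run_cost / yield_mb
    runs_for_1x = HUMAN_GENOME_MB / yield_mb
    print(f"{name}: ${cost_per_mb:.0f}/Mb, "
          f"~{runs_for_1x:.0f} runs (~${runs_for_1x * run_cost:,.0f}) for 1X human")
# GS Junior: ~$71/Mb, ~86 runs (~$214,000) for 1X human
# PacBio:    ~$2/Mb,  ~67 runs (~$6,700)   for 1X human
```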
The ability to easily sequence 96 samples at a time (though it isn't clear what sample prep will entail) does suggest some interesting possibilities. For example, one could do long survey sequencing of many bacterial species, with each well yielding 10X coverage of an E.coli-sized genome (a lot of bugs are this size or smaller). The data might be really noisy, but for getting a general lay-of-the-land it could be quite useful -- perhaps the data would be too noisy to tell which genes were actually functional vs. decaying pseudogenes, but you would be able to ask "what is the upper bound on the number of genes of protein family X in genome Y". If you really need high quality sequence, then a full run (or targeted sequencing) could follow.
At $100 per experiment, the sagging Sanger market might take another hit. If a quick sample prep to convert plasmids to usable form is released, then ridiculous oversampling (imagine 100K reads on a typical 1.5Kb insert in pUC scenario!) might overcome a high error rate.
One interesting impediment which PacBio has acknowledged is that they won't be able to ramp up instrument production as quickly as they might like and will be trying to place (ration) instruments strategically. I'm hoping at least one goes to a commercial service provider or a core lab willing to solicit outside business, but I'm not going to count on it.
Will Illumina & Life Technologies (SOLiD) try to create baby sequencers? Illumina does have a scheme to convert their array readers to sequencers, but from what I've seen these aren't expected to save much on reagents. Life does own the VisiGen technology, which is apparently similar to PacBio's but hasn't yet published a real proof-of-concept paper -- at least that I could find; their key patent has issued -- reading material for another night.
Sunday, November 15, 2009
Targeted Sequencing Bags a Rare Disease
Nature Genetics on Friday released the paper from Jay Shendure, Debra Nickerson and colleagues which used targeted sequencing to identify the damaged gene in a rare Mendelian disorder, Miller syndrome. The work had been presented at least in part at recent meetings, but now all of us can digest it in entirety.
The impressive economy of this paper is that they targeted (using Agilent chips) less than 30Mb of the human genome, which is less than 1%. They also worked with very few samples; only about 30 cases of Miller syndrome have been reported in the literature. While I've expressed some reservations about "exome sequencing", this paper does illustrate why it can be very cost effective; my objections (perhaps not made clear enough before) are more a worry about being too restricted to "exomes" and less about targeting.
Only four affected individuals (two siblings and two individuals unrelated to anyone else in the study) were sequenced, each at around 40X coverage of the targeted regions. Since Miller is so vanishingly rare, the causative mutations should be absent from samples of human diversity such as dbSNP or the HapMap, so these were used as a filter. Non-synonymous (protein-altering) mutations, splice site mutations & coding indels were considered as candidates. Both dominant and recessive models were considered. Combining the data from both siblings, 228 candidate dominant genes and 9 recessive ones fell out. Adding in the two unrelated individuals zeroed in on a single gene, DHODH, under the recessive model (but 8 under the dominant model). Using a conservative statistical model, the odds of finding this by chance were estimated at 1.5x10^-5.
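Here's a toy version of that filtering logic -- drop known SNPs, keep protein-altering variants, require two hits per gene under the recessive model, and intersect across affected individuals; the data and format are invented, not the study's actual pipeline:

```python
# Toy candidate-gene filter in the spirit of the approach described above.
# Drop variants seen in dbSNP/HapMap, keep protein-altering classes, then
# under a recessive model require >=2 qualifying alleles per gene in each
# affected individual and intersect genes across individuals.
# All data below are illustrative.

FUNCTIONAL = {"missense", "nonsense", "splice", "coding_indel"}

def recessive_candidates(variants, known_snps):
    """variants: list of (gene, variant_id, effect) for one individual."""
    counts = {}
    for gene, var_id, effect in variants:
        if var_id in known_snps or effect not in FUNCTIONAL:
            continue
        counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, n in counts.items() if n >= 2}

known = {"rs123"}
ind1 = [("DHODH", "v1", "missense"), ("DHODH", "v2", "missense"), ("GENEX", "rs123", "missense")]
ind2 = [("DHODH", "v3", "missense"), ("DHODH", "v4", "splice"), ("GENEY", "v5", "nonsense")]

shared = recessive_candidates(ind1, known) & recessive_candidates(ind2, known)
print(shared)  # {'DHODH'}
```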
An interesting curve was thrown by nature. If predictions were made as to whether mutations would be damaging, then DHODH was excluded as a candidate gene under a recessive model. Both siblings carried one allele (G605A) predicted to be neutral but another allele predicted to be damaging.
Another interesting curve is a second gene, DNAH5, which was a candidate considering only the siblings' data but ruled out by the other two individuals' data. However, this gene is already known to be linked to a Mendelian disorder. The two siblings had a number of symptoms which do not fit with any other Miller case -- and well fit the symptoms of DNAH5 mutation. So these two individuals have two rare genetic diseases!
Getting back to DHODH, is it the culprit in Miller? Sequencing three further unrelated patients found them all to be compound heterozygotes for mutations predicted to be damaging. So it becomes reasonable to infer that a false prediction of non-damaging was made for G605A. Sequencing of DHODH in parents of the affected individuals confirmed that each was a carrier, ruling out DHODH as a causative gene under a dominant model.
DHODH is known to encode dihydroorotate dehydrogenase, which catalyzes a biochemical step in the de novo synthesis of pyrimidines. This is a pathway targeted by some cancer chemotherapies, with the unfortunate result that some individuals are exposed to these drugs in utero -- and these persons manifest symptoms similar to Miller syndrome. Furthermore, another genetic disease (Nager syndrome) has great overlap in symptoms with Miller -- but sequencing of DHODH in 12 unrelated patients failed to find any coding mutations.
The authors point to the possible impact of this approach. They note that there are 7,000 diseases which affect fewer than 200K patients in the U.S. (a widely used definition of rare disease), but in aggregate this is more than 25M persons. Identifying the underlying mutations for a large fraction of these diseases would advance our understanding of human biology greatly, and with a bit of luck some of these mutations will suggest practical therapeutic or dietary approaches which can ameliorate the disease.
Despite the success here, they also underline opportunities for improvement. First, in some cases variant calling was difficult due to poor coverage in repeated regions. Conversely, some copy number variation manifested itself in false positive calls of variation. Second, the SNP databases for filtering will be most useful if they are derived from similar populations; if studying patients with a background poorly represented in dbSNP or HapMap then those databases won't do.
How economical a strategy would this be? Whole exome sequencing on this scale can be purchased for a bit under $20K per individual; to try to do this by Sanger would probably be at least 25X that. So whole exome sequencing of the 4 original individuals would be less than $100K for sequencing (but clearly a bunch more for interpretation, sample collection, etc). The follow-up sequencing would add a bit, but probably less than one exome's worth of sequencing. Even if a study turned up a lot of candidate variants, smaller scale targeted sequencing can be had for $5K or less per sample. Digging into the methods, the study actually used two passes of array capture -- the second to clean up what wasn't captured well by the first array design & to add newer gene predictions. This is a great opportunity to learn from these projects -- the array designs can keep being refined to provide even coverage across the targeted genes. And, of course, as the cost per base of the sequencing portion continues its downward slide this will get even more attractive -- or possibly simply be displaced by really cheap whole genome sequencing. If the cost of the exome sequencing can be approximately halved, then perhaps a project similar to this could be run for around $100K.
So, if 7,000 diseases could each be examined at $100K per disease, that would come out to $700M -- hardly chump change. This underlines the huge utility of getting sequencing costs down another order of magnitude. At $1000/genome, the sequencing costs of the project would stop grossly overshadowing the other key areas -- sample collection & data interpretation. If the total cost of such a project could be brought down closer to $20K, then we're looking at $140M to investigate all described rare genetic disorders. That's not to say it shouldn't be done at $700M or even several times that, but ideally some of the money saved by cheaper sequencing could go to elucidating the biology of the causative alleles such a campaign would unearth, because certainly many of them will be much more enigmatic than DHODH.

Sarah B. Ng, Kati J. Buckingham, Choli Lee, Abigail W. Bigham, Holly K. Tabor, Karin M. Dent, Chad D. Huff, Paul T. Shannon, Ethylin Wang Jabs, Deborah A. Nickerson, Jay Shendure, & Michael J. Bamshad (2009). Exome sequencing identifies the cause of a mendelian disorder Nature Genetics DOI: 10.1038/ng.499
The impressive economy of this paper is that they targeted (using Agilent chips) less than 30Mb of the human genome, which is less than 1%. They also worked with very few samples; only about 30 cases of Miller Syndrome have been reported in the literature. While I've expressed some reservations about "exome sequencing", this paper does illustrate why it can be very cost effective and my objections (perhaps not made clear enough before) is more a worry about being too restricted to "exomes" and less about targeting.
Only four affected individuals (two siblings and two individuals unrelated to anyone else in the study) were sequenced, each at around 40X coverage of the targeted regions. Since Miller is so vanishingly rare, the causative mutations should be absent from samples of human diversity such as dbSNP or the HapMap, so these was used as a filter. Non-synonymous (protein-altering), splice site mutations & coding indels were considered as candidates. Both dominant models and recessive models were considered. Combining the data from both siblings, 228 candidate dominant genes and 9 recessive ones fell out. Looking then to the unrelated individuals zeroed in on a single gene, DHODH, under the recessive model (but 8 in the dominant model). Using a conservative statistical model, the odds of finding this by chance were estimated at 1.5x10e-05.
An interesting curve was thrown by nature. If predictions were made as to whether mutations would be damaging, then DHODH was excluded as a candidate gene under a recessive model. Both siblings carried one allele (G605A) predicted to be neutral but another allele predicted to be damaging.
Another interesting curve is a second gene, DNAH5, which was a candidate considering only the siblings' data but ruled out by the other two individuals' data. However, this gene is already known to be linked to a Mendelian disorder. The two siblings had a number of symptoms which do not fit with any other Miller case -- and well fit the symptoms of DNAH5 mutation. So these two individuals have two rare genetic diseases!
Getting back to DHODH, is it the culprit in Miller? Sequencing three further unrelated patients found them all to be compound heterzygotes for mutations predicted to be damaging. So it becomes reasonable to infer that a false prediction of non-damaging was made for G605A. Sequencing of DHODH in parents of the affected individuals confirmed that each was a carrier, ruling out DHODH as a causative gene under a dominant model.
DHODH is known to encode dihydroorotate dehydrogenase, which catalyzes a biochemical step in the de novo synthesis of pyrimidines. This is a pathway targeted in some cancer chemotherapies, with the unfortunate result that some individuals are exposed to these drugs in utero -- and these persons manifest symptoms similar to Miller syndrome. Furthermore, another genetic disease (Nagler) has great overlap in symptoms with Miller -- but sequencing of DHODH in 12 unrelated patients failed to find any coding mutations in DHODH.
The authors point to the possible impact of this approach. They note that there are some 7,000 diseases which each affect fewer than 200K patients in the U.S. (a widely used definition of a rare disease), but in aggregate this is more than 25M persons. Identifying the underlying mutations for a large fraction of these diseases would greatly advance our understanding of human biology, and with a bit of luck some of these mutations will suggest practical therapeutic or dietary approaches which can ameliorate the disease.
Despite the success here, they also underline opportunities for improvement. First, in some cases variant calling was difficult due to poor coverage in repeated regions; conversely, some copy number variation manifested itself as false positive variant calls. Second, the SNP databases used for filtering will be most useful if they are derived from similar populations; for patients with a background poorly represented in dbSNP or HapMap, those databases won't do.
How economical a strategy would this be? Whole exome sequencing on this scale can be purchased for a bit under $20K per individual; to do this by Sanger would probably cost at least 25X that. So whole exome sequencing of the 4 original individuals would be less than $100K for sequencing (but clearly a bunch more for interpretation, sample collection, etc.). The follow-up sequencing would add a bit, but probably less than one exome's worth of sequencing. Even if a study turned up a lot of candidate variants, smaller scale targeted sequencing can be had for $5K or less per sample. Digging into the methods, the study actually used two passes of array capture -- the second to clean up what wasn't captured well by the first array design & to add newer gene predictions. This is a great opportunity to learn from these projects -- the array designs can keep being refined to provide even coverage across the targeted genes. And, of course, as the cost per base of sequencing continues its downward slide this will get even more attractive -- or possibly simply be displaced by really cheap whole genome sequencing. If the cost of exome sequencing can be approximately halved, then perhaps a project similar to this one could be run for around $100K per disease.
So, if 700 diseases could each be examined at $100K/disease, that would come out to $70M -- hardly chump change. This underlines the huge utility of getting sequencing costs down another order of magnitude. At $1000/genome, the sequencing costs of the project would stop grossly overshadowing the other key areas -- sample collection & data interpretation. If the per-disease cost could be brought down closer to $20K, then we're looking at $14M for those 700 disorders -- or on the order of $140M to cover all of the described rare genetic diseases. That's not to say it shouldn't be done at $70M or even several times that, but ideally some of the money saved by cheaper sequencing could go to elucidating the biology of the causative alleles such a campaign would unearth, because certainly many of them will be much more enigmatic than DHODH.
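For what it's worth, here is the back-of-envelope arithmetic behind those figures as a small Python sketch. All the dollar amounts are the rough estimates quoted in this post, not vendor pricing.

```python
# Back-of-envelope arithmetic using the rough numbers quoted in this post.
exome_cost = 20_000        # ~$20K per whole-exome sequence today
index_cases = 4            # the four affected individuals sequenced here
followup = exome_cost      # follow-up targeted sequencing ~ one more exome's worth

per_disease = exome_cost * index_cases + followup
print(f"Sequencing cost per disease: ~${per_disease:,}")      # ~$100,000

for n_diseases in (700, 7_000):
    print(f"{n_diseases:>5} diseases at ~$100K each: ${n_diseases * 100_000 / 1e6:,.0f}M")
    print(f"{n_diseases:>5} diseases at ~$20K each:  ${n_diseases * 20_000 / 1e6:,.0f}M")
```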

Sarah B. Ng, Kati J. Buckingham, Choli Lee, Abigail W. Bigham, Holly K. Tabor, Karin M. Dent, Chad D. Huff, Paul T. Shannon, Ethylin Wang Jabs, Deborah A. Nickerson, Jay Shendure, & Michael J. Bamshad (2009). Exome sequencing identifies the cause of a Mendelian disorder. Nature Genetics. doi:10.1038/ng.499
Thursday, November 12, 2009
A 10,201 Genomes Project
With valuable information emerging from the 1000 (human) genomes project and now a proposal for a 10,000 vertebrate genome project, it's well past time to expose to public scrutiny a project I've been spitballing for a while, which I now dub the 10,201 genomes project. Why that? Well, first it's a bigger number than the others. Second, it's 101 squared.
Okay, perhaps my faithful assistant is swaying me, but I still think it's a useful concept, even if for the time being it must remain a gehunden experiment. All kidding aside, the goal would be to sequence the full breadth of caninity, with the prime focus on elucidating the genetic machinery of mammalian morphology. In my biological world, that would be more than enough to justify such a project once the price tag comes down to a few million dollars. With some judicious choices, some fascinating genetic influences on complex behaviors might also emerge. And yes, there is a possibility of some of this feeding back to useful medical advances, though one should be honest and say that this is likely to be a long and winding road. Claiming that every project will impact medicine really devalues the claim for the ones that genuinely will.
The general concept would be to collect samples from multiple individuals of every known dog breed, paying attention to important variation within breed standards. It would also be valuable to collect well-annotated samples from individuals who are not purebred but exhibit interesting morphology. For example, I've met a number of "labradoodles" (Labrador retriever x poodle) and they exhibit a wide range of sizes, coat colors and other characteristics -- precisely the fodder for such an experiment. In a similar manner, it is said that the same breed from geographically distant breeders may be quite distinct, so it would be valuable to collect individuals from far and wide. But going beyond domesticated dogs, it would be useful to sequence all the wild species as well. With genomes at $1K a run, this would make good sense. Of particular interest for a non-dog genome is the case of the fox lines, which have been bred over just a half century into a very docile line and a second selected for aggressive tendencies.
What realistically could we expect to find? One would expect novel genes to leap out, as was the case with the short-legged breeds. Presumably regions which have undergone selective sweeps would be spottable as well and linkable to traits. A wealth of high-resolution copy number information would certainly emerge.
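As an illustration of the sort of analysis such a dataset would feed (my own sketch, not part of any proposed pipeline), one crude way to spot candidate sweeps is to scan windows of pooled SNP calls within a breed for unusually low heterozygosity. The input format and the thresholds below are invented for the example; real scans would use more sophisticated statistics such as Fst or haplotype-based tests.

```python
# Crude sweep scan: flag genomic windows where pooled SNP data from one breed show
# unusually low expected heterozygosity. Input format and cutoffs are invented.
def windowed_heterozygosity(snps, window=100_000):
    """snps: iterable of (position, ref_count, alt_count) from one breed's pooled reads."""
    windows = {}
    for pos, ref, alt in snps:
        p = alt / (ref + alt)                      # alternate allele frequency
        windows.setdefault(pos // window, []).append(2 * p * (1 - p))
    return {k * window: sum(v) / len(v) for k, v in sorted(windows.items())}

def candidate_sweeps(snps, window=100_000, cutoff=0.05):
    """Return window start positions whose mean heterozygosity falls below a cutoff."""
    return [start for start, het in windowed_heterozygosity(snps, window).items()
            if het < cutoff]
```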
Is it worth funding? Well, I'm obviously biased. But the 10,000 vertebrate genome proposal has already kicked up some dust from those disappointed that the genomics community has not had "an inordinate fondness for beetles" (only one beetle genome sequenced so far). Genome sequencing is going to get much cheaper, but never "too cheap to meter". De novo projects will always be inherently more expensive due to more extensive informatics requirements -- the first annotation of a genome is highly valuable but requires extensive effort. I too am disappointed that arthropods haven't been sampled more broadly, and it's hard to imagine folks in the evo-devo world being happy about it either.
It's hard for me to argue against sequencing thousands of human germlines to uncover valuable medical information or to sequence tens of thousands of somatic cancer genomes for the same reason. But, even so I'd hate to see that push out funding for filling in more information about the tree of life. Still, do we really need 10,000 vertebrate genomes in the near future or 10,201 dog genomes? If the trade for doing only 5,000 additional vertebrates is doing 5,000 diverse invertebrates, I think that is hard to argue against. Depth vs. breadth will always be a challenging call, but perhaps breadth should be favored a bit more -- at least once I'm funded for my ultra-deep project!
Wednesday, October 14, 2009
Why I'm Not Crazy About The Term "Exome Sequencing"
I find myself worrying sometimes that I worry too much about the words I use -- and worrying some of the rest of the time that I don't worry enough. What seems like the right word at one time might seem wrong at another. The term "killer app" is thrown around a lot in the tech space, but would you really want to hear it used about sequencing a genome if you were the patient whose DNA was under scrutiny?
One term that sees a lot of traction these days is "exome sequencing". I listened in on a free Science magazine webinar today on the topic, and the presentations were all worthwhile. The focus was on the Nimblegen capture technology (Roche/Nimblegen/454 sponsored the webinar), though other technologies were touched on.
By "exome sequencing" what is generally meant is to capture & sequence the exons in the human genome in order to find variants of interest. Exons have the advantage of being much more interpretable than non-coding sequences; we have some degree of theory (though quite incomplete) which enables prioritizing these variants. The approach also has the advantage of being significantly cheaper at the moment than whole genome sequencing (one speaker estimated $20K per exome). So what's the problem?
My concern is that the term "exome sequencing" is taken a bit too literally. Now, it is true that these approaches catch a bit of surrounding DNA due to library construction, and the targeting designs cover splice junctions, but what about some of the other important sequences? According to my poll of practitioners of this art, their targets are entirely exons (confession: N=1 for the poll).
I don't have a general theory for analyzing non-coding variants, but there are quite a few well annotated non-coding regions of functional significance. An obvious case is promoters. Annotation of human promoters, enhancers and other transcriptional doodads is an ongoing process, but some have been well characterized. In particular, the promoters for many drug metabolizing enzymes have been scrutinized because variants there may have significant effects on how much of the enzyme is synthesized, and therefore on drug metabolism.
Partly coloring my concern is the fact that exome sequencing kits are becoming standardized; at least two are on the market currently. Hence, the design shortcomings of today might influence a lot of studies. Clearly sequencing every last candidate promoter or enhancer would tend to defeat the advantages of exome sequencing, but I believe a reasonable shortlist of important elements could be rapidly identified.
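As a concrete illustration of what such an augmented design might look like, here is a small Python sketch that appends promoter windows for a shortlist of genes to an existing exome target BED file. The file names, the gene shortlist and the 1kb promoter window are all illustrative assumptions, not anyone's actual capture design.

```python
# Sketch: extend an exome capture design with promoter windows for a shortlist of
# genes (here, a hypothetical drug metabolizing enzyme list). File names, the gene
# list and the 1 kb window are illustrative, not an actual kit design.
PROMOTER_BP = 1_000
SHORTLIST = {"CYP2D6", "CYP3A4", "CYP2C9", "UGT1A1"}

def promoter_interval(chrom, tss, strand, pad=PROMOTER_BP):
    """Return (chrom, start, end) for the region immediately upstream of a TSS."""
    if strand == "+":
        return chrom, max(0, tss - pad), tss
    return chrom, tss, tss + pad

def augment_targets(exome_bed, tss_table, out_bed):
    """tss_table: iterable of (gene, chrom, tss, strand) annotation records."""
    with open(out_bed, "w") as out:
        with open(exome_bed) as existing:          # keep the original exon targets
            out.writelines(existing)
        for gene, chrom, tss, strand in tss_table:
            if gene in SHORTLIST:
                c, start, end = promoter_interval(chrom, tss, strand)
                out.write(f"{c}\t{start}\t{end}\t{gene}_promoter\n")
```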
My own professional interest area, cancer genomics, adds some additional twists. At least one major cancer genome effort (at the Broad) is using exome sequencing. On the one hand, it is true that there are relatively few recurrent, focused non-coding alterations documented in cancer. However, few is not none. For example, in lung cancer the c-Met oncogene has been documented to be activated by mutations within an intron; these mutations cause skipping of an exon encoding an inhibitory domain. Some of these alterations lie about 50 nucleotides away from the nearest splice junction -- a distance that is likely to result in low or no coverage using the Broad's in-solution capture technology (confession #2: I haven't verified this with data from that system).
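To illustrate the coverage question (again, my own sketch rather than an analysis of the Broad's design), the check amounts to asking whether a position ~50 nucleotides into the intron falls within the targeted exon intervals plus whatever flanking sequence is effectively captured. The exon coordinates and the 25bp of effective padding below are toy assumptions.

```python
# Sketch: would a position ~50 nt into the intron be covered, given exon targets
# plus an assumed amount of effectively captured flanking sequence? Toy coordinates.
from bisect import bisect_right

def covered(position, exon_targets, effective_pad=25):
    """exon_targets: sorted, non-overlapping (start, end) intervals on one chromosome."""
    padded = [(s - effective_pad, e + effective_pad) for s, e in exon_targets]
    i = bisect_right([s for s, _ in padded], position) - 1
    return i >= 0 and padded[i][0] <= position <= padded[i][1]

toy_exon = [(1_000, 1_140)]                  # NOT real MET coordinates
print(covered(1_140 + 50, toy_exon))         # 50 nt past the exon -> False here
```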
The drug metabolizing enzyme promoters I mentioned before are a bit greyer for cancer genomics. On the one hand, one is generally primarily interested in what somatic mutations have occurred on the tumor. On the other hand, the norm in cancer genomics is tending towards applying the same approach to normal (cheek swab or lymphocyte) DNA from the patient, and why not get the DME promoters too? After all, these variants may have influenced the activity of therapeutic agents or even development of the disease. Just as some somatic mutations seem to cluster enigmatically with patient characteristics, perhaps some somatic mutations will correlate with germline variants which contributed to disease initiation.
Whatever my worries, they should be time-limited. Exome sequencing products will be under extreme pricing pressure from whole genome sequencing. The $20K cited (probably using 454 sequencing) is already potentially matched by one vendor (Complete Genomics). Now, in general the cost of capture will probably be a relatively small contributor compared to the cost of data generation, so exome sequencing will ride much of the same cost curve as the rest of the industry. But, it probably is $1-3K for whole exome capture due to the multiple chips required and the labor investment (anyone have a better estimate?). If whole mammalian genome sequencing really can be pushed down into the $5K range, then mammalian exome sequencing will not offer a huge cost advantage if any. I'd guess interest in mammalian exome sequencing will peak in a year or two, so maybe I should stop worrying and learn to love the hyb.
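The break-even logic is simple enough to write down; here is a rough sketch using the ballpark figures from this post (capture at $1-3K, whole genomes headed toward $5K). The specific numbers plugged in are illustrative assumptions, not quotes.

```python
# Rough break-even check: exome sequencing keeps its edge only while capture cost
# plus its sequencing cost undercuts a whole genome. Numbers are illustrative.
def exome_advantage(capture_cost, exome_seq_cost, genome_cost):
    return genome_cost - (capture_cost + exome_seq_cost)   # positive = exome cheaper

print(exome_advantage(capture_cost=2_000, exome_seq_cost=15_000, genome_cost=20_000))
print(exome_advantage(capture_cost=2_000, exome_seq_cost=2_000, genome_cost=5_000))
```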
Wednesday, September 09, 2009
A Blight's Genome
Normally this time of year I would be watching the weather forecasts, checking for the dreaded early frost which slays tomato plants and is often followed by weeks of mild weather that could have permitted further growth. Alas, this year that will clearly not be the case. A wet growing season and commercial stock contaminated with spores have led to an epidemic of late blight, and my tomato plants (as shown, with the night photography accentuating the horror) are being slaughtered. This weekend I'll clear the whole mess out & for the next few years plant somewhere else.
Late blight is particularly horrid as it attacks both foliage and fruits -- many tomato diseases simply kill the foliage. What looked like promising green tomatoes a week ago are now disgusting brown blobs.
Late blight is caused by Phytophthora infestans, a fungus-like organism. An even more devastating historical manifestation of this ogre was the Great Irish Potato Famine, and it remains a scourge of potato farmers. Given my current difficulties with it, I was quite excited to see the publication of the Phytophthora infestans genome sequence (by Sanger) in Nature today.
A sizable chunk of the paper is devoted to the general structure of the genome, which tops out at 240Mb. Two related plant pathogens, P.sojae (soybean root rot) and P.ramorum (sudden oak death), come in at only 95Mb and 65Mb respectively. What accounts for the increase? While the genome does not seem to be duplicated as a whole, a number of gene families implicated in plant pathogenesis have been greatly expanded.
Transposons are also present in great numbers. About a third of the genome consists of Gypsy-type retrotransposons, and several other classes of transposons are present as well. In the end, just over a quarter (26%) of the genome is non-repetitive. While these transposons do not themselves appear to contain phytopathological genes, their presence appears to be driving expansion of some key families of such genes. Comparison of genomic scaffolds with the other two sequenced Phytophthora species shows striking overall conservation of conserved gene order, but with local rearrangements and expansion of the zones between conserved genes (Figure 1 plus S18 and S19). Continuing evolutionary activity in this space is shown by the fact that some of these genes are apparently inactivated yet carry only small numbers of mutations, suggesting very recent conversion to pseudogenes. A transposon polymorphism was also found -- an insertion in one haplotype which is absent in another (Figure S9).
A curious additional effect is shown off in 2-D plots of 5' vs. 3' intergenic length (Figure 2). Overall this distribution is a huge blob, but some of the pathogenesis gene classes cluster in the quadrant where both flanking intergenic regions are large -- conversely, many of the core genes cluster in the "both small" quadrant. Supplemental Figure S2 shows rather strikingly how splayed the distribution is for P.infestans -- other genomes show much tighter distributions, but P.infestans seems to have intergenic regions at about every possible scale.
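For readers who like to see the computation, here is a small sketch of how one could derive such a plot from a gene annotation: measure the gap from each gene to its upstream and downstream neighbors and bin genes into quadrants. The input format and the 2kb cutoff are illustrative, and strand is ignored for simplicity; a real analysis would orient the 5' and 3' gaps by the gene's strand.

```python
# Sketch of the intergenic-length analysis: for each gene on a scaffold, measure
# the gap to its neighbors and bin it into quadrants. Strand is ignored here for
# simplicity; a real analysis would orient the 5' and 3' gaps by the gene's strand.
def intergenic_lengths(genes):
    """genes: list of (name, start, end) on one scaffold, sorted by start."""
    lengths = {}
    for i, (name, start, end) in enumerate(genes):
        upstream = start - genes[i - 1][2] if i > 0 else None
        downstream = genes[i + 1][1] - end if i + 1 < len(genes) else None
        lengths[name] = (upstream, downstream)
    return lengths

def quadrant(upstream, downstream, cutoff=2_000):
    if upstream is None or downstream is None:
        return "scaffold edge"
    if upstream >= cutoff and downstream >= cutoff:
        return "both large"     # where the pathogenesis gene families tend to sit
    if upstream < cutoff and downstream < cutoff:
        return "both small"     # typical of the core genes
    return "mixed"
```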
The news item accompanying the paper puts all this in perspective: P.infestans is armed with lots of anti-plant weapons which enable it to evolve evasions of plant resistance mechanisms. A quoted plant scientist offers a glum assessment:
"After taking 15 years to incorporate this resistance in a cultivar, it would take Phytophthora infestans only a couple of years to defeat it." Chemical control of P.infestans reportedly works only before the infection is apparent and probably involves stuff I'd rather not play with.
A quick side trip to Wikipedia finds that the genus is a pack of blights. Indeed, Phytophthora is coined from the Greek for "plant destruction". Other horticultural curses from this genus include alder root rot, rhododendron root rot, cinnamon root rot (not of cinnamon, but rather various woody plants) and fruit rots in a wide variety of useful & yummy fruits including strawberries, cucumbers and coconuts. What an ugly family tree!
The Wikipedia entry also sheds light on why these awfuls are referred to as "fungus-like". While they have a life cycle and some morphological similarities to fungi, their cell walls are mostly cellulose and molecular phylogenetics places them closer to plants than to fungi.
So, the P.infestans genome sequence sheds light on how this pathogen can shift its attacks quickly. Unfortunately, as with human genomic medicine, it will take a long time to figure out how to outsmart these assaults, particularly in a manner practical and safe for commercial growers and home gardeners alike.

Haas BJ, et al. (2009). Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature. doi:10.1038/nature08358
Tuesday, September 08, 2009
Next-generation Physical Maps III: HAPPy Maps
A second paper which triggered my current physical map madness is a piece (open access!) arguing for the adaptation of HAPPY mapping to next-gen sequencing. This is intriguing in part because I see (and have) a need for cheap & facile access to the underlying technologies, but also because I think there are some interesting computational problems (not touched on in the paper, as I will elaborate below) and some additional uses for the general approach.
HAPPy mapping is a method developed by Simon Dear for physical mapping. The basic notion is that large DNA fragments (the size range determining several important parameters of the map; the maximum range between two markers is about 10 times the minimum resolution) are randomly gathered into pools, each containing a bit less than one genome equivalent (about 0.7X, so that any given fragment ends up in roughly half the pools). Practical issues limit the maximum fragment size to about 1Mb; any larger and they can't be handled in vitro. By typing markers across these pools, a map can be generated. If two markers are on different chromosomes or are farther apart than the DNA fragment size, then there will be no correlation between them. On the other hand, two markers which are very close together on a chromosome will tend to show up in the same pools. Traditionally, HAPPy pools have been typed by PCR assays designed against known sequences. One beauty of HAPPy mapping is that it is nearly universal; if you can extract high molecular weight DNA from an organism, then a HAPPy map should be possible.
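The core inference can be sketched in a few lines: markers that are present or absent in the same pools more often than chance are inferred to be close. The toy presence/absence vectors and the simple correlation score below are my own illustration; real HAPPY analyses use likelihood-based (LOD-style) linkage statistics.

```python
# Toy illustration of HAPPY-style co-retention scoring across pools.
from math import sqrt

def coretention(marker_a, marker_b):
    """Correlation of two 0/1 presence vectors across the same pools: near zero for
    unlinked markers, approaching 1 for markers much closer than the fragment size."""
    n = len(marker_a)
    mean_a, mean_b = sum(marker_a) / n, sum(marker_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(marker_a, marker_b)) / n
    var_a = sum((a - mean_a) ** 2 for a in marker_a) / n
    var_b = sum((b - mean_b) ** 2 for b in marker_b) / n
    return cov / sqrt(var_a * var_b) if var_a and var_b else 0.0

linked_a   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # toy presence calls across 10 pools
linked_b   = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]   # mostly co-retained with linked_a
unlinked_c = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]   # independent of linked_a
print(coretention(linked_a, linked_b), coretention(linked_a, unlinked_c))  # ~0.86, ~0.0
```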
The next-gen version of this proposed by the authors would make HAPPy pools as before but then type them by sequence sampling the pools. Given that a HAPPy pool contains many orders of magnitude less DNA than current next-gen library protocols require, they propose using whole-genome amplification to boost the DNA. Then each pool would be converted to a bar-coded sequencing library. The final typing would be performed by incorporating these reads into a shotgun assembly and then scoring each contig as present or absent in a pool. Elegant!
When would this mapping occur? One suggestion is to first generate a rough assembly using standard shotgun sequencing, as this improves the estimate of the genome size which in turn enables the HAPPy pools to be optimally constructed so that any given fragment will be in 50% of the pools. Alternatively, if a good estimate of the genome size is known the HAPPy pools could potentially be the source of all of the shotgun data (this is hinted at).
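The 50% figure follows from a simple Poisson argument, sketched below: if a pool holds g genome equivalents of random fragments, a given locus is present with probability roughly 1 - exp(-g), and g of about 0.7 gives roughly 50%. This is my gloss on the design logic, not a calculation from the paper.

```python
# Poisson sketch of the pool-size design target (my gloss, not from the paper).
from math import exp

def presence_probability(genome_equivalents_per_pool):
    """Chance a given locus is represented at least once in a pool (Poisson model)."""
    return 1 - exp(-genome_equivalents_per_pool)

for g in (0.3, 0.5, 0.7, 1.0):
    print(f"{g:.1f} genome equivalents per pool -> locus present in "
          f"{100 * presence_probability(g):.0f}% of pools")
```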
One possible variation to this approach would be to replace bar-coded libraries and WGA with Helicos sequencing, which can theoretically work on very small amounts of DNA. Fragmenting such tiny amounts would be one challenge to be overcome, and of course the Helicos generates much shorter, lower-quality reads than the other platforms. But, since these reads are primarily for building a physical map (or, in sequence terms driving to larger supercontigs), that may not be fatal.
If going with one of the other next-gen platforms (as noted in the previous post in this series, perhaps microarrays make sense as a readout), there is the question of input DNA. For example, mammalian genomes range in size from 1.73pg to 8.40pg. A lot of next-gen library protocols seem to call for more like 1-10ug of DNA, or about 6 logs more. The HAPPy paper's authors suggest whole-genome amplification, which is reasonable but could potentially introduce bias. In particular, it could be problematic to allow reads from amplified DNA to be the primary or even a major source of reads for the assembly. As I've noted before, other approaches such as molecular inversion probes might be useful for low amounts, but have not been demonstrated to my knowledge with picograms of input DNA. However, today I stumbled on two papers, one from the Max Planck Institute and one from Stanford, which use digital PCR to quantitate next-gen libraries and assert that this can assist in successfully preparing libraries from tiny amounts of DNA. It may also be possible to deal with this issue by attempting more than the required number of libraries, determining which built successfully by digital PCR and then pooling a sufficient number of successful libraries.
The desirable number of HAPPY libraries and the desired sequencing depth for each library are two topics not covered well in the paper, which is unfortunate. The number of libraries presumably affects both resolution and confidence in the map. Pretty much the entire coverage of this is the tail of one paragraph:
In the Illumina/Solexa system, DNA can be randomly sheared and amplified with primers that contain a 3 bp barcode. Using current instruments, reagents, and protocols, one Solexa "lane" generates ~120 Mb in ~3 million reads of ~40 bp. When each Solexa lane is multiplexed with 12 barcodes, for example, it will provide on average, ~10 Mb of sequence in ~250,000 reads for each sample. At this level of multiplexing, one Solexa instrument "run" (7 lanes plus control) would allow tag sequencing of 84 HAPPY samples. This means, one can finish 192 HAPPY samples in a maximum of three runs. New-generation sequencing combined with the barcode technique will produce innumerous amounts of sequences for assembly.
Changing the marker system from direct testing of sequence-tagged sites by PCR to sequencing-based sampling has an important implication, as discussed in the last post. If your PCR is working well, then if a pool contains a target it will come up positive. But with sequencing, there is a very real chance of not detecting a marker that is present in a pool. This probability will depend on the size of the target -- very large contigs will have very little chance of being missed, but as contigs get smaller their probability of being missed goes up. Furthermore, actual size won't be as important as the effective size: the amount of sequence which can be reliably aligned. In other words, two contigs might be the same length, but the one with the higher repeat content will have a smaller effective size and thus be more easily missed. These parameters can in turn be estimated from the actual data.
The actual size of the pool is a critical parameter as well, and the sampling depth (for a given haploid genome size) will largely determine how often a contig that is truly present in a pool nevertheless escapes detection.
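Here is a back-of-envelope model of that false negative problem (my own, not from the paper): treat the reads landing on a contig as Poisson with mean proportional to the contig's effective size. The per-pool read count is taken from the quoted paragraph above; the 1Gb effective genome size is an arbitrary assumption.

```python
# Back-of-envelope false negative model (my own, not from the paper).
from math import exp

def miss_probability(reads_per_pool, contig_effective_size, effective_genome_size):
    """Chance that a contig truly present in a pool is hit by zero reads (Poisson)."""
    expected_hits = reads_per_pool * contig_effective_size / effective_genome_size
    return exp(-expected_hits)

# ~250,000 reads per barcoded pool (from the quoted paragraph) and an assumed
# 1 Gb of effectively alignable genome:
for size in (2_000, 10_000, 50_000):
    p_miss = miss_probability(250_000, size, 1_000_000_000)
    print(f"{size // 1000} kb contig: missed in {100 * p_miss:.0f}% of pools containing it")
```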
In any case, the problem of false negatives must be addressed. One approach is to only map contigs which are unlikely to have ever been missed. However, that means losing the ability to map smaller contigs. Presumably there are clever computational approaches to either impute missing data or simply deal with it.
It should also be noted that HAPPy maps, like many physical mapping techniques, are likely to yield long-range haplotype information. Hence, even after sequencing one individual the approach will retain utility. Indeed, this seems to be the tack that Complete Genomics is taking to obtain this information for human genomes, though they call it Long Fragment Reads. It is worth noting that the haplotyping application has one clear difference from straight HAPPy mapping. In HAPPy mapping, the optimal pool size is one in which any given genome fragment is expected to appear in half the pools, which means pools of about 0.7X genome. But for haplotyping (and for trying to count copy numbers and similar structural issues), it is desirable to have the pools much smaller, as this information can only be obtained if a given region of the genome is haploid in that pool. Ideally, this would mean each fragment in its own pool (library), but realistically this will mean as small a pool size as one can make and still cover the whole genome in the targeted number of pools. Genomes of higher ploidies, such as many crops which are tetraploid, hexaploid or even octaploid, would probably require more pools with lower genomic fractions in order to resolve haplotypes.
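A small sketch of the pool-size tradeoff for phasing (again my own illustration): if each fragment lands in a pool with probability p, then among pools containing a given diploid region, the fraction that received only one of the two homologs is 2p(1-p)/(1-(1-p)^2), which approaches 1 as p shrinks.

```python
# Why haplotyping wants smaller pools than mapping (my own illustration).
def haploid_given_present(p):
    """Among pools containing a diploid region, fraction that hold only one homolog."""
    return 2 * p * (1 - p) / (1 - (1 - p) ** 2)

for p in (0.5, 0.2, 0.05):
    print(f"pool fraction {p:.2f}: {100 * haploid_given_present(p):.0f}% of the pools"
          " containing a region hold just one haplotype of it")
```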
In conclusion, HAPPy mapping comes close to my personal ideal of a purely in vitro mapping system which looks like a clever means of preparing next-gen libraries. The minimum and maximum distances resolvable are about 10-fold apart, so more than one set of HAPPy libraries is likely to be desirable for an organism. Typically this is two sizes, since the maximum fragment size is around 1Mb (and may be smaller from organisms with difficult to extract DNA). A key problem to resolve is that HAPPy pools contain single digit picograms of DNA. Amplification is a potential solution but may introduce bias; clever library preparation (or screening) may be another approach. An open problem is the best depth of coverage of the multiplexed HAPPy next-gen libraries. HAPPy can be used both for physical mapping and long-range haplotyping, though the fraction of genome in a pool will differ for these different applications.
Jiang Z, Rokhsar DS, & Harland RM (2009). Old can be new again: HAPPY whole genome sequencing, mapping and assembly. International Journal of Biological Sciences, 5(4), 298-303. PMID: 19381348
