Omics! Omics!: November 2009

Sunday, November 22, 2009

Targeted Sequencing Bags a Diagnosis

A nice complement to the paper (Ng et al) I detailed last week is one that actually came out just beforehand (Choi et al). Whereas the Ng paper used whole exome targeted sequencing to find the mutation for a previously unexplained rare genetic disease, the Choi et al paper used a similar scheme (though with a different choice of targeting platform) to find a mutation in a known disease gene in a patient, thereby diagnosing the patient.

The patient in question has a tightly interlocked pedigree (Figure 2), with two different consanguineous marriages shown. Put another way, this person could trace 3 paths back to one set of great-great-grandparents. Hence, they had quite a bit of DNA which was identical-by-descent, which meant that in these regions any low-frequency variant call could be safely ignored as noise. A separate scan with a SNP chip was used to identify such regions independently of the sequencing.
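To make the noise-filtering idea concrete, here is a minimal sketch of how calls inside identical-by-descent regions might be screened. It is my own illustration, not the pipeline from the paper: the `Call` record, the region tuples, the 0.8 cutoff and the coordinates are all hypothetical, and I am assuming "low-frequency" means a call supported by only a minority of the reads at that position.

```python
from typing import NamedTuple

class Call(NamedTuple):
    chrom: str
    pos: int
    alt_fraction: float  # fraction of reads supporting the non-reference allele

def in_any_region(call, regions):
    """True if the call lies inside one of the (chrom, start, end) regions."""
    return any(chrom == call.chrom and start <= call.pos <= end
               for chrom, start, end in regions)

def filter_calls(calls, ibd_regions, min_alt_fraction=0.8):
    """Keep calls outside IBD regions, plus near-homozygous calls inside them."""
    return [c for c in calls
            if not in_any_region(c, ibd_regions)
            or c.alt_fraction >= min_alt_fraction]

# Made-up interval standing in for a SNP-chip-defined homozygous stretch.
ibd_regions = [("chr7", 100_000, 200_000)]
calls = [
    Call("chr7", 150_000, 0.97),  # homozygous inside the IBD region: kept
    Call("chr7", 160_000, 0.30),  # apparent het inside the IBD region: dropped
    Call("chr2", 50_000, 0.45),   # het outside any IBD region: kept
]
kept = filter_calls(calls, ibd_regions)
```

The point of taking the regions from a separate SNP chip scan is that the filter does not depend on the same sequencing data it is cleaning up.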

The patient was a 5 month old male, born prematurely at 30 weeks and with "failure to thrive and dehydration". Two spontaneous abortions and a death of another premature sibling at day 4 also characterized this family; a litany of miserable suffering. Due to imbalances in the standard blood chemistry (I wish the reviewers had insisted on further explanation of these for those of us who don't frequent that world), a kidney defect was suspected but other causes (such as infection) were not excluded.

The exome capture was this time on the Nimblegen platform, followed by Illumina sequencing. This is not radically different from the Ng paper, which used Agilent capture and Illumina sequencing. At the moment Nimblegen & Agilent appear to be the only practical options for whole exome-scale capture, though there are many capture schemes published and quite a few available commercially. Lots of variants were found. One that immediately grabbed attention was a novel missense mutation which was homozygous and in a known chloride transporter, SLC26A3. This missense mutation (D652N) targets a position which is almost utterly conserved across the family, and makes a significant change in side chain (acid group to polar non-charged). Most importantly, SLC26A3 has already been shown to cause "congenital chloride-losing diarrhea" (CLD) when mutated in other positions. Clinical follow-up confirmed that fluid loss was through the intestines and not the kidneys.

One of the genetic diseases of the kidney that had been considered was Bartter syndrome, which the more precise blood chemistry did not match. Given that one patient had been suspected of Bartter but instead had CLD, the group screened 39 more patients with Bartter but lacking mutations in 4 different genes linked to this syndrome. 5 of these patients had homozygous mutations in SLC26A3, 2 of which were novel. 190 control chromosomes were also sequenced; none had mutations. 3 of the 5 patients with SLC26A3 mutations had further follow-up & confirmation of water loss through the gastrointestinal tract.

This study again illustrates the utility of targeted sequencing for clinical diagnosis of difficult cases. While a whole exome scan is currently in the neighborhood of $20K, more focused searches could be run far cheaper. The challenge will be in designing economical panels which allow scanning the most important genes at low cost. Presumably one could go through OMIM and find all diseases & syndromes which alter electrolyte levels and known causative gene(s). Such panels might be doable for perhaps as low as $1-5K per sample; too expensive for routine newborn screening but far better than an endless stream of tests. Of course, such panels would miss novel genes or really odd presentations, so follow-up of negative results with whole exome sequencing might be required. With newer sequencing platforms available, the costs for this may plummet to a few hundred dollars per test, which is probably on par with what the current screening of newborns for inborn errors runs. One impediment to commercial development in this field may well be the rapid evolution of platforms; companies may hesitate for fear of betting on a technology that will not last.

Of course, to some degree the distinction between the two papers is artificial. The Ng et al paper actually, as I noted, did diagnose some of their patients with known genetic disease. Similarly, the patients in this study who are now negative for known Bartter syndrome genes and for CLD would be candidates for whole exome sequencing. In the end, what matters is to make the right diagnosis for each patient so that the best treatment or supportive care can be selected.

Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloğlu A, Ozen S, Sanjad S, Nelson-Williams C, Farhi A, Mane S, & Lifton RP (2009). Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proceedings of the National Academy of Sciences of the United States of America, 106 (45), 19096-101 PMID: 19861545

Thursday, November 19, 2009

Three Blows Against the Tyranny of Expensive Experiments

Second generation sequencing is great, but one of its major issues so far is that the cost of one experiment is quite steep. Just looking at reagents, going from a ready-to-run library to sequence data is somewhere in the neighborhood of $10K-25K on 454, Illumina, Helicos or SOLiD (I'm willing to take corrections on these values, though they are based on reasonable intelligence). While in theory you can split this cost over multiple experiments by barcoding, that can be very tricky to arrange. Perhaps if core labs would start offering '1 lane of Illumina - Buy It Now!' on eBay the problem could be solved, but finding a spare lane isn't easy.

This issue manifests itself in other ways. If you are developing new protocols anywhere along the pipeline, your final assay is pretty expensive, making it challenging to work inexpensively. I've heard rumors that even some of the instrument makers feel inhibited in process development. It can also make folks a bit gun shy; Amanda heard first hand tonight from someone lamenting a project stymied under such circumstances. Even for routine operations, the methods of QC are pretty inexact so far, as they don't really test whether the library is any good, just whether some bulk property (size, PCRability, quantity) is within a spec. This huge atomic cost is also a huge barrier to utilization in a clinical setting; does the clinician really want to wait some indefinite amount of time until enough patient samples are queued to make the cost/sample reasonable?

Recently, I've become aware of three hopeful developments on this front. The first is the Polonator, which according to Kevin McCarthy has a consumable cost of only about $500 per run (post library construction). $500 isn't nothing to risk on a crazy idea, but it sure beats $10K. There aren't many Polonators around, but for method development in areas such as targeted capture it would seem like a great choice.

Today, another shoe fell. Roche has announced a smaller version of the 454 system, the GS Junior. While the instrument cost wasn't announced, it will supposedly generate 1/10th as much data (35+Mb from 100K reads with 400 Q20 bases) for the same cost per basepair, suggesting that the reagent cost for a run will be in the neighborhood of $2.5K. Pricier than the Polonator, but rather intriguing. This is a system that may have a good chance to start making clinical inroads; $2.5K is a bit steep for a diagnostic but not ridiculous -- or you simply need to multiplex fewer samples to get the cost per sample decent. The machine is going to boast 400+bp reads, playing to the current comparative strength of the 454 chemistry. While I doubt anyone would buy such a machine solely as an upfront QC for SOLiD or Illumina, with some clever custom primer design one probably could make libraries usable on 454 plus one other platform.

It's an especially auspicious time for Roche to launch their baby 454, as Pacific Biosciences released some specs through GenomeWeb's In Sequence, and from what I've been able to scrounge about (I can't quite talk myself into asking for a subscription) this is going to put some real pressure across the market, but particularly on 454. The key specs I can find are a per run cost of $100 which will get you approximately 25K-30K reads of 1.5Kb each -- or around 45Mb of data. It may also be possible to generate 2X the data for nearly the same cost; apparently the reagents packed with one cell are really good for two runs in series. Each cell takes 10-15 minutes to run (at least in some workflows) and the instrument can be loaded up with 96 of them to be handled serially. This is a similar ballpark to what the GS Junior is being announced with, though with fewer reads but longer read lengths. I haven't been able to find any error rate estimates or the instrument cost. I'll assume, just because it is new and single molecule, that the error rate will give Roche some breathing room.

But in general, PacBio looks set to really grab the market where long reads, even noisy ones, are valuable. One obvious use case is transcriptome sequencing to find alternative splice forms. Another would be to provide 1.5Kb scaffolds for genome assembly; what I've found also suggests PacBio will offer a 'strobe sequencing' mode which is akin to Helicos' dark filling technology, which is a means to get widely spaced sequence islands. This might provide scaffolding information in much larger fragments. 10Kb? 20Kb? And again, though you probably wouldn't buy the machine just for this, at $100/run it looks like a great way to QC samples going into other systems. Imagine checking a library after initial construction, then after performing hybridization selection and then after another round of selection! After all, the initial PacBio instrument won't be great for really deep sequencing. It appears it would be $5K-10K to get approximately 1X coverage of a mammalian genome -- but likely with a high error rate.
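As a sanity check on the 1X estimate, here is a back-of-the-envelope sketch using the rumored specs quoted above ($100 per run, 25K-30K reads of 1.5Kb). The 3,000 Mb genome size is my own round-number assumption, and the function names are just for illustration.

```python
import math

def run_yield_mb(reads, read_length_bp):
    """Raw yield of one run in megabases."""
    return reads * read_length_bp / 1e6

def cost_for_coverage(genome_mb, coverage, yield_mb_per_run, cost_per_run):
    """Whole runs needed for the target coverage, and their total reagent cost."""
    runs = math.ceil(genome_mb * coverage / yield_mb_per_run)
    return runs, runs * cost_per_run

low_yield = run_yield_mb(25_000, 1_500)    # 37.5 Mb per run
high_yield = run_yield_mb(30_000, 1_500)   # 45.0 Mb per run

GENOME_MB = 3_000  # assumed size of a mammalian genome, in Mb

runs, cost = cost_for_coverage(GENOME_MB, 1, high_yield, 100)
runs_low, cost_low = cost_for_coverage(GENOME_MB, 1, low_yield, 100)
print(f"{runs}-{runs_low} runs, ${cost:,}-${cost_low:,} for 1X")
```

Both ends land inside the $5K-10K range, before counting sample prep or the possibility of squeezing two runs out of one cell's reagents.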

The ability to easily sequence 96 samples at a time (though it isn't clear what sample prep will entail) does suggest some interesting possibilities. For example, one could do long survey sequencing of many bacterial species, with each well yielding 10X coverage of an E.coli-sized genome (a lot of bugs are this size or smaller). The data might be really noisy, but for getting a general lay-of-the-land it could be quite useful -- perhaps the data would be too noisy to tell which genes were actually functional vs. decaying pseudogenes, but you would be able to ask "what is the upper bound on the number of genes of protein family X in genome Y". If you really need high quality sequence, then a full run (or targeted sequencing) could follow.

At $100 per experiment, the sagging Sanger market might take another hit. If a quick sample prep to convert plasmids to usable form is released, then ridiculous oversampling (imagine 100K reads on a typical 1.5Kb insert in pUC scenario!) might overcome a high error rate.
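How much oversampling helps can be sketched with a simple majority-vote model. This is a toy calculation under loud assumptions: errors are independent between reads, every wrong read agrees on the same wrong base (a pessimistic case), and the 15% per-read error rate is a number I made up, since no error rate has been released.

```python
from math import comb

def consensus_error(depth, per_read_error):
    """P(more than half of `depth` independent reads are wrong at one position)."""
    first_majority = depth // 2 + 1
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error)**(depth - k)
        for k in range(first_majority, depth + 1)
    )

HYPOTHETICAL_ERROR = 0.15  # made-up per-read error rate, for illustration only

for depth in (1, 3, 5, 25, 101):  # odd depths, so there are no ties
    print(depth, consensus_error(depth, HYPOTHETICAL_ERROR))
```

Under this model the consensus error falls off quickly with depth, which is the whole appeal of cheap oversampling. The catch is the independence assumption: systematic errors that recur at the same position in every read would not average out, no matter how deep you go.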

One interesting impediment which PacBio has acknowledged is that they won't be able to ramp up instrument production as quickly as they might like and will be trying to place (ration) instruments strategically. I'm hoping at least one goes to a commercial service provider or a core lab willing to solicit outside business, but I'm not going to count on it.

Will Illumina & Life Technologies (SOLiD) try to create baby sequencers? Illumina does have a scheme to convert their array readers to sequencers, but from what I've seen these aren't expected to save much on reagents. Life does own the VisiGen technology, which is apparently similar to PacBio's but hasn't yet published a real proof-of-concept paper -- at least that I could find; their key patent has issued -- reading material for another night.

Tuesday, November 17, 2009

Decode -- Corpse or Phoenix?

The news that Decode has filed for bankruptcy is a sad milestone in the history of genomics companies. Thus falls either the final or penultimate human gene mapping company, with everyone else having either disappeared entirely or exited that business. A partial list would include Sequana, Mercator, Myriad, Collaborative Research/Genome Therapeutics, Genaera and (of course) Millennium. I'm sure I'm missing some others. The one possible survivor I can think of is Perlegen, though their website is pretty bare bones, suggesting they have exited as well.

The challenge all of these companies faced, and rarely beat, was how to convert mapping discoveries into a cash stream which could pay for all that mapping. Myriad could be seen as the one success, having generated the controversial BRCA tests from their data, but (I believe) they no longer are actively looking. Instead, new tests are in-licensed from academics.

Most other companies shed their genomics efforts as part of becoming product companies; the real money is in therapeutics. Mapping turned out to be a weak contributor to that value stream. A major problem is that mapping information rarely led to a clear path to a therapeutic; too many targets nicely validated by genetics were complete head-scratchers as to how to create a therapeutic. Not that folks didn't try; Decode even in-licensed a drug and acquired all the pieces for a full drug development capability.

Of course, perhaps Decode's greatest notoriety came from their deCodeMe DTC genetic testing business. Given the competition & controversy in this field, that was unlikely to save them. The Icelandic financial collapse I think did them some serious damage as well. That's a reminder that companies, regardless of how they are run, sometimes have their fate channeled by events far beyond their control. A similar instance was the loss of Lion's CFO in the 9/11 attacks; he was soliciting investors at the WTC that day. The post-9/11 deflation of the stock market definitely crimped a lot of money-losing biotechs' plans for further fund raising.

Bankruptcies were once very rare for biotech, but quite a few have been announced recently. The old strategy of selling off the company at fire sale prices seems to be less in style these days; assets are now being sold as part of the bankruptcy proceedings. Apparently, this and perhaps other functions will continue. Bankruptcy in this case is a way of shedding incurred obligations viewed as nuisances; anyone betting on another strategy by buying the stock is out of luck.

Personally, I wish that the genetic database and biobanks which deCode has created could be transferred to an appropriate non-profit such as the Sanger. I doubt much of that data will ever be convertible into cash, particularly at the scale most investors are looking for. But a non-profit could extract the useful information and get it published; publishing was deCode's forte, but I doubt they've mined everything that can be mined.

Sunday, November 15, 2009

Targeted Sequencing Bags a Rare Disease

Nature Genetics on Friday released the paper from Jay Shendure, Debra Nickerson and colleagues which used targeted sequencing to identify the damaged gene in a rare Mendelian disorder, Miller syndrome. The work had been presented at least in part at recent meetings, but now all of us can digest it in its entirety.

The impressive economy of this paper is that they targeted (using Agilent chips) less than 30Mb of the human genome, which is less than 1%. They also worked with very few samples; only about 30 cases of Miller Syndrome have been reported in the literature. While I've expressed some reservations about "exome sequencing", this paper does illustrate why it can be very cost effective, and my objections (perhaps not made clear enough before) are more a worry about being too restricted to "exomes" and less about targeting.

Only four affected individuals (two siblings and two individuals unrelated to anyone else in the study) were sequenced, each at around 40X coverage of the targeted regions. Since Miller is so vanishingly rare, the causative mutations should be absent from samples of human diversity such as dbSNP or the HapMap, so these were used as a filter. Non-synonymous (protein-altering) mutations, splice site mutations & coding indels were considered as candidates. Both dominant and recessive models were considered. Combining the data from both siblings, 228 candidate dominant genes and 9 recessive ones fell out. Looking then to the unrelated individuals zeroed in on a single gene, DHODH, under the recessive model (but 8 in the dominant model). Using a conservative statistical model, the odds of finding this by chance were estimated at 1.5 x 10^-5.
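The filtering logic is simple enough to sketch. What follows is my own toy version, not the authors' code: the tuple layout, gene names and variant IDs are invented, and it simply intersects candidates across every person rather than pooling the siblings first as the paper did. A gene qualifies under the dominant model with one novel, protein-altering allele and under the recessive model with two.

```python
def candidate_genes(variants_by_person, known_variants, model):
    """Genes carrying enough novel, protein-altering alleles in every person.

    variants_by_person: {person: [(gene, variant_id, effect, copies), ...]}
        copies is 1 for a heterozygous call and 2 for a homozygous one.
    known_variants: IDs already catalogued (standing in for dbSNP/HapMap).
    model: "dominant" needs >= 1 qualifying allele per gene, "recessive" >= 2.
    """
    needed = {"dominant": 1, "recessive": 2}[model]
    qualifying_effects = {"nonsynonymous", "splice_site", "coding_indel"}
    per_person = []
    for variants in variants_by_person.values():
        allele_counts = {}
        for gene, variant_id, effect, copies in variants:
            if variant_id in known_variants or effect not in qualifying_effects:
                continue  # filtered: common variant or not protein-altering
            allele_counts[gene] = allele_counts.get(gene, 0) + copies
        per_person.append({g for g, n in allele_counts.items() if n >= needed})
    return set.intersection(*per_person) if per_person else set()

# Invented toy data; GENE1 plays the role of a compound-heterozygous culprit.
known = {"rs_known_1"}
people = {
    "sib_A": [("GENE1", "v1", "nonsynonymous", 1), ("GENE1", "v2", "nonsynonymous", 1),
              ("GENE2", "v3", "splice_site", 1), ("GENE3", "rs_known_1", "nonsynonymous", 2)],
    "sib_B": [("GENE1", "v1", "nonsynonymous", 1), ("GENE1", "v2", "nonsynonymous", 1),
              ("GENE2", "v3", "splice_site", 1)],
    "unrelated_1": [("GENE1", "v4", "coding_indel", 2), ("GENE2", "v5", "nonsynonymous", 1),
                    ("GENE4", "v6", "synonymous", 2)],
}
dominant = candidate_genes(people, known, "dominant")
recessive = candidate_genes(people, known, "recessive")
```

Even in this toy, the dominant model leaves more candidates standing than the recessive one, which is the same pattern as the 8 vs. 1 in the real data.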

An interesting curve was thrown by nature. If predictions were made as to whether mutations would be damaging, then DHODH was excluded as a candidate gene under a recessive model. Both siblings carried one allele (G605A) predicted to be neutral but another allele predicted to be damaging.

Another interesting curve is a second gene, DNAH5, which was a candidate considering only the siblings' data but ruled out by the other two individuals' data. However, this gene is already known to be linked to a Mendelian disorder. The two siblings had a number of symptoms which do not fit with any other Miller case -- and well fit the symptoms of DNAH5 mutation. So these two individuals have two rare genetic diseases!

Getting back to DHODH, is it the culprit in Miller? Sequencing three further unrelated patients found them all to be compound heterozygotes for mutations predicted to be damaging. So it becomes reasonable to infer that a false prediction of non-damaging was made for G605A. Sequencing of DHODH in parents of the affected individuals confirmed that each was a carrier, ruling out DHODH as a causative gene under a dominant model.

DHODH is known to encode dihydroorotate dehydrogenase, which catalyzes a biochemical step in the de novo synthesis of pyrimidines. This is a pathway targeted in some cancer chemotherapies, with the unfortunate result that some individuals are exposed to these drugs in utero -- and these persons manifest symptoms similar to Miller syndrome. Furthermore, another genetic disease (Nager syndrome) has great overlap in symptoms with Miller -- but sequencing of DHODH in 12 unrelated patients failed to find any coding mutations.

The authors point to the possible impact of this approach. They note that there are 7,000 diseases which affect fewer than 200K patients in the U.S. (a widely used definition of rare disease), but in aggregate this is more than 25M persons. Identifying the underlying mutations for a large fraction of these diseases would advance our understanding of human biology greatly, and with a bit of luck some of these mutations will suggest practical therapeutic or dietary approaches which can ameliorate the disease.

Despite the success here, they also underline opportunities for improvement. First, in some cases variant calling was difficult due to poor coverage in repeated regions. Conversely, some copy number variation manifested itself in false positive calls of variation. Second, the SNP databases for filtering will be most useful if they are derived from similar populations; if studying patients with a background poorly represented in dbSNP or HapMap then those databases won't do.

How economical a strategy would this be? Whole exome sequencing on this scale can be purchased for a bit under $20K/individual; to try to do this by Sanger would probably be at least 25X that. So whole exome sequencing of the 4 original individuals would be less than $100K for sequencing (but clearly a bunch more for interpretation, sample collection, etc). The follow-up sequencing would add a bit, but probably less than one exome's worth of sequencing. Even if a study turned up a lot of candidate variants, smaller scale targeted sequencing can be had for $5K or less per sample. Digging into the methods, the study actually used two passes of array capture -- the second to clean up what wasn't captured well by the first array design & to add newer gene predictions. This is a great opportunity to learn from these projects -- the array designs can keep being refined to provide even coverage across the targeted genes. And, of course, as the cost per base of the sequencing portion continues its downwards slide this will get even more attractive -- or possibly simply be displaced by really cheap whole genome sequencing. If the cost of the exome sequencing can be approximately halved, then perhaps a project similar to this could be run for around $100K.
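For anyone who wants to poke at the sequencing arithmetic, here it is as a few lines of Python. The numbers are the rough upper bounds from the paragraph above; the variable names are mine, and the non-sequencing costs (interpretation, sample collection) are deliberately left out since I have no figures for them.

```python
EXOME_COST = 20_000       # upper bound per individual ("a bit under $20K")
SANGER_MULTIPLIER = 25    # "at least 25X" to do the same by Sanger
INDIVIDUALS = 4           # the four originally sequenced patients

exome_total = INDIVIDUALS * EXOME_COST            # exome sequencing, upper bound
sanger_total = exome_total * SANGER_MULTIPLIER    # lower bound for a Sanger equivalent
with_followup_max = exome_total + EXOME_COST      # follow-up capped at one exome's worth

print(f"exomes: under ${exome_total:,}")
print(f"exomes plus follow-up: under ${with_followup_max:,}")
print(f"Sanger equivalent: at least ${sanger_total:,}")
```

So even with the follow-up sequencing thrown in, the sequencing bill stays under the $100K mark, against something on the order of $2M by Sanger.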

So, if 7,000 diseases could each be examined at $100K/disease, that would come out to $700M -- hardly chump change. This underlines the huge utility of getting sequencing costs down another order of magnitude. At $1000/genome, the sequencing costs of the project would stop grossly overshadowing the other key areas - sample collection & data interpretation. If the total cost of such a project could be brought down closer to $20K, then now we're looking at $140M to investigate all described rare genetic disorders. That's not to say it shouldn't be done at $700M or even several times that, but ideally some of the money saved by cheaper sequencing could go to elucidating the biology of the causative alleles such a campaign would unearth, because certainly many of them will be much more enigmatic than DHODH.

Sarah B. Ng, Kati J. Buckingham, Choli Lee, Abigail W. Bigham, Holly K. Tabor, Karin M. Dent, Chad D. Huff, Paul T. Shannon, Ethylin Wang Jabs, Deborah A. Nickerson, Jay Shendure, & Michael J. Bamshad (2009). Exome sequencing identifies the cause of a mendelian disorder. Nature Genetics. doi:10.1038/ng.499

Thursday, November 12, 2009

A 10,201 Genomes Project

With valuable information emerging from the 1000 (human) genomes project and now a proposal for a 10,000 vertebrate genome project, it's well past time to expose to public scrutiny a project I've been spitballing for a while, which I now dub the 10,201 genomes project. Why that? Well, first it's a bigger number than the others. Second, it's 101 squared.

Okay, perhaps my faithful assistant is swaying me, but I still think it's a useful concept, even if for the time being it must remain a gehunden experiment. All kidding aside, the goal would be to sequence the full breadth of caninity with the prime focus on elucidating the genetic machinery of mammalian morphology. In my biological world, that would be more than enough to justify such a project once the price tag comes down to a few million. With some judicious choices, some fascinating genetic influences on complex behaviors might also emerge. And yes, there is a possibility of some of this feeding back to useful medical advances, though one should be honest to say that this is likely to be a long and winding road. It really devalues saying something will impact medicine when we claim every project will do so.

The general concept would be to collect samples from multiple individuals of every known dog breed, paying attention to important variation within breed standards. It would also be valuable to collect well-annotated samples from individuals who are not purebred but exhibit interesting morphology. For example, I've met a number of "labradoodles" (Labrador retriever x poodle) and they exhibit a wide range of sizes, coat colors and other characteristics -- precisely the fodder for such an experiment. In a similar manner, it is said that the same breed from geographically distant breeders may be quite distinct, so it would be valuable to collect individuals from far-and-wide. But going beyond domesticated dogs, it would be useful to sequence all the wild species as well. With genomes at $1K a run, this would make good sense. Of particular interest for a non-dog genome is the case of lines of foxes, which have been bred over just a half century into a very docile line and a second selected for aggressive tendencies.

What realistically could we expect to find? One would expect a novel gene, as is the case with short legged breeds, to leap out. Presumably regions which have undergone selective sweeps would be spottable as well and linkable to traits. A wealth of high-resolution copy number information would certainly emerge.

Is it worth funding? Well, I'm obviously biased. But already the 10,000 vertebrate genome project has kicked up some dust from some who are disappointed that the genomics community has not had "an inordinate fondness for beetles" (only one sequenced so far). Genome sequencing is going to get much cheaper, but never "too cheap to meter". De novo projects will always be inherently more expensive due to more extensive informatics requirements -- the first annotation of the genome is highly valuable but requires extensive effort. I too am disappointed that a greater sampling of arthropods hasn't been sequenced -- and it's hard to imagine folks in the evo-devo world being fond of this point either.

It's hard for me to argue against sequencing thousands of human germlines to uncover valuable medical information or to sequence tens of thousands of somatic cancer genomes for the same reason. But, even so I'd hate to see that push out funding for filling in more information about the tree of life. Still, do we really need 10,000 vertebrate genomes in the near future or 10,201 dog genomes? If the trade for doing only 5,000 additional vertebrates is doing 5,000 diverse invertebrates, I think that is hard to argue against. Depth vs. breadth will always be a challenging call, but perhaps breadth should be favored a bit more -- at least once I'm funded for my ultra-deep project!

Wednesday, November 11, 2009

A call for new technological minds for the genome sequencing instrument fields

There's a great article in the current Nature Biotechnology (alas, you'll need a subscription to read the full text) titled "The challenges of sequencing by synthesis", detailing the challenges around the current crop of sequencing-by-synthesis instruments. The paper was written by a number of the PIs on grants for $1K genome technology.

While there is one short section on the problem of sample preparation, the heart of the paper can be found in the other headings:

surface chemistry
fluorescent labels
the enzyme-substrate system
optics
throughput versus accuracy
read-length and phasing limitations

Each section is tightly written and well-balanced, with no obvious playing of favorites or bashing of anti-favorites present. Trade-offs are explored & the dreaded term (at least amongst scientists) "cost models" shows up; indeed there is more than a little bit of a nod to accounting -- but if sequencing is really going to be $1K/person on an ongoing basis the beans must be counted correctly!

I won't try to summarize much in detail; it really is hard to distill such a concentrated draught any further. Most of the ideas presented as possible solutions can be viewed as evolutionary relative to the current platforms, though a few exotic concepts are floated as well (such as synthetic aperture optics). It is noteworthy that an explicit goal of the paper is to summarize the problem areas so that new minds can approach the problem; as implied by the section title list above this is clearly a multi-discipline problem. It does somewhat raise the question of whether Nature Biotechnology, a journal I am quite fond of, was the best place for this. If new minds are desired, perhaps Physical Review Letters would have been better. But that's a very minor quibble.

Fuller CW, Middendorf LR, Benner SA, Church GM, Harris T, Huang X, Jovanovich SB, Nelson JR, Schloss JA, Schwartz DC, & Vezenov DV (2009). The challenges of sequencing by synthesis. Nature biotechnology, 27 (11), 1013-23 PMID: 19898456

Tuesday, November 10, 2009

Occult Genetic Disease

A clinical aside by Dr. Steve over at Gene Sherpas piqued my interest recently. He mentioned a 74 year old female patient of his with lung difficulties who turned out positive both by the sweat test and genetic testing for cystic fibrosis. One of her grandchildren had CF, which appears to have been a key hint in this direction. This anecdote was particularly striking to me because I had recently finished Atul Gawande's "Better" (highly recommended), which had a chapter on CF. Even today, a well treated CF patient living to such an age would be remarkable; when this woman was born, living to 20 would be lucky. Clearly she either has a very modest deficit or some interesting modifier or such (late onset?) which allowed her to live to this age.

Now, if this patient didn't have any CF in her family, would one test for this? Probably not. But thinking more broadly, will this scenario be repeated frequently in the future when complete genome sequencing becomes a routine part of large numbers of medical files? Clearly we will have many "variants of unknown significance", but will we also find many cases of occult (hidden) genetic disease in which a patient shows clinical symptoms (but perhaps barely so)? Having a sensitive and definitive phenotypic test will assist this greatly; showing excess saltiness of sweat is pretty clear.

From a clinical standpoint, many of these patients may be confusing -- if someone is nearly asymptomatic should they be treated? But from a biology standpoint, they should prove very informative by helping us define the biological thresholds of disease or by uncovering modifiers. Even more enticing would be the very small chance of finding examples of partial complementation -- cases where two defective alleles somehow work together to generate enough function. One example I've thought of (admittedly a bit far-fetched, but not total science fiction) would be two alleles which each produce a protein subject to instability but when heterodimerized stabilize the protein just enough.
