I was looking through part of my collection of papers using Illumina sequencing and discovered an unpleasant surprise: more than one does not seem to state the read length used in the experiment. While to some this may seem trivial, I have a couple of interests. First, it's useful for estimating what can be done with the technology, and second, since read lengths have been increasing, it is an interesting guesstimate of when an experiment was done. Of course, there are lots of reasons to pick read length carefully -- the shorter the length, the sooner the instrument can be turned over to another experiment. Indeed, a recent paper estimates that for RNA-Seq, IF you know all the transcript isoforms and are interested only in measuring transcript levels, then 20-25 nucleotides is quite sufficient (they didn't, for example, discuss the ideal length for mutation/SNP discovery). Of course, that's a whopping "IF", particularly for the sorts of things I'm interested in.
Now in some cases you can back-estimate the read length using the given statistics on the number of mapped reads and total mapped nucleotides, though I'm not even sure these numbers reliably show up in papers. I'm sure to some authors & reviewers they are tedious numbers of little use, but I disagree. Actually, I'd love to see each paper show (in the supplementary materials) its error statistics by read position, because the evolution of those plots would be interesting to watch. Plus, any lab not routinely monitoring this plot is foolish -- not only would a change reveal important quality control information, but the plot also serves as an important reminder to consider quality in how you are using the data. It's particularly surprising that the manufacturers do not have such plots prominently displayed on their websites, though of course those would be suspected of being cherry-picked. One I did see from a platform supplier had a horribly chosen (or perhaps deviously chosen) scale for the Y-axis, so that the interesting information was so compressed as to be nearly useless.
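As a minimal illustration of that back-estimate (the numbers below are hypothetical placeholders, not figures from any particular paper):

```python
# Back-of-the-envelope read length estimate from two commonly reported
# summary statistics. Both values here are invented for illustration.
mapped_reads = 12_000_000          # "number of mapped reads"
mapped_nucleotides = 432_000_000   # "total mapped nucleotides"

estimated_read_length = mapped_nucleotides / mapped_reads
print(f"Estimated read length: ~{estimated_read_length:.0f} nt")  # ~36 nt
```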
I should have a chance in the very near future to take a dose of my own prescription. On writing this, it occurs to me that I am unaware of widely-available software to generate the position-specific mismatch data for such plots. I guess I just gave myself an action item!
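In the interim, here is a rough sketch of what such a tool might look like, assuming a BAM file whose alignments carry MD tags (so pysam can recover the reference base at each aligned position); pysam itself and the file name are my assumptions, not anything used in the papers discussed here:

```python
"""Sketch: per-cycle mismatch rate from an aligned BAM (assumes MD tags)."""
from collections import defaultdict

import pysam  # assumed to be installed


def per_cycle_mismatch_rate(bam_path, max_reads=1_000_000):
    mismatches = defaultdict(int)  # sequencing cycle -> mismatch count
    aligned = defaultdict(int)     # sequencing cycle -> aligned-base count
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for i, read in enumerate(bam):
            if i >= max_reads:
                break
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            # with_seq=True requires the MD tag; mismatched reference bases
            # come back as lowercase letters.
            for qpos, rpos, ref_base in read.get_aligned_pairs(with_seq=True):
                if qpos is None or rpos is None:
                    continue  # skip insertions, deletions and soft clips
                # Report positions in original sequencing order
                # (this sketch ignores hard clipping).
                cycle = (read.query_length - 1 - qpos) if read.is_reverse else qpos
                aligned[cycle] += 1
                if ref_base.islower():
                    mismatches[cycle] += 1
    return {c: mismatches[c] / aligned[c] for c in sorted(aligned)}


if __name__ == "__main__":
    for cycle, rate in per_cycle_mismatch_rate("example.bam").items():
        print(cycle + 1, f"{rate:.4f}")
```

Plotting cycle versus rate would give exactly the sort of error-by-position curve argued for above.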
Monday, December 28, 2009
Friday, December 18, 2009
Nano Anglerfish or Feejee Mermaids?
A few months ago I blogged enthusiastically about a paper in Science describing an approach to deorphan enzymes in parallel. Two anonymous commenters were quite derisive, claiming the chemistry for generating labeled metabolites in the paper was impossible. Now Science's editor Bruce Alberts has published an expression of concern, which cites worries over the chemistry as well as the authors' failure to post promised supporting data to their website and their changing stories as to how the work was done.
The missing supporting data hits a raw nerve. I've been frustrated on more than one occasion whilst reviewing a paper that I couldn't access the supplementary data, and have certainly encountered this as a reader as well. I've sometimes meekly protested as a reviewer; in the future I resolve to consider this automatic grounds for "needs major revision". Even if the mistake is honest, it means data considered important is unavailable for consideration. Given modern publications with data which is either too large to print or simply incompatible with paper, "supplementary" data is frequently either central to the paper or certainly just off center.
This controversy also underscores a challenge I have faced with many papers as a reviewer. To be quite honest, I'm utterly unqualified to judge the chemistry in this paper -- but feel quite qualified to judge many of the biological aspects. I have received for review papers with this same dilemma; parts I can critique and parts I can't. The real danger is if the editor inadvertently picks reviewers who all share the same blind spot. Of course, in an ideal world a paper would always go to reviewers capable of vetting all parts of it, but with many multidisciplinary papers that is unlikely to happen. However, it also suggests a rethink of the standard practice of assigning three reviewers per paper -- perhaps each topic area should be covered by three qualified reviewers (of course, the reviewers would need to honestly declare the limits of their expertise -- and not at review deadline time, when it is too late to find supplementary reviewers!).
But it is a mistake to think that peer review can ever be a perfect filter on the literature. It just isn't practical to go over every bit of data with a fine-toothed comb. A current example illustrates this: a researcher has been accused of faking multiple protein structures. While some suspicion was raised when other structures of the same molecule didn't agree, the smoking gun is that the structures have systematic errors in how the atoms are packed. Is any reviewer of a structure paper really going to check all the atomic packing details? At some point, the best defense against scientific error & misconduct is to allow the entire world to scrutinize the work.
One of my professors in grad school had us first-year students go through a memorable exercise. The papers assigned one week were in utter conflict with each other. We spent the entire discussion time trying to finesse how they could both be right -- what was different about the experimental procedures and how issues of experiment timing might explain the discrepancies. At the end, we asked what the resolution was, and were told "It's simple -- the one paper is a fraud". Once we knew this, we went back and couldn't believe we had believed anything -- nothing in the paper really supported its key conclusion. How had we been so blind before? A final coda to this is that the fraudulent paper is the notorious uniparental mouse paper -- and of course cloning of mice turns out to actually be possible. Not, of course, by the methods originally published, and indeed at that time (mid 1970s) it would have been well nigh impossible to actually prove that a mouse was cloned.
With that in mind, I will continue to blog here about papers I don't fully understand. That is one bit of personal benefit for me -- by exposing my thoughts to the world I invite criticism and will sometimes be shown the errors in my thinking. It never hurts to be reminded that skepticism is always useful, but I'll still take the risk of occasionally being suckered by P.T. Barnum, Ph.D. This is, after all, a blog and not a scientific journal. It's meant to be a bit noisy and occasionally wrong -- I'll just try to keep the mean on the side of being correct.
Thursday, December 17, 2009
A Doublet of Solid Tumor Genomes
Nature this week published two papers, each describing the complete sequencing of a cancer cell line (small cell lung cancer (SCLC) line NCI-H209 and melanoma line COLO-829) along with a "normal" cell line from the same individual. I'll confess a certain degree of disappointment at first, as these papers are not rich in the information of greatest interest to me, but they have grown on me. Plus, it's rather churlish to complain when I have nothing comparable to offer myself.
Both papers share a good deal of structure, perhaps because their author lists overlap substantially, including the same first author. However, technically they are quite different. The melanoma sequencing used the Illumina GAII, generating 2x75 paired-end reads supplemented with 2x50 paired-end reads from 3-4 Kb inserts, whereas the SCLC paper used 2x25 mate-pair SOLiD libraries with inserts between 400 and 3000 bp.
The papers estimate the false positive and false negative rates for the detection of various mutation classes, in comparison to Sanger data. For single base pair substitutions on the Illumina platform in the melanoma sample, 88% of previously known variants were found and 97% of a sample of 470 newly found variants were confirmed by Sanger. However, for small insertions/deletions (indels) there was both less data and much less success. Only one small deletion was previously known, a 2 bp deletion which is key to the biology. This was not found by the automated alignment and analysis, though reads containing this indel could be found in the data. A sample of 182 small indels was checked by Sanger and only 36% were confirmed. Of the large rearrangements tested, 75% were confirmed by PCR.
The statistics for the SOLiD data in SCLC were comparable: 76% of previously known single nucleotide variants were found and 97% of newly found variants were confirmed by Sanger. Two small indels were previously known and neither was found; conversely, only 25% of predicted indels were confirmed by Sanger. All of the large rearrangements tested by PCR validated. So overall, both platforms do well for detecting rearrangements and substitutions and are very weak for small indels.
The overall mutation hauls were large after filtering out variants found in the matched normal cell lines: 22,910 substitutions for the SCLC line and 33,345 in the melanoma line. Both of these samples reflect serious environmental abuse; melanomas often arise from sun exposure, and the particular cancer morphology the SCLC line is derived from is characteristic of smokers (the smoking history of the patient was unknown). Both lines showed mutation spectra in agreement with what is previously known about these environmental insults. 92% of C>T single substitutions occurred at the second base of a pyrimidine dimer (CC or TC dinucleotides). CC>TT double substitutions were also skewed in this manner. CpG dinucleotides are also known to be hotspots and showed elevated mutation frequencies. Transcription-coupled repair fixes the transcribed strand more efficiently than the non-transcribed strand, and in concordance with this, in transcribed regions there was nearly a 2:1 bias of C>T changes towards the non-transcribed strand. However, the authors state (though I still haven't quite figured out the logic) that transcription-coupled repair can account for only a third of the bias, and suggest that another mechanism, previously suspected but not characterized, is at work. One final consequence of transcription-coupled repair is that the more highly a gene is expressed in COLO-829, the lower its mutational burden. A bias of mutations towards the 3' end of transcribed regions was also observed, perhaps because 5' ends are transcribed at higher levels (due to abortive transcription). A transcribed-strand bias was also seen in G>T mutations, which may reflect oxidative damage.
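To make that dipyrimidine bookkeeping concrete, here is a toy sketch of how one might classify C>T substitutions by whether the mutated cytosine has a pyrimidine immediately 5' of it; the context strings and variants below are invented for illustration, not data from either paper:

```python
# Toy classifier for the UV signature: does a C>T substitution sit at the
# second base of a dipyrimidine (i.e. is the preceding reference base C or T)?
PYRIMIDINES = {"C", "T"}


def at_dipyrimidine(ref_context, mut_index):
    """ref_context: reference bases around the variant (5'->3');
    mut_index: index of the mutated C within ref_context."""
    if mut_index == 0:
        return False  # no 5' neighbour available in this context string
    return (ref_context[mut_index] == "C"
            and ref_context[mut_index - 1] in PYRIMIDINES)


# (reference context, index of mutated base, ref allele, alt allele) -- invented examples
variants = [
    ("ACCGT", 2, "C", "T"),  # CC context -> UV-type
    ("ATCGA", 2, "C", "T"),  # TC context -> UV-type
    ("AGCTA", 2, "C", "T"),  # GC context -> not a dipyrimidine
]

ct = [(ctx, i) for ctx, i, ref, alt in variants if ref == "C" and alt == "T"]
uv_like = sum(at_dipyrimidine(ctx, i) for ctx, i in ct)
print(f"{uv_like}/{len(ct)} C>T changes at the second base of a dipyrimidine")
```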
An additional angle on mutations in the COLO-829 melanoma line is offered by the observation of copy-neutral loss of heterozygosity (LOH) in some regions. In other words, one copy of a chromosome was lost but then replaced by a duplicate of the remaining copy. This analysis is enabled by having the sequence of the normal DNA to identify germline heterozygosity. Interestingly, in these regions heterozygous mutations outnumber homozygous ones, marking that those substitutions occurred after the reduplication event. 82% of the C>T mutations in these regions show the hallmarks of being early mutations, suggesting that much of the later mutation load accumulated after the melanoma metastasized and was therefore removed from ultraviolet exposure.
In a similar manner, there is a rich amount of information in the SCLC mutational data. I'll skip over a bunch to hit the evidence for a novel transcription-coupled repair pathway that operates on both strands. The key point is that highly expressed genes had lower mutation rates on both strands than less expressed genes. A>G mutations showed a bias for the transcribed strand whereas G>A mutations occurred equally on each strand.
Now, I'll confess I don't generally get excited about looking at mutation spectra. A lot of this has been published before, though these papers offer a particularly rich and low-bias look. What I'm most interested in are recurrent mutations and rearrangements that may be driving the cancer, particularly if they suggest therapeutic interventions. The melanoma line contained two missense mutations in the gene SPDEF, which has been associated with multiple solid tumors. A truncating stop mutation was found by sequencing SPDEF in 48 additional tumors. A missense change was found in a metalloprotease (MMP28) which has previously been observed to be mutated in melanoma. Another missense mutation was found in a gene which may play a role in ultraviolet repair (though it has been implicated in other processes), suggesting a tumor suppressor role. The sequencing results confirmed two out of three known driver mutations in COLO-829: the V600E activating mutation in the kinase BRAF and deletion of the tumor suppressor PTEN. As noted above, the known 2 bp deletion in CDKN2A was not found through the automated process.
The SCLC sample has a few candidates for interestingly mutated genes. A fusion gene was found in which one partner (CREBBP) has been seen in leukemia gene fusions. An intragenic tandem duplication within the chromatin remodelling gene CHD7 was found which should generate an in-frame duplication of exons. Another SCLC cell line (NCI-H2171) was previously known to have a fusion gene involving CHD7. Screening of 63 other SCLC cell lines identified another (LU-135) with internal exon copy number alterations. LU-135 was further explored by mate-pair sequencing with a 3-4 Kb library, which identified a breakpoint involving CHD7. Expression analysis showed high expression of CHD7 in both LU-135 and NCI-H2171, and generally higher expression of CHD7 in SCLC lines than in non-small cell lung cancer lines and other tumor cell lines. An interesting twist is that the fusion partner in NCI-H2171 and LU-135 is a non-coding RNA gene called PVT1 -- which is thought to be a transcriptional target of the oncogene MYC. MYC is amplified in both these cell lines, suggesting multiple biological mechanisms resulting in high expression of CHD7. It would seem reasonable to expect some high profile functional studies of CHD7 in the not too distant future.
For functional point mutations, the natural place to look is at coding regions and splice junctions, as here we have the strongest models for ranking the likelihood that a mutation will have a biological effect. In the SCLC paper an effort was made to push this a bit further and look for mutations that might affect transcription factor binding sites. One candidate was found but not further explored.
In general, this last point underlines what I believe will be different about subsequent papers. Looking mostly at a single cancer sample, one is limited in what can be inferred. The mutational spectrum work is something which a single tumor can illustrate in detail, and such in-depth analyses will probably be significant parts of the first tumor sequencing paper for each tumor type, particularly other types with strong environmental or genetic mutational components. But in terms of learning what makes cancers tick and how we can interfere with that, the real need is to find recurrent targets of mutation. Various cancer genome centers have been promising a few hundred tumors sequenced over the next year. Already at the recent ASH meeting (which I did not attend), there were over a half dozen presentations or posters on whole genome or exome sequencing of leukemias, lymphomas and myelomas -- the first ripples of the tsunami to come. But the raw cost of targeted sequencing remains at most a tenth of the cost of an entire genome. The complete set of mutations found in either one of these papers could have been packed onto a single oligo-based capture scheme, and certainly a high-priority subset could be amplified by PCR without breaking the bank on oligos. I would expect that in the near future tumor sequencing papers will check their mutations and rearrangements on validation panels of at least 50 and preferably hundreds of samples (though assembling such sample collections is definitely not trivial). This will allow the estimation of the population frequency of those mutations which may recur at the level of 5-10% or more. With luck, some of those will suggest pharmacologic interventions which can be tested for their ability to improve patients' lives.
Pleasance, E., Stephens, P., O’Meara, S., McBride, D., Meynert, A., Jones, D., Lin, M., Beare, D., Lau, K., Greenman, C., Varela, I., Nik-Zainal, S., Davies, H., Ordoñez, G., Mudie, L., Latimer, C., Edkins, S., Stebbings, L., Chen, L., Jia, M., Leroy, C., Marshall, J., Menzies, A., Butler, A., Teague, J., Mangion, J., Sun, Y., McLaughlin, S., Peckham, H., Tsung, E., Costa, G., Lee, C., Minna, J., Gazdar, A., Birney, E., Rhodes, M., McKernan, K., Stratton, M., Futreal, P., & Campbell, P. (2009). A small-cell lung cancer genome with complex signatures of tobacco exposure Nature DOI: 10.1038/nature08629
Pleasance, E., Cheetham, R., Stephens, P., McBride, D., Humphray, S., Greenman, C., Varela, I., Lin, M., Ordóñez, G., Bignell, G., Ye, K., Alipaz, J., Bauer, M., Beare, D., Butler, A., Carter, R., Chen, L., Cox, A., Edkins, S., Kokko-Gonzales, P., Gormley, N., Grocock, R., Haudenschild, C., Hims, M., James, T., Jia, M., Kingsbury, Z., Leroy, C., Marshall, J., Menzies, A., Mudie, L., Ning, Z., Royce, T., Schulz-Trieglaff, O., Spiridou, A., Stebbings, L., Szajkowski, L., Teague, J., Williamson, D., Chin, L., Ross, M., Campbell, P., Bentley, D., Futreal, P., & Stratton, M. (2009). A comprehensive catalogue of somatic mutations from a human cancer genome Nature DOI: 10.1038/nature08658
Monday, December 14, 2009
Panda Genome Published!
Today's big genomics news is the advance publication in Nature of the giant panda (aka panda bear) genome sequence. For once, I'll be fighting someone (TNG) for my copy of Nature!
Pandas are the first bear (and alas, there is already someone making the mistaken claim otherwise in the Nature online comments) and only the second member of Carnivora (after dog) with a draft genome sequence. Little in the genome suggests that they have abandoned meat for a nearly all-plant diet, other than an apparent knockout of the taste receptor for glutamate, a key component of the taste of meat. So if you prepare bamboo for the pandas, don't bother with any MSG! But pandas do not appear to have acquired enzymes for attacking their bamboo, suggesting that their gut microflora do a lot of the work. So a panda gut microbiome project is clearly on the horizon. The sequence also greatly advances panda genetics: only 13 panda genes had previously been sequenced.
The assembly is notable for being composed entirely of Solexa data, using a mixture of library insert lengths. One issue touched on here (and one I've seen commented on elsewhere) is that the longer mate-pair libraries have serious chimaera issues and were not trusted to simply be fed into the assembly program, but were carefully added in a stepwise fashion (stepping up in library length) during later stages of assembly. It will be interesting to see what the Pacific Biosciences instrument can do in this regard -- instead of trying to edit out the middle of large inserts by enzymatic and/or physical means, PacBio apparently has a "dark fill" procedure of pulsing unlabeled nucleotides. This leads to islands of sequence separated by signal gaps of known duration, which can be used to estimate distance. Presumably such an approach will not generate chimaeras, though the raw base error rate may be higher.
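As a back-of-the-envelope illustration of how a dark gap of known duration might translate into an estimated distance (the polymerase speed and gap time below are made-up placeholders, not published PacBio figures):

```python
# Toy estimate of the span of a "dark" (unlabeled) fill between two sequence
# islands: distance ~ incorporation rate x elapsed time. Placeholder values only.
bases_per_second = 3.0   # assumed average polymerase speed (invented)
dark_seconds = 1200      # duration of the unlabeled pulse (invented)

estimated_gap_bases = bases_per_second * dark_seconds
print(f"Estimated dark-fill span: ~{estimated_gap_bases:.0f} bases")  # ~3600 bases
```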
I'm quite confused by their Table 1, which shows the progress of their assembly as different data were added in. The confusing part is that it shows progressive improvement in the N50 and N90 numbers with each step -- and then much worse numbers for the final assembly. The final N50 is 40 Kb, which is substantially shorter than dog (close to 100 Kb) but longer than platypus (13 Kb). It strikes me that a useful additional statistic (or actually set of statistics) for a mammalian genome would be to calculate what fraction of core mammalian genes (which would have to be defined) are contained on a single contig (or for what fraction you can find at least 50% of the coding region in one contig).
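For readers who don't live with these assembly statistics, a minimal sketch of how N50 and N90 are computed from a list of contig lengths (the lengths below are arbitrary illustrative values):

```python
# Minimal N50/N90 calculator: the N-x statistic is the contig length L such that
# contigs of length >= L together contain at least x% of the assembled bases.
def n_statistic(contig_lengths, fraction):
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0

contigs = [120_000, 80_000, 40_000, 40_000, 20_000, 10_000, 5_000]  # illustrative
print("N50:", n_statistic(contigs, 0.50))  # 80000
print("N90:", n_statistic(contigs, 0.90))  # 20000
```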
While the greatest threat to the panda's continued existence in the wild is habitat destruction, it is heartening to find out that pandas have a high degree of genetic variability -- almost twice the heterozygosity of humans. So there is apparently a lot of genetic diversity packed into the small panda population (around 1,600 individuals, based on DNA sampling of scat).
BTW, no, that is not the subject panda (Jingjing, who was the mascot for the Beijing Olympics) but rather my shot from our pilgrimage last summer to the San Diego Zoo. I think that is Gao Gao, but I'm not good about noting such things.
(update: forgot to put the Research Blogging bit in the post)
Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B., Bai, Y., Zhang, Z., Zhang, Y., Wang, W., Li, J., Wei, F., Li, H., Jian, M., Li, J., Zhang, Z., Nielsen, R., Li, D., Gu, W., Yang, Z., Xuan, Z., Ryder, O., Leung, F., Zhou, Y., Cao, J., Sun, X., Fu, Y., Fang, X., Guo, X., Wang, B., Hou, R., Shen, F., Mu, B., Ni, P., Lin, R., Qian, W., Wang, G., Yu, C., Nie, W., Wang, J., Wu, Z., Liang, H., Min, J., Wu, Q., Cheng, S., Ruan, J., Wang, M., Shi, Z., Wen, M., Liu, B., Ren, X., Zheng, H., Dong, D., Cook, K., Shan, G., Zhang, H., Kosiol, C., Xie, X., Lu, Z., Zheng, H., Li, Y., Steiner, C., Lam, T., Lin, S., Zhang, Q., Li, G., Tian, J., Gong, T., Liu, H., Zhang, D., Fang, L., Ye, C., Zhang, J., Hu, W., Xu, A., Ren, Y., Zhang, G., Bruford, M., Li, Q., Ma, L., Guo, Y., An, N., Hu, Y., Zheng, Y., Shi, Y., Li, Z., Liu, Q., Chen, Y., Zhao, J., Qu, N., Zhao, S., Tian, F., Wang, X., Wang, H., Xu, L., Liu, X., Vinar, T., Wang, Y., Lam, T., Yiu, S., Liu, S., Zhang, H., Li, D., Huang, Y., Wang, X., Yang, G., Jiang, Z., Wang, J., Qin, N., Li, L., Li, J., Bolund, L., Kristiansen, K., Wong, G., Olson, M., Zhang, X., Li, S., Yang, H., Wang, J., & Wang, J. (2009). The sequence and de novo assembly of the giant panda genome Nature DOI: 10.1038/nature08696
Sunday, November 22, 2009
Targeted Sequencing Bags a Diagnosis
A nice complement to the paper (Ng et al) I detailed last week is a paper that actually came out just beforehand (Choi et al). Whereas the Ng paper used whole exome targeted sequencing to find the mutation for a previously unexplained rare genetic disease, the Choi et al paper used a similar scheme (though with a different choice of targeting platform) to find a known mutation in a patient, thereby diagnosing the patient.
The patient in question has a tightly interlocked pedigree (Figure 2), with two different consanguineous marriages shown. Put another way, this person could trace 3 paths back to one set of great-great-grandparents. Hence, they had quite a bit of DNA which was identical-by-descent, which meant that in these regions any low-frequency variant call could be safely ignored as noise. A separate scan with a SNP chip was used to identify such regions independently of the sequencing.
The patient was a 5-month-old male, born prematurely at 30 weeks and with "failure to thrive and dehydration". Two spontaneous abortions and the death of another premature sibling at day 4 also characterized this family; a litany of miserable suffering. Due to imbalances in the standard blood chemistry (which I wish the reviewers had insisted be explained further for those of us who don't frequent that world), a kidney defect was suspected, but other causes (such as infection) were not excluded.
The exome capture was this time on the Nimblegen platform, followed by Illumina sequencing. This is not radically different from the Ng paper, which used Agilent capture and Illumina sequencing. At the moment Nimblegen & Agilent appear to be the only practical options for whole exome-scale capture, though there are many capture schemes published and quite a few available commercially. Lots of variants were found. One that immediately grabbed attention was a novel missense mutation which was homozygous and in a known chloride transporter, SLC26A3. This missense mutation (D652N) targets a position which is almost utterly conserved across the protein family, and makes a significant change in side chain (acidic to polar uncharged). Most importantly, SLC26A3 has already been shown to cause "congenital chloride-losing diarrhea" (CLD) when mutated at other positions. Clinical follow-up confirmed that fluid loss was through the intestines and not the kidneys.
One of the genetic diseases of the kidney that had been considered was Bartter syndrome, which the more precise blood chemistry did not match. Given that one patient had been suspected of Bartter but instead had CLD, the group screened 39 more patients with Bartter but lacking mutations in 4 different genes linked to this syndrome. 5 of these patients had homozygous mutations in SLC26A3, 2 of which were novel. 190 control chromosomes were also sequenced; none had mutations. 3 of these patients had further follow-up & confirmation of water loss through the gastrointestinal tract.
This study again illustrates the utility of targeted sequencing for clinical diagnosis of difficult cases. While a whole exome scan is currently in the neighborhood of $20K, more focused searches could be run far cheaper. The challenge will be in designing economical panels which allow scanning the most important genes at low cost, and in designing such panels well. Presumably one could go through OMIM and find all diseases & syndromes which alter electrolyte levels and their known causative gene(s). Such panels might be doable for perhaps as low as $1-5K per sample; too expensive for routine newborn screening but far better than an endless stream of tests. Of course, such panels would miss novel genes or really odd presentations, so follow-up of negative results with whole exome sequencing might be required. With newer sequencing platforms available, the costs for this may plummet to a few hundred dollars per test, which is probably on par with what the current screening of newborns for inborn errors runs. One impediment to commercial development in this field may well be the rapid evolution of platforms; companies may be hesitant to bet on a technology that will not last.
Of course, to some degree the distinction between the two papers is artificial. The Ng et al paper actually, as I noted, did diagnose some of their patients with known genetic disease. Similarly, the patients in this study who are now negative for known Bartter syndrome genes and for CLD would be candidates for whole exome sequencing. In the end, what matters is to make the right diagnosis for each patient so that the best treatment or supportive care can be selected.
Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloğlu A, Ozen S, Sanjad S, Nelson-Williams C, Farhi A, Mane S, & Lifton RP (2009). Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proceedings of the National Academy of Sciences of the United States of America, 106 (45), 19096-101 PMID: 19861545
Thursday, November 19, 2009
Three Blows Against the Tyranny of Expensive Experiments
Second generation sequencing is great, but one of its major issues so far is that the cost of one experiment is quite steep. Just looking at reagents, going from a ready-to-run library to sequence data is somewhere in the neighborhood of $10K-25K on 454, Illumina, Helicos or SOLiD (I'm willing to take corrections on these values, though they are based on reasonable intelligence). While in theory you can split this cost over multiple experiments by barcoding, that can be very tricky to arrange. Perhaps if core labs would start offering '1 lane of Illumina - Buy It Now!' on eBay the problem could be solved, but finding a spare lane isn't easy.
This issue manifests itself in other ways. If you are developing new protocols anywhere along the pipeline, your final assay is pretty expensive, making it challenging to iterate cheaply. I've heard rumors that even some of the instrument makers feel inhibited in process development. It can also make folks a bit gun-shy; Amanda heard firsthand tonight from someone lamenting a project stymied under such circumstances. Even for routine operations, the methods of QC are pretty inexact, in that they don't really test whether the library is any good, just whether some bulk property (size, PCRability, quantity) is within a spec. This huge atomic cost is also a huge barrier to utilization in a clinical setting; does the clinician really want to wait some indefinite amount of time until enough patient samples are queued to make the cost per sample reasonable?
Recently, I've become aware of three hopeful developments on this front. The first is the Polonator, which according to Kevin McCarthy has a consumable cost of only about $500 per run (post library construction). $500 isn't nothing to risk on a crazy idea, but it sure beats $10K. There aren't many Polonators around, but for method development in areas such as targeted capture it would seem like a great choice.
Today, another shoe fell. Roche has announced a smaller version of the 454 system, the GS Junior. While the instrument cost wasn't announced, it will supposedly generate 1/10th as much data (35+ Mb from 100K reads with 400 Q20 bases) for the same cost per basepair, suggesting that the reagent cost for a run will be in the neighborhood of $2.5K. Worse than what I described above, but rather intriguing. This is a system that may have a good chance to start making clinical inroads; $2.5K is a bit steep for a diagnostic but not ridiculous -- or you simply need to multiplex fewer samples to get the cost per sample decent. The machine is going to boast 400+ bp reads, playing to the current comparative strength of the 454 chemistry. While I doubt anyone would buy such a machine solely as an upfront QC for SOLiD or Illumina, with some clever custom primer design one probably could make libraries usable on 454 plus one other platform.
It's an especially auspicious time for Roche to launch their baby 454, as Pacific Biosciences has released some specs through GenomeWeb's In Sequence, and from what I've been able to scrounge up (I can't quite talk myself into asking for a subscription), this is going to put some real pressure across the market, but particularly on 454. The key specs I can find are a per-run cost of $100, which will get you approximately 25K-30K reads of 1.5 Kb each -- or around 45 Mb of data. It may also be possible to generate 2X the data for nearly the same cost; apparently the reagents packed with one cell are really good for two runs in series. Each cell takes 10-15 minutes to run (at least in some workflows) and the instrument can be loaded up with 96 of them to be handled serially. This is a similar ballpark to what the GS Junior is being announced with, though with fewer reads but longer read lengths. I haven't been able to find any error rate estimates or the instrument cost. I'll assume, just because it is new and single molecule, that the error rate will give Roche some breathing room.
But in general, PacBio looks set to really grab the market where long reads, even noisy ones, are valuable. One obvious use case is transcriptome sequencing to find alternative splice forms. Another would be to provide 1.5Kb scaffolds for genome assembly; what I've found also suggests PacBio will offer a 'strobe sequencing' mode which is akin to Helicos' dark filling technology, which is a means to get widely spaced sequence islands. This might provide scaffolding information in much larger fragments. 10Kb? 20Kb? And again, though you probably wouldn't buy the machine just for this, at $100/run it looks like a great way to QC samples going into other systems. Imagine checking a library after initial construction, then after performing hybridization selection and then after another round of selection! After all, the initial PacBio instrument won't be great for really deep sequencing. It appears it would be $5K-10K to get approximately 1X coverage of a mammalian genome -- but likely with a high error rate.
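As a quick sanity check on that last figure, here is the back-of-the-envelope arithmetic using the rumored specs quoted above (a ~3 Gb genome, ~45 Mb and $100 per cell; all rough approximations):

```python
# Rough cost to reach 1X coverage of a mammalian-sized genome on the rumored
# PacBio specs quoted above; every number here is an approximation.
genome_size_mb = 3_000    # ~3 Gb mammalian genome
yield_per_cell_mb = 45    # ~25K-30K reads x ~1.5 Kb
cost_per_cell = 100       # dollars per cell (rumored)

cells_needed = genome_size_mb / yield_per_cell_mb
print(f"~{cells_needed:.0f} cells, ~${cells_needed * cost_per_cell:,.0f} for 1X coverage")
# ~67 cells, ~$6,667 -- consistent with the $5K-10K range above
```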
The ability to easily sequence 96 samples at a time (though it isn't clear what sample prep will entail) does suggest some interesting possibilities. For example, one could do long survey sequencing of many bacterial species, with each well yielding 10X coverage of an E.coli-sized genome (a lot of bugs are this size or smaller). The data might be really noisy, but for getting a general lay-of-the-land it could be quite useful -- perhaps the data would be too noisy to tell which genes were actually functional vs. decaying pseudogenes, but you would be able to ask "what is the upper bound on the number of genes of protein family X in genome Y". If you really need high quality sequence, then a full run (or targeted sequencing) could follow.
At $100 per experiment, the sagging Sanger market might take another hit. If a quick sample prep to convert plasmids to usable form is released, then ridiculous oversampling (imagine a scenario of 100K reads on a typical 1.5 Kb insert in pUC!) might overcome a high error rate.
One interesting impediment which PacBio has acknowledged is that they won't be able to ramp up instrument production as quickly as they might like and will be trying to place (ration) instruments strategically. I'm hoping at least one goes to a commercial service provider or a core lab willing to solicit outside business, but I'm not going to count on it.
Will Illumina & Life Technologies (SOLiD) try to create baby sequencers? Illumina does have a scheme to convert their array readers to sequencers, but from what I've seen these aren't expected to save much on reagents. Life does own the VisiGen technology, which is apparently similar to PacBio's but hasn't yet yielded a real proof-of-concept paper -- at least none that I could find; their key patent has issued, which is reading material for another night.
This issue manifests itself in other ways. If you are developing new protocols anywhere along the pipeline, your final assay is pretty expensive, making it challenging to work inexpensively. I've heard rumors that even some of the instrument makers feel inhibited in process development. It can also make folks a bit gun shy; Amanda heard first hand tonight from someone lamenting a project stymied under such circumstances. Even for routine operations, the methods of QC are pretty inexact so far as they don't really test whether the library is any good, just whether some bulk property (size, PCRability, quantity) is within a spec. This huge atomic cost also the huge barrier to utilization in a clinical setting; does the clinician really want to wait some indefinite amount of time until enough patient samples are queued to make the cost/sample reasonable?
Recently, I've become aware of three hopeful developments on this front. The first is the Polonator, which according to Kevin McCarthy has a consumable cost of only about $500 per run (post library construction). $500 isn't nothing to risk on a crazy idea, but it sure beats $10K. There aren't many Polonators around, but for method development in areas such as targeted capture it would seem like a great choice.
Today, another shoe fell. Roche has announced a smaller version of the 454 system, the GS Junior. While the instrument cost wasn't announced, it will supposedly generate 1/10th as much data (35+Mb from 100Kreads with 400 Q20 bases) for the same cost per basepair, suggesting that the reagent cost for a run will be in the neighborhood of $2.5K. Worse than what I described above, but rather intriguing. This is a system that may have a good chance to start making clinical inroads; $2.5K is a bit steep for a diagnostic but not ridiculous -- or you simply need to multiplex fewer samples to get the cost per sample decent. The machine is going to boast 400+bp reads, playing to the current comparative strength of the 454 chemistry. The instrument cost wasn't mentioned. While I doubt anyone would buy such a machine solely as an upfront QC for SOLiD or Illumina, with some clever custom primer design one probably could make libraries useable 454 plus one other platform.
It's an especially auspicious time for Roche to launch their baby 454, as Pacific Biosciences released some specs through GenomeWeb's In Sequence and what I've been able to scrounge about (I can't quite talk myself into asking for a subscription) this is going to put some real pressure across the market, but particularly on 454. The key specs I can find are a per run cost of $100 which will get you approximately 25K-30K reads of 1.5Kb each -- or around 45Mb of data. It may also be possible to generate 2X the data for nearly the same cost; apparently the reagents packed with one cell are really good for two run in series. Each cell takes 10-15 minutes to run (at least in some workflows) and the instrument can be loaded up with 96 of them to be handled serially. This is a similar ballpark to what the GS Junior is being announced with, though with fewer reads but longer read lengths. I haven't been able to find any error rate estimates or the instrument cost. I'll assume, just because it is new and single molecule, that the error rate will give Roche some breathing room.
But in general, PacBio looks set to really grab the market where long reads, even noisy ones, are valuable. One obvious use case is transcriptome sequencing to find alternative splice forms. Another would be to provide 1.5Kb scaffolds for genome assembly; what I've found also suggests PacBio will offer a 'strobe sequencing' mode which is akin to Helicos' dark filling technology, which is a means to get widely spaced sequence islands. This might provide scaffolding information in much larger fragments. 10Kb? 20Kb? And again, though you probably wouldn't buy the machine just for this, at $100/run it looks like a great way to QC samples going into other systems. Imagine checking a library after initial construction, then after performing hybridization selection and then after another round of selection! After all, the initial PacBio instrument won't be great for really deep sequencing. It appears it would be $5K-10K to get approximately 1X coverage of a mammalian genome -- but likely with a high error rate.
The ability to easily sequence 96 samples at a time (though it isn't clear what sample prep will entail) does suggest some interesting possibilities. For example, one could do long survey sequencing of many bacterial species, with each well yielding 10X coverage of an E.coli-sized genome (a lot of bugs are this size or smaller). The data might be really noisy, but for getting a general lay-of-the-land it could be quite useful -- perhaps the data would be too noisy to tell which genes were actually functional vs. decaying pseudogenes, but you would be able to ask "what is the upper bound on the number of genes of protein family X in genome Y". If you really need high quality sequence, then a full run (or targeted sequencing) could follow.
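The coverage claims in the last two paragraphs are easy to sanity-check from the ~45Mb-per-cell figure quoted above; the genome sizes are my own round-number assumptions (~3 Gb for a mammal, ~4.6 Mb for E. coli).

```python
# Sanity-checking the coverage estimates above, using the ~45 Mb-per-cell figure.
# Genome sizes are my round-number assumptions: ~3 Gb mammal, ~4.6 Mb E. coli.

yield_per_cell_mb = 45.0
cost_per_cell = 100

cells_for_1x_mammal = 3000.0 / yield_per_cell_mb       # ~67 cells
print("1X mammalian genome: ~%.0f cells, ~$%.0f" %
      (cells_for_1x_mammal, cells_for_1x_mammal * cost_per_cell))   # ~$6,700

print("Per-well coverage of an E. coli-sized genome: ~%.0fX" %
      (yield_per_cell_mb / 4.6))                       # ~10X
```

Both answers land comfortably inside the $5K-10K and 10X figures quoted above.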
At $100 per experiment, the sagging Sanger market might take another hit. If a quick sample prep to convert plasmids to usable form is released, then ridiculous oversampling (imagine 100K reads on a typical 1.5Kb insert in a pUC scenario!) might overcome a high error rate.
One interesting impediment which PacBio has acknowledged is that they won't be able to ramp up instrument production as quickly as they might like and will be trying to place (ration) instruments strategically. I'm hoping at least one goes to a commercial service provider or a core lab willing to solicit outside business, but I'm not going to count on it.
Will Illumina & Life Technologies (SOLiD) try to create baby sequencers? Illumina does have a scheme to convert their array readers to sequencers, but from what I've seen these aren't expected to save much on reagents. Life does own the VisiGen technology, which is apparently similar to PacBio's but hasn't yet published a real proof-of-concept paper -- at least that I could find; their key patent has issued -- reading material for another night.
Tuesday, November 17, 2009
Decode -- Corpse or Phoenix?
The news that Decode has filed for bankruptcy is a sad milestone in the history of genomics companies. Thus falls either the last or next-to-last of the human gene mapping companies, with everyone else having either disappeared entirely or exited that business. A partial list would include Sequana, Mercator, Myriad, Collaborative Research/Genome Therapeutics, Genaera and (of course) Millennium. I'm sure I'm missing some others. The one possible survivor I can think of is Perlegen, though their website is pretty bare bones, suggesting they have exited as well.
The challenge all of these companies faced, and rarely beat, was how to convert mapping discoveries into a cash stream which could pay for all that mapping. Myriad could be seen as the one success, having generated the controversial BRCA tests from their data, but (I believe) they are no longer actively looking; instead, new tests are in-licensed from academics.
Most other companies shed their genomics efforts as part of becoming product companies; the real money is in therapeutics. Mapping turned out to be such a weak contributor to that value stream. A major problem is that mapping information rarely led to a clear path to a therapeutic; too many targets nicely validated by genetics were complete head-scratchers as to how to create a therapeutic. Not that folks didn't try; Decode even in-licensed a drug and acquired all the pieces for a full drug development capability.
Of course, perhaps Decode's greatest notoriety came from their deCodeMe DTC genetic testing business. Given the competition & controversy in this field, that was unlikely to save them. The Icelandic financial collapse, I think, did them some serious damage as well. That's a reminder that companies, regardless of how they are run, sometimes have their fate channeled by events far beyond their control. A similar instance was the loss of Lion's CFO in the 9/11 attacks; he was soliciting investors at the WTC that day. The 9/11 deflation of the stock market definitely crimped a lot of money-losing biotechs' plans for further fund raising.
Bankruptcies were once very rare for biotech, but quite a few have been announced recently. The old strategy of selling off the company at fire sale prices seems to be less in style these days; assets are now being sold as part of the bankruptcy proceedings. Apparently, at least some of deCode's operations will continue. Bankruptcy in this case is a way of shedding incurred obligations viewed as nuisances; anyone betting on another strategy by buying the stock is out of luck.
Personally, I wish that the genetic database and biobanks which deCode has created could be transferred to an appropriate non-profit such as the Sanger. I doubt much of that data will ever be convertible into cash, particularly at the scale most investors are looking for. But a non-profit could extract the useful information and get it published, which was deCode's forte, though I doubt they've mined everything that can be mined.
Sunday, November 15, 2009
Targeted Sequencing Bags a Rare Disease
Nature Genetics on Friday released the paper from Jay Shendure, Deborah Nickerson and colleagues which used targeted sequencing to identify the damaged gene in a rare Mendelian disorder, Miller syndrome. The work had been presented at least in part at recent meetings, but now all of us can digest it in its entirety.
The impressive economy of this paper is that they targeted (using Agilent chips) less than 30Mb of the human genome, which is less than 1%. They also worked with very few samples; only about 30 cases of Miller syndrome have been reported in the literature. While I've expressed some reservations about "exome sequencing", this paper does illustrate why it can be very cost effective, and my objections (perhaps not made clear enough before) are more a worry about being too restricted to "exomes" and less about targeting.
Only four affected individuals (two siblings and two individuals unrelated to anyone else in the study) were sequenced, each at around 40X coverage of the targeted regions. Since Miller is so vanishingly rare, the causative mutations should be absent from samples of human diversity such as dbSNP or the HapMap, so these were used as a filter. Non-synonymous (protein-altering) mutations, splice site mutations & coding indels were considered as candidates. Both dominant and recessive models were considered. Combining the data from both siblings, 228 candidate dominant genes and 9 recessive ones fell out. Adding the data from the unrelated individuals zeroed in on a single gene, DHODH, under the recessive model (but 8 remained in the dominant model). Using a conservative statistical model, the odds of finding this by chance were estimated at 1.5x10^-5.
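The filtering idea is simple enough to sketch in a few lines. This is not the authors' actual pipeline -- just a toy illustration of the logic described above, with a purely hypothetical data layout and thresholds.

```python
# Toy sketch of the exome filtering logic described above (NOT the authors'
# code; the data layout is an illustrative assumption). Keep genes carrying
# novel, protein-altering variants in every affected individual: >=1 such
# variant under a dominant model, >=2 under a recessive model.

def candidate_genes(calls_by_person, known_variants, min_hits):
    """calls_by_person: {person: [(gene, variant_id, alters_protein), ...]}"""
    surviving_sets = []
    for person, calls in calls_by_person.items():
        hits_per_gene = {}
        for gene, variant_id, alters_protein in calls:
            # Filter: protein-altering/splice/indel AND absent from dbSNP/HapMap
            if alters_protein and variant_id not in known_variants:
                hits_per_gene[gene] = hits_per_gene.get(gene, 0) + 1
                # (a real pipeline would count a homozygous call twice here)
        surviving_sets.append({g for g, n in hits_per_gene.items() if n >= min_hits})
    # A causal gene must survive the filter in every affected individual
    return set.intersection(*surviving_sets)

# Recessive model across the affected individuals:
# candidates = candidate_genes(calls, dbsnp_ids | hapmap_ids, min_hits=2)
```

The key point the sketch captures is that each additional unrelated affected individual is an intersection step, which is why the candidate list collapses so quickly.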
An interesting curve was thrown by nature. If predictions were made as to whether mutations would be damaging, then DHODH was excluded as a candidate gene under a recessive model. Both siblings carried one allele (G605A) predicted to be neutral but another allele predicted to be damaging.
Another interesting curve is a second gene, DNAH5, which was a candidate considering only the siblings' data but was ruled out by the other two individuals' data. However, this gene is already known to be linked to a Mendelian disorder. The two siblings had a number of symptoms which do not fit with any other Miller case -- and which fit the symptoms of DNAH5 mutation well. So these two individuals have two rare genetic diseases!
Getting back to DHODH, is it the culprit in Miller? Sequencing three further unrelated patients found them all to be compound heterozygotes for mutations predicted to be damaging. So it becomes reasonable to infer that a false prediction of non-damaging was made for G605A. Sequencing of DHODH in parents of the affected individuals confirmed that each was a carrier, ruling out DHODH as a causative gene under a dominant model.
DHODH is known to encode dihydroorotate dehydrogenase, which catalyzes a biochemical step in the de novo synthesis of pyrimidines. This is a pathway targeted in some cancer chemotherapies, with the unfortunate result that some individuals are exposed to these drugs in utero -- and these persons manifest symptoms similar to Miller syndrome. Furthermore, another genetic disease (Nager syndrome) has great overlap in symptoms with Miller -- but sequencing of DHODH in 12 unrelated patients failed to find any coding mutations.
The authors point to the possible impact of this approach. They note that there are 7,000 diseases which affect fewer than 200K patients in the U.S. (a widely used definition of rare disease), but in aggregate this is more than 25M persons. Identifying the underlying mutations for a large fraction of these diseases would advance our understanding of human biology greatly, and with a bit of luck some of these mutations will suggest practical therapeutic or dietary approaches which can ameliorate the disease.
Despite the success here, they also underline opportunities for improvement. First, in some cases variant calling was difficult due to poor coverage in repeated regions. Conversely, some copy number variation manifested itself in false positive calls of variation. Second, the SNP databases for filtering will be most useful if they are derived from similar populations; if studying patients with a background poorly represented in dbSNP or HapMap then those databases won't do.
How economical a strategy would this be? Whole exome sequencing on this scale can be purchased for a bit under $20K/individual; to try to do this by Sanger would probably be at least 25X that. So whole exome sequencing of the 4 original individuals would be less than $100K for sequencing (but clearly a bunch more for interpretation, sample collection, etc). The follow-up sequencing would add a bit, but probably less than one exome's worth of sequencing. Even if a study turned up a lot of candidate variants, smaller scale targeted sequencing can be had for $5K or less per sample. Digging into the methods, the study actually used two passes of array capture -- the second to clean up what wasn't captured well by the first array design & to add newer gene predictions. This is a great opportunity to learn from these projects -- the array designs can keep being refined to provide even coverage across the targeted genes. And, of course, as the cost per base of the sequencing portion continues its downwards slide this will get even more attractive -- or possibly simply be displaced by really cheap whole genome sequencing. If the cost of the exome sequencing can be approximately halved, then perhaps a project similar to this could be run for around $100K.
So, if 700 diseases could each be examined at $100K/disease, that would come out to $70M -- hardly chump change. This underlines the huge utility of getting sequencing costs down another order of magnitude. At $1000/genome, the sequencing costs of the project would stop grossly overshadowing the other key areas -- sample collection & data interpretation. If the total cost of such a project could be brought down closer to $20K, then we're looking at $14M to investigate all described rare genetic disorders. That's not to say it shouldn't be done at $70M or even several times that, but ideally some of the money saved by cheaper sequencing could go to elucidating the biology of the causative alleles such a campaign would unearth, because certainly many of them will be much more enigmatic than DHODH.
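For completeness, the arithmetic behind those figures, exactly as written above (the per-disease costs are the rough estimates from the preceding paragraphs, not anything more rigorous):

```python
# The back-of-envelope project math above, spelled out.
per_disease_now = 100000    # roughly 4-5 exomes plus follow-up per disease
per_disease_cheap = 20000   # hoped-for cost with much cheaper sequencing
n_diseases = 700

print("At current costs: $%.0fM" % (n_diseases * per_disease_now / 1e6))    # $70M
print("At reduced costs: $%.0fM" % (n_diseases * per_disease_cheap / 1e6))  # $14M
```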
Sarah B. Ng, Kati J. Buckingham, Choli Lee, Abigail W. Bigham, Holly K. Tabor, Karin M. Dent, Chad D. Huff, Paul T. Shannon, Ethylin Wang Jabs, Deborah A. Nickerson, Jay Shendure & Michael J. Bamshad (2009). Exome sequencing identifies the cause of a mendelian disorder. Nature Genetics. doi:10.1038/ng.499
Thursday, November 12, 2009
A 10,201 Genomes Project
With valuable information emerging from the 1000 (human) genomes project and now a proposal for a 10,000 vertebrate genome project, it's well past time to expose to public scrutiny a project I've been spitballing for a while, which I now dub the 10,201 genomes project. Why that? Well, first it's a bigger number than the others. Second, it's 101 squared.
Okay, perhaps my faithful assistant is swaying me, but I still think it's a useful concept, even if for the time being it must remain a gehunden experiment. All kidding aside, the goal would be to sequence the full breadth of caninity with the prime focus on elucidating the genetic machinery of mammalian morphology. In my biological world, that would be more than enough to justify such a project once the price tag comes down to a few million. With some judicious choices, some fascinating genetic influences on complex behaviors might also emerge. And yes, there is a possibility of some of this feeding back to useful medical advances, though one should be honest to say that this is likely to be a long and winding road. Claiming that every project will impact medicine really devalues the claim for any of them.
The general concept would be to collect samples from multiple individuals of every known dog breed, paying attention to important variation within breed standards. It would also be valuable to collect well-annotated samples from individuals who are not purebred but exhibit interesting morphology. For example, I've met a number of "labradoodles" (Labrador retriever x poodle) and they exhibit a wide range of sizes, coat colors and other characteristics -- precisely the fodder for such an experiment. In a similar manner, it is said that the same breed from geographically distant breeders may be quite distinct, so it would be valuable to collect individuals from far and wide. But going beyond domesticated dogs, it would be useful to sequence all the wild species as well. With genomes at $1K a run, this would make good sense. Of particular interest for a non-dog genome is the case of the fox lines, which have been bred over just a half century into a very docile line and a second selected for aggressive tendencies.
What realistically could we expect to find? One would expect a novel gene, as is the case with short legged breeds, to leap out. Presumably regions which have undergone selective sweeps would be spottable as well and linkable to traits. A wealth of high-resolution copy number information would certainly emerge.
Is it worth funding? Well, I'm obviously biased. But already the 10,000 vertebrate genome project has kicked up some dust from some who are disappointed that the genomics community has not had "an inordinate fondness for beetles" (only one sequenced so far). Genome sequencing is going to get much cheaper, but never "too cheap to meter". De novo projects will always be inherently more expensive due to more extensive informatics requirements -- the first annotation of a genome is highly valuable but requires extensive effort. I too am disappointed that a greater sampling of arthropods hasn't been sequenced -- and it's hard to imagine folks in the evo-devo world being happy about it either.
It's hard for me to argue against sequencing thousands of human germlines to uncover valuable medical information or to sequence tens of thousands of somatic cancer genomes for the same reason. But, even so I'd hate to see that push out funding for filling in more information about the tree of life. Still, do we really need 10,000 vertebrate genomes in the near future or 10,201 dog genomes? If the trade for doing only 5,000 additional vertebrates is doing 5,000 diverse invertebrates, I think that is hard to argue against. Depth vs. breadth will always be a challenging call, but perhaps breadth should be favored a bit more -- at least once I'm funded for my ultra-deep project!
Wednesday, November 11, 2009
A call for new technological minds for the genome sequencing instrument fields
There's a great article in the current Nature Biotechnology (alas, you'll need a subscription to read the full text) titled "The challenges of sequencing by synthesis", detailing the challenges around the current crop of sequencing-by-synthesis instruments. The paper was written by a number of the PIs on grants for $1K genome technology.
While there is one short section on the problem of sample preparation, the heart of the paper can be found in the other headings:
- surface chemistry
- fluorescent labels
- the enzyme-substrate system
- optics
- throughput versus accuracy
- read-length and phasing limitations
Each section is tightly written and well-balanced, with no obvious playing of favorites or bashing of anti-favorites present. Trade-offs are explored & the dreaded term (at least amongst scientists) "cost models" shows up; indeed there is more than a little bit of a nod to accounting -- but if sequencing is really going to be $1K/person on an ongoing basis the beans must be counted correctly!
I won't try to summarize much in detail; it really is hard to distill such a concentrated draught any further. Most of the ideas presented as possible solutions can be viewed as evolutionary relative to the current platforms, though a few exotic concepts are floated as well (such as synthetic aperture optics). It is noteworthy that an explicit goal of the paper is to summarize the problem areas so that new minds can approach the problem; as implied by the section title list above, this is clearly a multi-discipline problem. It does raise the question of whether Nature Biotechnology, a journal I am quite fond of, was the best place for this. If new minds are desired, perhaps Physical Review Letters would have been better. But that's a very minor quibble.
Fuller CW, Middendorf LR, Benner SA, Church GM, Harris T, Huang X, Jovanovich SB, Nelson JR, Schloss JA, Schwartz DC & Vezenov DV (2009). The challenges of sequencing by synthesis. Nature Biotechnology, 27(11), 1013-1023. PMID: 19898456
Tuesday, November 10, 2009
Occult Genetic Disease
A clinical aside by Dr. Steve over at Gene Sherpas piqued my interest recently. He mentioned a 74-year-old female patient of his with lung difficulties who turned out positive for cystic fibrosis both by the sweat test and by genetic testing. One of her grandchildren had CF, which appears to have been a key hint in this direction. This anecdote was particularly striking to me because I had recently finished Atul Gawande's "Better" (highly recommended), which had a chapter on CF. Even today, a well treated CF patient living to such an age would be remarkable; when this woman was born, living to 20 would be lucky. Clearly she either has a very modest deficit or some interesting modifier or such (late onset?) which allowed her to live to this age.
Now, if this patient didn't have any CF in her family, would one test for this? Probably not. But thinking more broadly, will this scenario be repeated frequently in the future when complete genome sequencing becomes a routine part of large numbers of medical files? Clearly we will have many "variants of unknown significance", but will we also find many cases of occult (hidden) genetic disease in which a patient shows clinical symptoms (but perhaps barely so)? Having a sensitive and definitive phenotypic test will assist this greatly; showing excess saltiness of sweat is pretty clear.
From a clinical standpoint, many of these patients may be confusing -- if someone is nearly asymptomatic should they be treated? But from a biology standpoint, they should prove very informative by helping us define the biological thresholds of disease or by uncovering modifiers. Even more enticing would be the very small chance of finding examples of partial complementation -- cases where two defective alleles somehow work together to generate enough function. One example I've thought of (admittedly a bit far-fetched, but not total science fiction) would be two alleles which each produce a protein subject to instability but when heterodimerized stabilize the protein just enough.
Thursday, October 29, 2009
My Most Expensive Paper
Genome Research has a paper detailing the Mammalian Gene Collection (MGC), and if you look way down on the long author list (which includes Francis Collins!) you'll see my name there along with those of two Codon Devices colleagues. This paper cost me a lot -- nothing in legal tender, but a heck of a lot of blood, sweat & tears.
The MGC is an attempt to have every human & mouse protein coding sequence (plus more than a few rat) available as an expression clone, with native sequence. Most of the genes were cloned from cDNA libraries, but coding sequences which couldn't be found that way were farmed out to a number of synthetic biology companies. Codon decided to take on a particularly challenging tranche of mostly really long ORFs, hoping to demonstrate our proficiency in this difficult task.
At the start, the attitude was "can-do". When it appeared we couldn't parse some targets into our construction scheme, I devised a new algorithm that captured a few more (which I blogged about cryptically). It was going to be a huge order which would fill our production pipeline in an expansive new facility we had recently moved into, replacing a charming but cramped historic structure. A new system for tracking constructs through the facility was about to be rolled out that would let us finally track progress across the pipeline without a human manager constantly looking over each plasmid's shoulder. The delivery schedule for MGC was going to be aggressive but would show our chops. We were going to conquer the world!
Alas, almost as soon as we started (and had sunk huge amounts of cash into oligos) we discovered ourselves in a small wicker container which was growing very hot. Suddenly, nothing was working in the production facility. A combination of problems -- some related to the move (a key instrument incorrectly recalibrated) and another whose source was never quite nailed down -- forced a complete halt to all production activity for several months, which soon meant that MGC was going to be the only trusty source of revenue -- if we could get MGC to release us from our now utterly undoable delivery schedule.
Eventually, we fixed the old problems & got new processes in place and pushed a bunch of production forward. We delivered a decent first chunk of constructs to MGC, demonstrating that we were for real (but still with much to deliver). Personnel were swiped from the other piece of the business (protein engineering) to push work forward. More and more staff came in on weekends to keep things constantly moving.
Even so, trouble still was a constant theme. Most of the MGC project consisted of large constructs, which were built by a hierarchical strategy. Which means the first key task was to build all the parts -- and some parts just didn't want to be built. We had two processes for building "leaves", and both underwent major revisions and on-the-fly process testing. We also started screening more and more plasmids by sequencing, sometimes catching a single correct clone in a mountain of botched ones (but running up a higher and higher capillary sequencing bill). Sometimes we'd get almost-right pieces, which could be fixed by site directed mutagenesis -- yet another unplanned cost in reagents & skilled labor. I experimented with partial redesigns of some builds -- but with the constraint of not ordering more costly oligos. Each of these pulled in a few more constructs, a few more delivered -- and a frustrating pile of still unbuilt targets.
Even when we had all the parts built, the assembly of them to the next stage was failing at alarming rates -- usually by being almost right. Yet more redesigns requiring fast dancing by the informatics staff to support. More constructs pushed through. More weekend shifts.
In the end, when Codon shut down its gene synthesis business -- about 10 months after starting the MGC project -- we delivered a large fraction of our assignment -- but not all of it. For a few constructs we delivered partial sequences for partial credit. It felt good to deliver -- and awful to not deliver.
Now, given all that I've described (and more I've left out), I can't help but feel a bit guilty about that author list. It was decided at some higher level that the author list would not be several miles long, and so some sort of cut had to be made. Easily 50 Codon employees played some role in the project, and certainly there were more than a dozen for whom it occupied a majority of their attention. An argument could have been easily made for at least that many Codon authors. But, the decision was made that the three of us who had most shared the project management aspect would go on the paper. In my case, I had ended up the main traffic cop, deciding which pieces needed to be tried again through the main pipeline and which should be directed to the scientist with magic hands. For me, authorship is a small token for the many nights I ran SQL queries at midnight to find out what had succeeded and what had failed in sequencing -- and then checked again at 6 in the morning before heading off to work. Even on weekends, I'd be hitting the database in the morning & night to find out what needed redirecting -- and then using SQL inserts to redirect them. I realized I was on the brink of madness when I was sneaking in queries on a family ski weekend.
Perhaps after such a checkered experience it is natural to question the whole endeavor. The MGC effort means that researchers who want to express a mammalian protein from a native coding sequence can do so. But how much of what we built will actually get used? Was it really necessary to build the native coding sequence -- which often gave us headaches in the builds from repeats & GC-rich regions (or, as we belatedly discovered, certain short runs of G could foul us up)? MGC is a great resource, but the goal of a complete catalog of mammalian genes wasn't realized -- some genes still aren't available from MGC or any of the commercial human gene collections.
MGC also torture-tested Codon's construction processes, and the original ones failed badly. Our in-progress revisions fared much better, but still did not succeed as frequently as they should have. When we could troubleshoot things, we could ascribe certain failures to almost every conceivable source -- bad enzymes, a bad oligo well, failure to follow procedures, laboratory mix-ups, etc. But an awful lot could not be pinned to any cause, despite investigation, suggesting that we simply did not understand our system well enough to use it in a high-throughput production environment.
I do know one thing: while I hope to stay where I am for a very long time, should I ever be looking for a job again I will avoid a production facility. Some gene synthesis projects were worse than MGC in terms of demanding customers with tight timelines (which is no knock on the customers; now I'm that customer!), but even with MGC I found it's just not the right match for me. It's no fun to burn so much effort on just getting something through the system so that somebody else can do the cool biology. I don't ever want to be in a situation where I'm on vacation and thinking about which things are stalled in the line. Some people thrive in the environment; I found it draining.
But, there is something to be said for the experience. I learned a lot which can be transferred to other settings. That which doesn't kill us makes us stronger -- MGC must have made me Superman.
Monday, October 26, 2009
DTC CNVs?
Curiosity question: do the current DTC genomics companies report out copy number variations (CNVs) to their customers? Are any of their technologies unable to read these? Clearly Knome (or Illumina, which isn't DTC but sort of competing with them) should be able to get this info from the shotgun sequencing. But what about the array-based companies such as Navigenics & 23andMe? My impression is that any high density SNP array data can be mined for copy number info, but perhaps there are caveats or restrictions on that.
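My mental model of how array data gets mined for copy number is something like the sketch below: look for sustained shifts in total probe intensity (the log R ratio) along a chromosome. This is a generic illustration, not any vendor's actual algorithm; the function name, window size and threshold are all invented, and a real method would also use B-allele frequencies and proper segmentation.

```python
# Generic illustration of mining SNP-array intensities for copy number:
# scan fixed windows of probes and flag sustained shifts in log R ratio (LRR).
# Not any company's actual method; window size and threshold are arbitrary.

def flag_cnv_windows(probes, window=25, lrr_cut=0.3):
    """probes: list of (position, log_r_ratio) tuples, sorted by position."""
    flagged = []
    for i in range(0, len(probes) - window + 1, window):
        chunk = probes[i:i + window]
        mean_lrr = sum(lrr for _, lrr in chunk) / window
        if mean_lrr <= -lrr_cut:
            flagged.append((chunk[0][0], chunk[-1][0], "possible loss"))
        elif mean_lrr >= lrr_cut:
            flagged.append((chunk[0][0], chunk[-1][0], "possible gain"))
    return flagged
```

If something like this is all that's required, it's hard to see a technical reason the array-based services couldn't report CNVs.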
It would seem that with CNVs so hot in the literature and a number of complex diseases being associated to them, this would be something the DTC companies would jump at. But have they?
Saturday, October 24, 2009
Now where did I misplace that genome segment of mine?
One of the many interesting ASHG tidbits from the Twitter feed is a comment from "suganthibala" which I'll quote in full
On average we each are missing 123 kb. homozygously. An incomplete genome is the norm. What a goofy species we are..
I'm horribly remiss in tracking the CNV literature, but this comment makes me wonder whether this is at all atypical. How extensively has this been profiled in other vertebrate species, and how do other species look in terms of the typical amount of genome missing? I found two papers for dogs, one of which features a former lab mate as senior author and the other of which has Evan Eichler in the author list. Some work has clearly been done in mouse as well.
Presumably there is some data for Drosophila, but how extensive? Are folks going through their collections of D. melanogaster collected from all over the world and looking for structural variation? With a second gen sequencer, this would be straightforward to do -- though a lot of libraries would need to be prepped! Many flies could be packed into one lane of Illumina data, so this would take some barcoding. Even cheaper might be to do it on a Polonator (reputed to cost about $500 in consumables per run, not including library prep).
Attacking this by paired-end/mate-pair NGS rather than arrays (which have been the workhorse so far) would enable detecting balanced rearrangements, which arrays are blind to (though there is another tweeted item that Eichler states "Folks you can't get this kind of information from nextgen sequencing; you need old-fashioned capillaries" -- I'd love to hear the background on that). That leads to another proto-thought: will the study of structural variation lead to better resolution of the conundrum of speciation and changes in chromosome structure -- i.e. it's easy to see how such rearrangements could lead to reproductive isolation, but harder to see how the first carriers would avoid being so isolated that they couldn't leave enough founders.
ASHG Tweets: Minor Fix or Slow Torture?
Okay, I'll admit it: I've been ignoring Twitter. It doesn't help that I never really learned to text (I might have sent one in my life). Maybe if I ever get a phone with a real keyboard, but even then I'm not sure. Live blogging from meetings seemed a bit interesting -- but in those tiny packets? I even came up with a great post on Twitter -- alas a few days after the first of April, when it would have been appropriate.
But now I've gotten myself hooked on the Twitter feed coming from attendees at the American Society for Human Genetics. It's an interesting mix -- some well established bloggers, lots of folks I don't know plus various vendors hawking their booths or off-conference tours and such. Plus, you don't even need a Twitter account!
The only real problem is it's really making me wish I was there. I've never been to Hawaii, despite a nearly lifelong interest in going. And such a cool meeting! But, you can't go to every meeting unless you're a journalist or event organizer (or sales rep!), so I had to stay home and get work done.
I suspect I'm hooked & will be repeating this exercise whenever I miss good conferences. Who knows? Maybe I'll catch the Twitter bug yet!
Thursday, October 22, 2009
Physical Maps IV: Twilight of the Clones?
I've been completely slacking on completing my self-imposed series on how second generation sequencing (I'm finally trying to kick the "next gen" term) might reshape the physical mapping of genomes. It hasn't been that my brain has been ignoring the topic, but somehow I've not extracted the thoughts through my fingertips. And I've figured out part of the reason for my reticence -- my next installment was supposed to cover BACs and other clone-based maps, and I'm increasingly thinking these aren't going to be around much longer.
Amongst the many ideas I turned over was how to adapt BACs to the second generation world. BACs are very large segments -- often a few hundred kilobases -- cloned into low copy (generally single copy) vectors in E.coli.
One approach would be to simply sequence the BACs. One key challenge is that a single BAC is poorly matched to a second-generation sequencer; even a single lane of a sequencer is gross overkill. So good high-throughput multiplex library methods are needed. Even so, there will be a pretty constant tax of resequencing the BAC vector and the inevitable contaminating host DNA in the prep. That's probably going to run about 10% wastage -- not unbearable, but certainly not pretty.
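As a rough check on that 10% figure, here is a toy calculation; the insert size, vector size and host-contamination level are values I am assuming for illustration.

# Rough estimate of the "resequencing tax" when shotgun-sequencing a BAC:
# the fraction of reads expected to land on vector backbone or contaminating
# host DNA. Insert size, vector size and contamination level are assumptions.
bac_insert_kb = 150          # typical BAC insert
vector_kb = 8                # typical BAC vector backbone
host_contamination = 0.05    # assumed fraction of prep DNA that is E. coli genomic

vector_fraction = vector_kb / (vector_kb + bac_insert_kb)
wasted = host_contamination + (1 - host_contamination) * vector_fraction
print("roughly {:.0%} of reads wasted on vector + host DNA".format(wasted))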
Another type of approach is end-sequencing. For this you really need long reads, so 454 is probably the only second-generation machine suitable. But you need to smash down the BAC clone to something suitable for emulsion PCR. I did see something in Biotechniques on a vectorette PCR to accomplish this, so it may be a semi-solved problem.
A complementary approach is to landmark the BACs, that is to identify a set of distinctive features which can be used to determine which BACs overlap. At the Providence conference one of the posters discussed getting 454 reads from defined restriction sites within a BAC.
But, any of these approaches still require picking the individual BACs and prepping DNA from them and performing these reactions. While converting to 454 might reduce the bill for the sequence generation, all that picking & prepping is still going to be expensive.
BACs' baby cousins are fosmids, which are essentially the same vector concept but designed to be packaged into lambda phage. Fosmids carry approximately 40Kb of DNA. I've already seen ads from Roche/454 claiming that their 20Kb mate-pair libraries obviate the need for fosmids. While 20Kb is only half the span, many of the issues that fosmids solve are short enough to be fixed by a 20Kb span, and the 454 approach enables getting lots of them.
This is all well and good, but perhaps it's time to look just a little bit further ahead. Third-generation technologies are getting close to reality (those who have early-access Pacific Biosciences machines might claim they are reality). Some of the nanopore systems detailed in Rhode Island are clearly far away from being able to generate sequences you would believe. However, physical mapping is a much less demanding application than trying to generate a consensus sequence or identify variants. Plenty of times in my early career it was possible to use BLAST to take amazingly awful EST sequences and successfully map them against known cDNAs.
I don't have any inside information on any third-generation systems, but I'm pretty sure I saw a claim that Pacific Biosciences has gotten reads close to 20Kb. Now, this could have been a "magic read" where all the stars were aligned. But imagine for a moment that this technology could routinely hit such lengths (or even longer) -- albeit with quality that makes it unusable for true sequencing but sufficient for aligning to islands of sequence in a genome assembly. If such a technology could generate sufficient numbers of such reads in reasonable time, the 454 20Kb paired libraries could start looking like buggy whips.
Taking this logic even further, suppose one of the nanopore technologies could really scan very long DNAs, perhaps 100Kb or more. Perhaps the quality is terrible, but again, it only needs to be just good enough. For example, suppose the error rate was 15%, or roughly a phred 8 score. AWFUL! But in a sequence of 10,000 bases (standing in for a fair-sized sequence island in an assembly) you'd expect to find nearly 3 runs of 50 correct bases (a quick sanity check on that arithmetic is sketched below). Clearly some clever algorithmics would be required (especially since with nanopores you don't know which direction the DNA is traversing the pore), but this suggests that some pretty rotten sequencing could be used to order sequence islands along long reads.
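Here is that sanity check in Python; it counts starting positions of runs, which slightly overcounts overlapping runs, but that's fine for a back-of-the-envelope estimate.

import math

# Sanity check: at a 15% per-base error rate, how many stretches of 50
# consecutive correct bases should a 10,000-base sequence island contain?
error_rate = 0.15
run_length = 50
island_length = 10000

p_run = (1 - error_rate) ** run_length                    # P(50 correct in a row)
expected_runs = (island_length - run_length + 1) * p_run  # count over starting positions
phred = -10 * math.log10(error_rate)

print("phred ~{:.1f}; expected runs of {} correct bases: {:.1f}".format(
    phred, run_length, expected_runs))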
Yet another variant on this line of thinking would be to use nanopores to read defined sequence landmarks from very long fragments. Once you have an initial assembly, a set of unique sequences can be selected for synthesis on microarrays. While PCR is required to amplify those oligos, it also offers an opportunity to subdivide the huge pool. Furthermore, with sufficiently long oligos on the chip one could even have multiple universal primer targets per oligo, enabling a given landmark to be easily placed in multiple orthogonal pools. With an optical nanopore reading strategy, 4 or more color-coded pools could be hybridized simultaneously and read. Multiple colors might be used for more elaborate coding of sequence islands -- i.e. one island might be encoded with a series of flashing lights, much like some lighthouses. Again, clever algorithmics would be needed to design such probe strategies.
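To make the lighthouse idea a bit more concrete, here is a toy sketch of assigning islands short codewords over a 4-color alphabet, spaced so that a single mis-read flash can't turn one island's code into another's. The code length and minimum distance are arbitrary choices of mine, not anything from an actual probe-design scheme.

from itertools import product

# Toy "lighthouse" coding: give each sequence island a short codeword over a
# 4-color alphabet, keeping codewords at least Hamming distance 2 apart so a
# single mis-called flash can't convert one island's code into another's.
COLORS = "RGBY"
CODE_LENGTH = 4
MIN_DIST = 2

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

codes = []
for candidate in product(COLORS, repeat=CODE_LENGTH):
    word = "".join(candidate)
    if all(hamming(word, kept) >= MIN_DIST for kept in codes):
        codes.append(word)

print(len(codes), "islands can be labeled; first few codes:", codes[:5])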
How far away would such ideas be? Someone more knowledgeable about the particular technologies could guess better than I could. But, it would certainly be worth exploring, at least on paper, for anyone wanting to show that nanopores are close to prime time. While really low quality reads or just landmarking molecules might not seem exciting, it would offer a chance to get the technology into routine operation -- and from such routine operation comes continuous improvement. In other words, the way to push nanopores into routine sequencing might be by carefully picking something other than sequence -- but making sure that it is a path to sequencing and not a detour.
Wednesday, October 14, 2009
Why I'm Not Crazy About The Term "Exome Sequencing"
I find myself worrying sometimes that I worry too much about the words I use -- and worrying some of the rest of the time that I don't worry enough. What can seem like the right words at one time might seem wrong at another. The term "killer app" is thrown around a lot in the tech space, but would you really want to hear it used about sequencing a genome if you were the patient whose DNA was under scrutiny?
One term that sees a lot of traction these days is "exome sequencing". I listened in on a free Science magazine webinar today on the topic, and the presentations were all worthwhile. The focus was on the Nimblegen capture technology (Roche/Nimblegen/454 sponsored the webinar), though other technologies were touched on.
By "exome sequencing" what is generally meant is to capture & sequence the exons in the human genome in order to find variants of interest. Exons have the advantage of being much more interpretable than non-coding sequences; we have some degree of theory (though quite incomplete) which enables prioritizing these variants. The approach also has the advantage of being significantly cheaper at the moment than whole genome sequencing (one speaker estimated $20K per exome). So what's the problem?
My concern is that the term "exome sequencing" is taken a bit too literally. Now, it is true that these approaches catch a bit of surrounding DNA due to library construction and that the targeting approaches cover splice junctions, but what about some of the other important sequences? According to my poll of practitioners of this art, their targets are entirely exons (confession: N=1 for the poll).
I don't have a general theory for analyzing non-coding variants, but conversely there are quite a few well-annotated non-coding regions of functional significance. An obvious case is promoters. Annotation of human promoters and enhancers and other transcriptional doodads is an ongoing process, but some have been well characterized. In particular, the promoters for many drug-metabolizing enzymes have been scrutinized because variants there may have significant effects on how much of the enzyme is synthesized and therefore on drug metabolism.
Partly coloring my concern is the fact that exome sequencing kits are becoming standardized; at least two are on the market currently. Hence, the design shortcomings of today might influence a lot of studies. Clearly sequencing every last candidate promoter or enhancer would tend to defeat the advantages of exome sequencing, but I believe a reasonable shortlist of important elements could be rapidly identified.
My own professional interest area, cancer genomics, adds some additional twists. At least one major cancer genome effort (at the Broad) is using exome sequencing. On the one hand, it is true that there are relatively few recurrent, focused non-coding alterations documented in cancer. However, few is not none. For example, in lung cancer the c-Met oncogene has been documented to be activated by mutations within an intron; these mutations cause skipping of an exon encoding an inhibitory domain. Some of these alterations are about 50 nucleotides away from the nearest splice junction -- a distance that is likely to result in low or no coverage using the Broad's in-solution capture technology (confession #2: I haven't verified this with data from that system).
The drug-metabolizing enzyme promoters I mentioned before are a bit greyer for cancer genomics. On the one hand, one is primarily interested in what somatic mutations have occurred in the tumor. On the other hand, the norm in cancer genomics is tending towards applying the same approach to normal (cheek swab or lymphocyte) DNA from the patient, and why not get the DME promoters too? After all, these variants may have influenced the activity of therapeutic agents or even the development of the disease. Just as some somatic mutations seem to cluster enigmatically with patient characteristics, perhaps some somatic mutations will correlate with germline variants which contributed to disease initiation.
Whatever my worries, they should be time-limited. Exome sequencing products will be under extreme pricing pressure from whole-genome sequencing. The $20K cited (probably using 454 sequencing) is already potentially matched by one vendor (Complete Genomics). Now, in general the cost of capture will probably be a relatively small contributor compared to the cost of data generation, so exome sequencing will ride much of the same cost curve as the rest of the industry. But it is probably $1-3K for whole-exome capture due to the multiple chips required and the labor investment (anyone have a better estimate?). If whole mammalian genome sequencing really can be pushed down into the $5K range, then mammalian exome sequencing will not offer a huge cost advantage, if any. I'd guess interest in mammalian exome sequencing will peak in a year or two, so maybe I should stop worrying and learn to love the hyb.
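The cost argument can be made explicit with a tiny sketch: exome cost is capture plus sequencing, and the sequencing piece rides the same curve as whole-genome. The capture cost below uses the midpoint of the $1-3K estimate above; the exome-to-genome sequencing ratio is purely my own assumption.

# Exome cost = capture + sequencing, where the sequencing part rides the same
# cost curve as whole-genome sequencing (WGS). Ratio and capture cost assumed.
capture_cost = 2000
exome_to_genome_ratio = 0.1   # assumed: an exome needs ~10% of the sequencing of a WGS

for wgs_cost in (50000, 20000, 10000, 5000):
    exome_cost = capture_cost + exome_to_genome_ratio * wgs_cost
    print("WGS ${:>6,} -> exome ~${:>6,.0f} (advantage ${:,.0f})".format(
        wgs_cost, exome_cost, wgs_cost - exome_cost))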
Friday, October 09, 2009
Bad blog! Bad, bad, bad blog!
Thanks to Dan Koboldt from Mass Genomics, I've discovered that another blog (the Oregon Personal Injury Law Blog) had copied my breast cancer genome piece. Actually, it appears that since it started this summer it may have copied every one of my posts here at Omics! Omics! without any attribution or apparent linking back. I've left a comment (which is moderated) protesting this.
Curiously, the author of this blog (I assume it has one) doesn't seem to have left any identifying information or contact info, so for the moment the comments section is my only way of communicating. Perhaps this is some sort of weird RSS-driven bug; that's the only charitable explanation I can contemplate. But it is strange -- most of these posts have no possible link to personal injury -- or can PNAS sue me for complaining about their RSS feed?
We'll see if the author fixes this, or at least replies with something along the lines of "head down, ears flat & tail between the legs".
Just to double-check the RSS hypothesis, I'm actually going to explicitly sign this one -- Keith Robison from Omics! Omics!.
Nano Anglerfish Snag Orphan Enzymes
The new Science has an extremely impressive paper tackling the problem of orphan enzymes. Due primarily to Watson-Crick basepairing, our ability to sequence nucleic acids has shot far past our ability to characterize the proteins they may encode. If I want to measure an RNA's expression, I can generate an assay almost overnight by designing specific real-time PCR (aka RT-PCR aka TaqMan) probes. If I want to analyze any specific protein's expression, it generally involves a lot of teeth gnashing & frustration. If you're lucky, there is a good antibody for it -- but most times there is either no antibody or one of unknown (and probably poor) character. Mass spec based methods continue to improve, but still don't have an "analyze any protein in any biological sample anytime" character (yet?).
One result of this is that there are a lot of ORFs of unknown function in any sequenced genome. Bioinformatic approaches can make guesses for many of these and those guesses are often around enzymatic activity, but a bioinformatic prediction is not proof and the predictions are often quite vague (such as "hydrolase"). Structural genomics efforts sometimes pull in additional proteins whose sequence didn't resemble anything of known function, but whose structure has enzymatic characteristics such as nucleotide binding pockets. There have been one or two of such structures de-orphaned by virtual screening, but these are a rarity.
Attempts have been made at high-throughput screening of enzyme activities. For example, several efforts have been published in which cloned libraries of proteins from a proteome were screened for enzyme activity. While these produced initial papers, they've never seemed to really catch fire.
The new paper is audacious in providing an approach to detecting enzyme activities and subsequently identifying the responsible proteins, all from protein extracts. The key trick is an array of golden nano anglerfish -- well, that's how I imagine it. Like an anglerfish, the gold nanoparticles dangle their chemical baits off long spacers (poly-A, of all things!). In reverse of an anglerfish, the bait complex glows after it has been taken by its prey, with a clever unquenching mechanism activating the fluorophore and marking that a reaction took place. But the real kicker is that, like an anglerfish, the nanoparticles seize their prey! Some clever chemistry around a bound cobalt ion (which I won't claim to understand) results in linking the enzyme to the nanoparticle, from which it can be cleaved, trypsinized and identified by mass spectrometry. 1676 known metabolites and 807 other compounds of interest were immobilized in this fashion.
As one test, the researchers applied separately extracts of the bacteria Pseudomonas putida and Streptomyces coelicolor to arrays. Results were in quite strong agreement with the existing bioinformatic annotations of these organisms, in that the P.putida extract's pattern of metabolized and not metabolized substrates strongly coincided with what the informatics would predict and the same was true for S.coelicolor (with P < 5.77x10^-177 for the latter!). But, agreement was not perfect -- each species catalyzed additional reactions on the array which were absent from the databases. By identifying the bound proteins, numerous assignments were made which were either novel or significant refinements of the prior annotation. Out of 191 proteins identified in the P.putida set, 31 hypothetical proteins were assigned function, 47 proteins were assigned a different function and the previously ascribed function was confirmed for the remaining 113 proteins.
Further work was done with environmental samples. However, given the low protein abundance from such samples, these were converted into libraries cloned into E.coli and then the extracts from these E.coli strains analyzed. Untransformed E.coli was used to estimate the backgrounds to subtract -- I must confess a certain disappointment that the paper doesn't report any novel activities for E.coli, though it isn't clear that they checked for them (but how could you not!). The samples came from three extreme environments -- one from a hot, heavy metal rich acidic pool, one from oil-contaminated seawater and a third from a deep sea hypersaline anoxic region. From each sample a plethora of enzyme activities were discovered.
Of course, there are limits to this approach. The tethering mechanism may interfere with some enzymes acting on their substrates. It may, therefore, be desirable to place some compounds multiple times on the array but with the linker attached at different points. It is unlikely we know all possible metabolites (particularly for strange bugs from strange places), so some enzymes can't be deorphaned this way. And sensitivity issues may challenge finding some enzyme activities if very few copies of the enzyme are present.
On the other hand, as long as these issues are kept in mind this is an unprecedented & amazing haul of enzyme annotations. Application of this method to industrially important fungi & yeasts is another important area, and certainly only the bare surface of the bacterial world was scratched in this paper. Arrays with additional unnatural -- but industrially interesting -- substrates are hinted at in the paper. Finally, given the reawakened interest in small molecule metabolism in higher organisms & their diseases (such as cancer), application of this method to human samples can't be far behind.
Ana Beloqui, María-Eugenia Guazzaroni, Florencio Pazos, José M. Vieites, Marta Godoy, Olga V. Golyshina, Tatyana N. Chernikova, Agnes Waliczek, Rafael Silva-Rocha, Yamal Al-ramahi, Violetta La Cono, Carmen Mendez, José A. Salas, Roberto Solano, Michail M. Yakimov, Kenneth N. Timmis, Peter N. Golyshin, & Manuel Ferrer (2009). Reactome array: Forging a link between metabolome and genome. Science, 326 (5950), 252-257. doi:10.1126/science.1174094
Wednesday, October 07, 2009
The genomic history of a breast cancer revealed
Today's Nature contains a great paper which is one more step forward for cancer genomics. Using Illumina sequencing a group in British Columbia sequenced both the genome and transcriptome of a metastatic lobular (estrogen receptor positive) breast cancer. Furthermore, they searched a sample of the original tumor for mutations found in the genome+transcriptome screen in order to identify those that may have been present early vs. those which were acquired later.
From the combined genome sequence and RNA-Seq data they found 1456 non-synonymous changes, which were then trimmed to 1178 after removing pseudogenes and HLA sequences. 1120 of these could be re-assayed by Sanger sequencing of PCR amplicons from both normal DNA and the metastatic sample -- 437 of these were confirmed. Most of these (405) were found in the normal sample. Of the 32 remaining, 2 were found only in the RNA-Seq data, a point I'll return to below. Strikingly, none of the mutated genes were found in the previous whole-exome sequencing (by PCR+Sanger) of breast cancer, though those samples were of a different subtype (estrogen receptor negative).
There are a bunch of cool tidbits in the paper, which I'm sure I won't do full justice to here, but I'll do my best. For example, several other papers using RNA-Seq on solid cancers have identified fusion proteins, but in this paper none of the fusion genes suggested by the original sequencing came through their validation process. Most of the coding regions with non-synonymous mutations have not been seen to be mutated before in breast cancer, though ERBB2 (HER2, the target of Herceptin) is in the list along with PALB2, a gene which when mutated predisposes individuals to several cancers (and is also associated with BRCA2). The algorithm (SNVMix) used for SNP identification & frequency estimation is a good example of an Easter egg, a supplementary item that could easily be its own paper.
One great little story is HAUS3. This was found to have a truncating stop codon mutation, and the data suggest that the mutation is homozygous (but at normal copy number) in the tumor. A further screen of 192 additional breast cancers (112 lobular and 80 ductal) for several of the mutations found no copies of the same hits seen in this sample, but two more truncating mutations in HAUS3 were found (along with 3 more variations in ERBB2 within the kinase domain, a hotspot for cancer mutations). HAUS3 is particularly interesting because until about a year ago it was just C4orf15, an anonymous ORF on chromosome 4. Several papers have recently described a complex ("augmin") which plays a role in genome stability, and HAUS3 is a component of this complex. This starts smelling like a tumor suppressor (truncating mutations seen repeatedly; truncating mutation homozygous in the tumor; a protein involved in a function often crippled in cancer), and I'll bet HAUS3 will be showing up in some functional studies in the not too distant future.
Resequencing of the primary tumor was performed using amplicons targeting the mutations found in the metastatic tumor. These amplicons were small enough to be spanned directly by paired-end Illumina reads, obviating the need for library construction (a trick which has shown up in some other papers). By using Illumina sequencing for this step, the frequency of each mutation in the sample could be estimated. It is also worth noting that the primary tumor sample was a formalin-fixed paraffin-embedded slide, a way of preserving histology which is notoriously harsh on biomolecules and prone to sequencing artifacts. Appropriate precautions were taken, such as sequencing two different PCR amplifications from two different DNA extractions. The sequencing of the primary tumor suggests that only 10 of the mutations were present there, with only 4 of these showing a frequency consistent with being present in the primary clone and the others probably being minor components. This is another important filter to suggest which genes are candidates for being involved in early tumorigenesis and which are more likely late players (or simply passengers).
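For the curious, estimating a mutation's frequency from deep amplicon sequencing amounts to counting reads; here is a minimal sketch with a simple binomial (normal-approximation) confidence interval. The read counts are made up for illustration and are not from the paper, which I haven't checked for its exact statistical treatment.

import math

# Treat each read covering the site as a Bernoulli draw and put a
# normal-approximation interval around the observed variant allele fraction.
def allele_fraction(alt_reads, total_reads, z=1.96):
    p = alt_reads / total_reads
    se = math.sqrt(p * (1 - p) / total_reads)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

p, lo, hi = allele_fraction(alt_reads=120, total_reads=2000)  # hypothetical counts
print("VAF = {:.1%} (approx. 95% CI {:.1%} - {:.1%})".format(p, lo, hi))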
One more cool bit I parked above: the 2 variants seen only in the RNA-Seq library. This suggested RNA editing; consistent with this, an RNA editase (ADAR) was found to be highly represented in the RNA-Seq data. Two genes (COG3 and SRP9) showed high-frequency editing. RNA editing is beginning to be recognized as a widespread phenomenon in mammals (e.g. the nice work by Jin Billy Li in the Church lab); the possibility that cancers can hijack this for nefarious purposes should be an interesting avenue to explore. COG3 is a Golgi protein & links of the Golgi to cancer are starting to be teased out. SRP9 is part of the signal recognition particle involved in protein translocation into the ER -- which of course feeds the Golgi. Quite possibly this is coincidental, but it certainly rates investigating.
One final thought: the next year will probably be filled with a lot of similar papers. Cancer genomics is gearing up in a huge way, with Wash U alone planning 150 genomes well before a year from now. It seems unlikely that those 150 genomes will end up as 150 distinct papers, and it will be even more of a challenge to do this paper's level of follow-up on such a grand scale. A real challenge to the experimental community -- and the funding establishment -- is converting the tantalizing observations which will come pouring out of these studies into validated biological findings. With a little luck, biotech & pharma companies (such as my employer) will be able to convert those findings into new clinical options for doctors and patients.
Sohrab P. Shah, Ryan D. Morin, Jaswinder Khattra, Leah Prentice, Trevor Pugh, Angela Burleigh, Allen Delaney, Karen Gelmon, Ryan Guliany, Janine Senz, Christian Steidl, Robert A. Holt, Steven Jones, Mark Sun, Gillian Leung, Richard Moore, Tesa Severson, Greg A. Taylor, Andrew E. Teschendorff, Kane Tse, Gulisa Turashvili, Richard Varhol, René L. Warren, Peter Watson, Yongjun Zhao, Carlos Caldas, David Huntsman, Martin Hirst, Marco A. Marra, & Samuel Aparicio (2009). Mutational evolution in a lobular breast tumor profiled at single nucleotide resolution. Nature, 461, 809-813. doi:10.1038/nature08489