Monday, December 28, 2009

Length matters!

I was looking through part of my collection of papers using Illumina sequencing and discovered an unpleasant surprise: more than one does not seem to state the read length used in the experiment. While to some this may seem trivial, I had a couple of interests. First, it's useful for estimating what can be done with the technology, and second since read lengths have been increasing it is an interesting guesstimate of when an experiment was done. Of course, there are lots of reasons to carefully pick read length -- the shorter the length, the sooner the instrument can be turned over to another experiment. Indeed, a recent paper estimates that for RNA-Seq IF you know all the transcript isoforms then 20-25 nucleotides is quite sufficient and you are interested in measuring transcript levels (they didn't, for example, discuss the ideal length for mutation/SNP discovery). Of course, that's a whopping "IF", particularly for the sorts of things I'm interested in.

Now in some cases you can back-estimate the read length using the given statistics on numbers of mapped reads and total mapped nucleotides, though I'm not even sure these numbers are reliably showing up in papers. I'm sure to some authors & reviewers they are tedious numbers of little use, but I disagree. Actually, I'd love to see each paper (in the supplementary materials) show their error statistics by read position, because this is something I think would be interesting to see the evolution of. Plus, any lab not routinely monitoring this plot is foolish -- not only would a change show important quality control information, but it also serves as an important reminder to consider the quality in how you are using the data. It's particularly surprising that the manufacturers do not have such plots prominently displayed on their website, though of course those would be suspected of being cherry-picked. One I did see from a platform supplier had a horribly chosen (or perhaps deviously chosen) scale for the Y-axis, so that the interesting information was so compressed as to be nearly useless.

I should have a chance in the very near future to take a dose of my own prescription. On writing this, it occurs to me that I am unaware of widely-available software to generate the position-specific mismatch data for such plots. I guess I just gave myself an action item!

Friday, December 18, 2009

Nano Anglerfish or Feejee Mermaids?

A few months ago I blogged enthusiastically about a paper in Science describing an approach to deorphan enzymes in parallel. Two anonymous commenters were quite derisive, claiming the chemistry for generating labeled metabolites in the paper impossible. Now Science's editor Bruce Alberts has published an expression of concern, which cites worries over the chemistry as well as the failure of the authors to post promised supporting data to their website and changing stories as to how the work was done.

The missing supporting data hits a raw nerve. I've been frustrated on more than one occasion whilst reviewing a paper that I couldn't access their supplementary data, and have certainly encountered this as a reader as well. I've sometimes meekly protested as a reviewer; in the future I resolve to consider this automatic grounds for "needs major revision". Even if the mistake is honest, it means day considered important is unavailable for consideration. Given modern publications with data which is either too large to print or simply incompatible with paper, "supplementary" data is frequently either central to the paper or certainly just off center.

This controversy also underscores a challenge for many papers which I have faced as a reviewer. To be quite honest, I'm utterly unqualified to judge the chemistry in this paper -- but feel quite qualified to judge many of the biological aspects. I have received for review papers with this same dilemma; parts I can critique and parts I can't. The real danger is if the editor inadvertantly picks reviewers who all share the same blind spot. Of course, in an ideal world a paper would always go to reviewers capable of vetting all parts of it, but with many multidisciplinary papers that is unlikely to happen. However, it also suggests a rethink of the standard practice of assigning three reviewers per paper -- perhaps each topic area should be covered by three qualified reviewers (of course, the reviewers would need to honestly declare this -- and not at review deadline time when it is too late to find supplementary reviewers!).

But, it is a mistake to think that peer review can ever be a perfect filter on the literature. It just isn't practical to go over every bit of data with a fine toothed comb. A current example illustrates this: a researcher has been accused of faking multiple protein structures. While some suspicion was raised when other structures of the same molecule didn't agree, the smoking gun is that the structures have systematic errors in how the atoms are packed. Is any reviewer of a structure paper really going to check all the atomic packing details? At some point, the best defense against scientific error & misconduct is to allow the entire world to scrutinize the work.

One of my professors in grad school had us first year students go through a memorable exercise. The papers assigned one week were in utter conflict with each other. We spent the entire discussion time trying to finesse how they could both be right -- what was different about the experimental procedures and how issues of experiment timing might explain the discrepancies. At the end, we asked what the resolution was, and was told "It's simple -- the one paper is a fraud". Once we knew this, we went back and couldn't believe we had believed anything -- nothing in the paper really supported its key conclusion. How had we been so blind before? A final coda to this is that the fraudulent paper is the notorious uniparental mouse paper -- and of course cloning of mice turns out to actually be possible. Not, of course, by the methods originally published and indeed at that time (mid 1970s) it would be well nigh impossible to actually prove that a mouse was cloned.

With that in mind, I will continue to blog here about papers I don't fully understand. That is one bit of personal benefit for me -- by exposing my thoughts to the world I invite criticism and will sometimes be shown the errors in my thinking. It never hurts to be reminded that skepticism is always useful, but I'll still take the risk of occasionally being suckered by P.T. Barnum, Ph.D.. This is, after all, a blog and not a scientific journal. It's meant to be a bit noisy and occasionally wrong -- I'll just try to keep the mean on the side of being correct.

Thursday, December 17, 2009

A Doublet of Solid Tumor Genomes

Nature this week published two papers describing the complete sequencing of a cancer cell line (small cell lung cancer (SCLC) NCI-H209 and melanoma COLO-829) each along with a "normal" cell line from the same individual. I'll confess a certain degree of disappointment at first as these papers are not rich in the information of greatest interest to me, but they have grown on me. Plus, it's rather churlish to complain when I have nothing comparable to offer myself.

Both papers have a good deal of similar structure, perhaps because their author lists share a lot of overlap, including the same first author. However, technically they are quite different. The melanoma sequencing used the Illumina GAII, generating 2x75 paired end reads supplemented with 50x2 paired end reads from 3-4Kb inserts, whereas the SCLC paper used 2x25 mate pair SOLiD libraries with inserts between 400 and 3000 bp.

The papers have estimates of the false positive and false negative rates for the detection of various mutations, in comparison to Sanger data. For single base pair substitutions on the Illumina platform in the melanoma sample, 88% of previously known variants were found and 97% of a sample of 470 newly found variants confirmed by Sanger. However, on small insertion/deletion (indel) there was both less data and much less success. Only one small deletion was previously known, a 2 base deletion which is key to the biology. This was not found by the automated alignment and analysis, though reads containing this indel could be found in the data. A sample of 182 small indels were checked by Sanger and only 36% were confirmed. On large rearrangements, 75% of those tested confirmed by PCR.

The statistics for the SOLiD data in SCLC were comparable. 76% of previously known single nucleotide variants were found and 97% of newly found variants confirmed by Sanger. Two small indels were previously known and neither was found and conversely only 25% of predicted indels confirmed by Sanger. 100% of large rearrangements tested by PCR validated. So overall, both platforms do well for detecting rearrangements and substitutions and are very weak for small indels.

The overall mutation hauls were large, after filtering out variants found in the normal cell line. 22,910 substitutions for the SCLC line and 33,345 in the melanoma line. Both of these samples reflect serious environmental abuse; melanomas often arise from sun exposure and the particular cancer morphology the SCLC line is derived from is characteristic of smokers (the smoking history of the patient was unknown). Both lines showed mutation spectra in agreement with what is previously known about these environmental insults. 92% of C>T single substitutions occured at the second base of a pyrimidne dimers (CC or CT sequences). CC>TT double substitutions were also skewed in this manner. CpG dinucleotides are also to be hotspots and showed elevated mutation frequencies. Transcription-coupled repair repairs the transcribed strand more efficiently than the non-transcribed strand, and in concordance with this in transcribed regions there was nearly a 2:1 bias of C>T changes on the non-transcribed strand. However, the authors state (but I still haven't quite figured out the logic) that transcription-coupled repair can account for only 1/3 of the bias and suggest that another mechanism, previously suspected but not characterized, is at work. One final consequence of transcription-coupled repair is that the more expressed a gene is in COLO-829, the lower its mutational burden. A bias of mutations towards the 3' end of transcribed regions was also observed, perhaps because 5' ends are transcribed at higher levels (due to abortive transcription). A transcribed-strand bias was also seen in G>T mutations, which may be oxidative damage.

An additional angle on mutations in the COLO-829 melanoma line is offered by the observation of copy-neutral loss of heterozygosity (LOH) in some regions. In other words, one copy of a chromosome was lost but then replaced by a duplicate of the remaining copy. This analysis is enabled by having the sequence of the normal DNA to identify germline heterozygosity. Interestingly, in these regions heterzyogous mutations outnumber homozygous ones, marking that these substitutions occurred after the reduplication event. 82% of C>T mutations in these regions show the hallmarks of being early mutations, suggesting they occured late, perhaps after the melanoma metastasized and was therefore removed from ultraviolet exposure.

In a similar manner, there is a rich amount of information in the SCLC mutational data. I'll skip over a bunch to hit the evidence for a novel transcription-coupled repair pathway that operates on both strands. The key point is that highly expressed genes had lower mutation rates on both strands than less expressed genes. A>G mutations showed a bias for the transcribed strand whereas G>A mutations occured equally on each strand.

Now, I'll confess I don't generally get excited about looking a mutation spectra. A lot of this has been published before, though these papers offer a particulary rich and low-bias look. What I'm most interested in are recurrent mutations and rearrangements that may be driving the cancer, particularly if they suggest therapeutic interventions. The melanoma line contained two missense mutations in the gene SPDEF, which has been associated with multiple solid tumors. A truncating stop mutation was found by sequencing SPDEF out of 48 additional tumors. A missense change was found in a metalloprotease (MMP28) which has previously been observed to be mutated in melanoma. Another missense mutation was found in agene which may play a role in ultraviolet repair (though it has been implicated in other processes), suggesting a tumor suppressor role. The sequencing results confirmed two out of three known driver mutations in COLO-829: the V600E activating mutation in kinase BRAF and deletion of the tumor suppressor PTEN. As noted above, the know 2 bp deletion in CDKN2A was not found through the automated process.

The SCLC sample has a few candidates for interestingly mutated genes. A fusion gene in which one partner (CREBBP) has been seen in leukemia gene fusions was found. An intragenic tandem duplication within the chromatin remodelling gene CHD7 was found which should generate an in-frame duplication of exons. Another SCLC cell line (NCI-H2171) was previously known to have a fusion gene involving CHD7. Screening of 63 other SCLC cell lines identified another (LU-135) with internal exon copy number alterations. Lu-135 was further explored by mate pair sequencing witha 3-4Kb library, which identified a breakpoint involving CHD7. Expression analysis showed high expression levels of CHD7 in both LU-135 and NCI-H2171 and a general higher expression of CHD7 in SCLC lines than non-small cell lung cancer lines and other tumor cell lines. An interesting twist is that the fusion partner in NCI-H2171 abd KY-135 is a non-coding RNA gene called PVT1 -- which is thought to be a transcriptional target of the oncogene MYC. MYC is amplified in both these cell lines, suggesting multiple biological mechanisms resulting in high expression of CHD7. It would seem reasonable to expect some high profile functional studies of CHD7 in the not too distant future.

For functional point mutations, the natural place to look is at coding regions and splice junctions, as here we have the strongest models for ranking the likelihood that a mutation will have a biological effect. In the SCLC paper an effort was made to push this a bit further and look for mutations that might affect transcription factor binding sites. One candidate was found but not further explored.

In general, this last point underlines what I believe will be different about subsequent papers. Looking mostly at a single cancer sample, one is limited at one can be inferred. The mutational spectrum work is something which a single tumor can illustrate in detail, and such in depth analyses will probably be significant parts of the first tumor sequencing paper for each tumor type, particularly other types with strong environmental or genetic mutational components. But, in terms of learnign what make cancers tick and how we can interfere with that, the real need is to find recurrent targets of mutation. Various cancer genome centers have been promising a few hundred tumors sequenced over the next year. Already at the recent ASH meeting (which I did not attend), there were over a half dozen presentations or posters on whole genome or exome sequencing of leukemias, lymphomas and myelomas -- the first ripples of the tsunami to come. But, the raw cost of targeted sequencing remains at most a 10th of the cost of an entire genome. The complete set of mutations found in either one of these papers could have been packed onto a single oligo based capture scheme and certainly a high-priority subset could be amplified by PCR without breaking the bank on oligos. I would expect that in the near future tumor sequencing papers will check their mutations and rearrangements on validation panels of at least 50 and preferable hundreds of samples (though assembling such sample collections is definitely not trivial). This will allow the estimation of the population frequency of those mutations which may recur at the level of 5-10% or more. With luck, some of those will suggest pharmacologic interventions which can be tested for their ability to improve patients' lives.
Pleasance, E., Stephens, P., O’Meara, S., McBride, D., Meynert, A., Jones, D., Lin, M., Beare, D., Lau, K., Greenman, C., Varela, I., Nik-Zainal, S., Davies, H., Ordoñez, G., Mudie, L., Latimer, C., Edkins, S., Stebbings, L., Chen, L., Jia, M., Leroy, C., Marshall, J., Menzies, A., Butler, A., Teague, J., Mangion, J., Sun, Y., McLaughlin, S., Peckham, H., Tsung, E., Costa, G., Lee, C., Minna, J., Gazdar, A., Birney, E., Rhodes, M., McKernan, K., Stratton, M., Futreal, P., & Campbell, P. (2009). A small-cell lung cancer genome with complex signatures of tobacco exposure Nature DOI: 10.1038/nature08629

Pleasance, E., Cheetham, R., Stephens, P., McBride, D., Humphray, S., Greenman, C., Varela, I., Lin, M., Ordóñez, G., Bignell, G., Ye, K., Alipaz, J., Bauer, M., Beare, D., Butler, A., Carter, R., Chen, L., Cox, A., Edkins, S., Kokko-Gonzales, P., Gormley, N., Grocock, R., Haudenschild, C., Hims, M., James, T., Jia, M., Kingsbury, Z., Leroy, C., Marshall, J., Menzies, A., Mudie, L., Ning, Z., Royce, T., Schulz-Trieglaff, O., Spiridou, A., Stebbings, L., Szajkowski, L., Teague, J., Williamson, D., Chin, L., Ross, M., Campbell, P., Bentley, D., Futreal, P., & Stratton, M. (2009). A comprehensive catalogue of somatic mutations from a human cancer genome Nature DOI: 10.1038/nature08658

Monday, December 14, 2009

Panda Genome Published!

Posted by Picasa

Today's big genomics news is the advance publication in Nature of the giant panda (aka panda bear) genome sequence. For I'll be fighting someone (TNG) for my copy of Nature!

Pandas are the first bear (and alas, there is already someone making the mistaken claim otherwise in the Nature online comments) and only second member of Carnivora (after dog) with a draft sequence. Little in the genome sequence suggests that they have abandoned meat for a nearly all-plant diet, other than an apparent knockout of the taste receptor for glutamate, a key component of the taste of meat. So if you prepare bamboo for the pandas, don't bother with any MSG! But pandas do not appear to have acquired enzymes for attacking their bamboo, suggesting that their gut microflora do a lot of the work. So a panda microbiome metagenome project is clearly on the horizon. The sequence also greatly advances panda genetics: only 13 panda genes were previously sequenced.

The assembly is notable for being composed entirely of Solexa data using a mixture of library insert lengths. One issue touched on here (and I've seen commented on elsewhere) is that the longer mate pair libraries have serious chimaera issues and were not trusted to simply be fed into the assembly program, but were carefully added in a stepwise fashion (stepping up in library length) during later stages of assembly. It will be interesting to see what the Pacific Biosciences instrument can do in this regard -- instead trying to edit out the middle of large inserts by enzymatic and/or physical means, PacBio apparently has a "dark fill" procedure of pulsing unlabeled nucleotides. This leads to islands of sequence separated by signal gaps of known time, which can be be used to estimate distance. Presumably such an approach will not have chimaeras though the raw base error rate may be higher.

I'm quite confused by their Table 1, which shows the progress of their assembly as different data was added in. The confusing part is that it shows the progressive improvement in the N50 and N90 numbers with each step -- and then much worse numbers for the final assembly. The final N50 is 40Kb, which is substantially shorter than dog (close to 100Kb) but longer than platypus (13 kb). It strikes me that a useful additional statistic (or actually set of statistics) for a mammalian genome would be to calculste what fraction of core mammalian genes (which would have to be defined) are contained on a single contig (or for what fraction will you find at least 50% of the coding region in one contig).

While the greatest threat to panda's continuing existence in the wild is habitat destruction, it is heartening to find out that pandas have a high degree of genetic variability -- almost twice the heterozygosity of people. So there is apparently a lot of genetic diversity packed into the small panda population (around 1600 individuals, based on DNA sampling of scat)

BTW, no that is not the subject panda (Jingjing, who was the mascot for the Beijing Olympics) but rather my shot from our pilgrimage last summer to the San Diego Zoo. I think that is Gao Gao, but I'm not good about noting such things.

(update: forgot to put the Research Blogging bit in the post)
Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B., Bai, Y., Zhang, Z., Zhang, Y., Wang, W., Li, J., Wei, F., Li, H., Jian, M., Li, J., Zhang, Z., Nielsen, R., Li, D., Gu, W., Yang, Z., Xuan, Z., Ryder, O., Leung, F., Zhou, Y., Cao, J., Sun, X., Fu, Y., Fang, X., Guo, X., Wang, B., Hou, R., Shen, F., Mu, B., Ni, P., Lin, R., Qian, W., Wang, G., Yu, C., Nie, W., Wang, J., Wu, Z., Liang, H., Min, J., Wu, Q., Cheng, S., Ruan, J., Wang, M., Shi, Z., Wen, M., Liu, B., Ren, X., Zheng, H., Dong, D., Cook, K., Shan, G., Zhang, H., Kosiol, C., Xie, X., Lu, Z., Zheng, H., Li, Y., Steiner, C., Lam, T., Lin, S., Zhang, Q., Li, G., Tian, J., Gong, T., Liu, H., Zhang, D., Fang, L., Ye, C., Zhang, J., Hu, W., Xu, A., Ren, Y., Zhang, G., Bruford, M., Li, Q., Ma, L., Guo, Y., An, N., Hu, Y., Zheng, Y., Shi, Y., Li, Z., Liu, Q., Chen, Y., Zhao, J., Qu, N., Zhao, S., Tian, F., Wang, X., Wang, H., Xu, L., Liu, X., Vinar, T., Wang, Y., Lam, T., Yiu, S., Liu, S., Zhang, H., Li, D., Huang, Y., Wang, X., Yang, G., Jiang, Z., Wang, J., Qin, N., Li, L., Li, J., Bolund, L., Kristiansen, K., Wong, G., Olson, M., Zhang, X., Li, S., Yang, H., Wang, J., & Wang, J. (2009). The sequence and de novo assembly of the giant panda genome Nature DOI: 10.1038/nature08696