Monday, September 28, 2009

Locking in new functions

The September 24th Nature came in the mail today and as always with this journal (otherwise I wouldn't pay for it!) is full of interesting stuff. One paper of particular interest is a cool merger of evolution, computational biology, structural biology and protein engineering.

An interesting question in evolution is to what degree changes are reversible. In the simplest case, of purely neutral characteristics, the answer would seem to be largely that they are. However, even purely neutral changes are not all equally likely to revert. For example, since transversions (mutation of a pyrimidine to a purine or vice versa) are less common than transitions (purine->purine or pyrimidine->pyrimidine), a C->G mutation (transversion) is less likely to return to C than a C->T (transition). Similarly, if a C is methylated but that methylation serves no purpose, the methylation will favor conversion to a T, but the T has no such biochemical slanting to mutate to a C. But even these will be small changes.
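
The transition/transversion distinction is easy to make concrete. Here is a minimal Python sketch (the function name is my own invention, purely for illustration) that classifies a single-base substitution by whether it stays within one chemical class:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a valid substitution: {ref}->{alt}")
    # A transition stays within one class (purine or pyrimidine);
    # a transversion crosses from one class to the other.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


for ref, alt in [("C", "T"), ("C", "G"), ("A", "G"), ("A", "T")]:
    print(f"{ref}->{alt}: {classify_substitution(ref, alt)}")
```

The sketch says nothing about rates; it only labels which kind of change a given substitution (and its reversal) would be.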

But throw in some function, and the question gets more complicated. The case this paper addresses is a specific receptor, the glucocorticoid receptor. A previous paper by the group showed that the inferred ancestral form was promiscuous: it primarily bound some related steroids, but did have some affinity for glucocorticoids. This ancestral form existed in the last common ancestor of cartilaginous and bony fishes, but by the time of the last common ancestor of bony fishes and tetrapods (such as us) it had fixed a specificity for corticosteroids. These inferred ancestral receptors are referred to respectively as AncGR1 and AncGR2.

While there are 37 amino acid replacements between AncGR1 and AncGR2, it takes only two of these (group X) to switch the preference of AncGR1 to corticosteroids. The change is accomplished by substantially swinging a helix (helix 7) to a new position in the ligand binding pocket. Only three more substitutions (group Y) enforce specificity for corticosteroids; make all 5 of these changes and you convert a promiscuous receptor with weak activity towards corticosteroids to one activated only by them. But the interesting kicker is you can't make this second set of specificity-locking mutations until 2 other mutations (group Z) are made. The issue is that the first two X mutations cause a significant structural shift which is not entirely stable; without the stability of the group Z pair of mutations the group Y specificity trio can't be tolerated.

But, there's another kicker. If you engineer the AncGR2 protein back to having the ancestral states for groups X, Y and Z, the resulting protein is non-functional for any ligand. Something is going on somewhere in those other 30 changes. Some further phylogenetic filtering suggested 6 strong candidates, and the X-ray structure of the AncGR2 ligand binding domain was solved (though it turns out the prior homology model of this structure was apparently almost dead on). Five of the candidates (group W) turn out to either be in or to contact that swung helix 7. The structure of AncGR1 had been previously solved and a comparison of the AncGR1 and AncGR2 structures showed that the ancestral (AncGR1) forms at these 5 positions stabilize the ancestral position of helix 7 and the derived (AncGR2) amino acids at these positions actually clash with the AncGR1 positioning of helix 7. Synthesis of AncGR2 with the ancestral amino acids at groups X, Y, Z and W yielded a receptor whose specificity is very like AncGR1. One group W substitution had a strong enough effect it could imbue the ancestral phenotype even without the other group W changes but some of the other group W changes could be made only in pairs to show an effect. Finally, they built receptors with the ancestral state for various combinations of the X, Y and Z groups (e.g. Xyz -- AncGR2 for X but AncGR1-like at y and z) and found that any combination with xW is non-functional. AncGR2 with ancestral amino acids at x, y, z & w is not as good a receptor as AncGR1 -- suggesting that at least some of the remaining 25 positions contribute.

So, this is a well-detailed case where evolutionary change eventually blocked the route back to the start. A receptor which made the group X changes could still bind the original ligands but that would be lost once the group Y changes were layered on. Group Y changes were probably preceded by group Z changes which would have made reversion to the original binding specificity unlikely -- and the group W mutations really nail shut the door.

This particular system was a single polypeptide chain. But it is not difficult to see how the concept could extend to other biological systems. Co-evolution of interacting proteins, such as a protein and its receptor, or modification of a developmental system could similarly proceed in a stepwise fashion that ultimately prevents retreat. We are a bit lucky in this case that the evolutionary traces are all preserved where we can find them; it is not difficult to imagine a scenario where part of the ancestral form is lost from all extant lineages and therefore invisible to our current vision.
Bridgham JT, Ortlund EA, & Thornton JW (2009). An epistatic ratchet constrains the direction of glucocorticoid receptor evolution. Nature, 461 (7263), 515-9 PMID: 19779450


Friday, September 25, 2009

How many genomes did I just squash?

Yesterday was a good day for catching up on the literature; not only did I finally get around to the IL28B papers I blogged about yesterday, but I also took a run through the genome fusion paper which is being seen as the fitting marker of the end of the "Communicated by" mechanism of PNAS (sample coverage by In The Pipeline and Science, though the latter requires a subscription).

The paper, by Donald Williamson and communicated by Lynn Margulis, takes the position that "in animals that metamorphose, the basic types of larvae originated as adults of different lineages, i.e., larvae were transferred when, through hybridization, their genomes were acquired by distantly related animals". This is a whopper of a proposal and definitely interesting.

Margulis is famous for proposing the endosymbiont hypothesis to explain mitochondria and chloroplasts and other organelles. The gist of it is that some ancestral eukaryote took in a guest species and in the long run integrated it fully into its operations so that the two could not be separated. An important observation which this explained is the fact that mitochondria and chloroplasts have their own genomes, which encode (almost?) entirely for proteins and RNAs used in these structures. However, their genomes do not encode many of the proteins required -- indeed in metazoans such as ourselves only a tiny pittance of genes are encoded by the mitochondrial genome. A further observation which fits into this framework is the curious case of Cyanophora paradoxa, a photosynthetic organism whose chloroplast-like structure is surrounded by a rudimentary cell wall.

When I was an undergraduate, there was still significant controversy on the validity of the endosymbiont hypothesis. I remember this well, as I wrote a term paper on the subject. What really nailed it down was the careful comparison of gene trees in the cases where the same function is required both in the organelle and in the cytoplasm and both are nuclear encoded. In the vast majority of these cases, the two are evolutionarily distant from one another and in the case of chloroplasts the gene whose protein goes to the chloroplast looks more like homologs in cyanobacteria and the copy producing cytoplasmic protein looks more like homologs in non-photosynthetic eukaryotes. There are some fascinating exceptions, such as cases in which one gene does double duty -- via (for example) alternative splicing or promoters including or excluding the chloroplast targeting sequences.

Margulis and others have tried to extend this notion to other systems. There are definitely other successes -- unicellular organisms which appear to carry three genomes & the always challenging to classify Euglena, which appears to be a genome fusion. But there have also been some prominent non-successes, such as the eukaryotic flagellum/cilium. Also when I was an undergraduate a Cell paper made a big splash claiming to find a chromosome associated with the basal body, the organelle associated with flagellum synthesis. However, this work was never repeated and the publication of the Chlamydomonas genome failed to find such a chromosome.

After reading the paper at hand, I'm both confused and disappointed. The confusion is embarrassing, but the paper goes into a lot of detail on taxonomy and gross development of which I'm horribly ignorant. But, conversely the disappointment comes from what I do understand and how cursorily that is treated. And since it is the stuff I understand which is the route Williamson proposes to test his hypothesis, that is a big let down.

A key part that I do understand (minus a few terms I hadn't encountered before), with my emphasis:
Many corollaries of my hypothesis are testable. If insects acquired larvae by hybrid transfer, the total base pairs of DNA of exopterygote insects that lack larvae will be smaller than those of endopterygote (holometabolous) species that have both larvae and pupae. Genome sequences are known for the fruitfly, Drosophila melanogaster, the honeybee, Apis mellifera, the malarial mosquito, Anopheles gambiae, the red flour beetle, Tribolium castaneum, and the silkworm, Bombyx mori: holometabolous species, with marked metamorphoses. I predict that an earwigfly (Mecoptera Meropeidae), an earwig (Dermaptera), a cockroach (Dictyoptera), or a locust (Orthoptera) will have not necessarily fewer chromosomes but will have fewer base pairs of protein-coding chromosomal DNA than have these holometabolans. Also the genome of an onychophoran that resembles extant species will be found in insects with caterpillar or maggot-like larvae. Onychophoran genomes will be smaller than those of holometabolous insects. Urochordates, comprising tunicates and larvaceans, present a comparable case. Larvaceans are tadpoles throughout life. Garstang regarded larvaceans as persistent tunicate larvae, and, if so, their genomes would resemble those of tunicates. But if larvaceans provided the evolutionary source of marine tadpole larvae, their genomes would be smaller and included in those of adult tunicates. The genome of the larvacean Oikopleura dioica is about one-third that of the tunicate Ciona intestinalis, consistent with my thesis.

Williamson is obviously not an expert on genomics, but Margulis should have known better and pushed him to improve this section. In the "communicated by" path, the academy member can basically hand-pick the reviewers and is supposed to act as an editor would.

The first problem is a rather naive view of genome size and evolution. Genome sizes vary all over the map even within related species; Fugu to salmon is several fold as is fruit fly to malaria vector. The latter pair is particularly relevant since these are both dipteran insects, and therefore in the same bin by Williamson's standard (as stated in the quoted text). Now, that is overall genome size; if you restrict to protein coding regions these pairs are more similar, which leaves some wiggle room. But, by the same token the Oikopleura and Ciona genomes contain about the same number of genes (~15-16K).

But furthermore, his hypothesis should be quite testable right now, at least in a basic form. If a genome fusion occurred, then genes active in larval stages and genes active in the adult should show different gene trees if they are homologs. Given that there is a lot of data to annotate which Drosophila genes are active when, this should be a practical exercise. While I leave this as an exercise for the student, I would point out that it is already known that in Drosophila many proteins are active in both phases. This can probably also be tallied in some fashion. I'm guessing that the fraction of genes shared between stages will be quite large, which would not be very supportive of the fusion hypothesis.
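
To make the "tally" part concrete, here is a minimal Python sketch. It assumes you have already pulled, from whatever stage-specific expression annotation you trust, one set of gene identifiers active in larvae and one active in adults; the gene names below are placeholders I made up, not real Drosophila data.

```python
def stage_overlap(larval, adult):
    """Summarize how many genes are shared between two life stages."""
    larval, adult = set(larval), set(adult)
    shared = larval & adult
    union = larval | adult
    return {
        "larval_only": len(larval - adult),
        "adult_only": len(adult - larval),
        "shared": len(shared),
        # Fraction of all stage-active genes that are active in both stages.
        "shared_fraction": len(shared) / len(union) if union else 0.0,
    }


# Placeholder identifiers for illustration only.
larval_genes = {"geneA", "geneB", "geneC", "geneD"}
adult_genes = {"geneB", "geneC", "geneD", "geneE"}

summary = stage_overlap(larval_genes, adult_genes)
print(summary)
```

A large shared fraction on real data would be the sort of result that sits poorly with a two-genome origin, though the gene tree comparison remains the sharper test.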

Should a paper like this get into a journal such as PNAS? Given what I've written above, I think not, simply on its demerits. On the other hand, crazy hypotheses do need a place to go because they are sometimes the right hypotheses -- Margulis's formulation of the endosymbiont hypothesis had very tough sledding on its path to the textbooks. However, in the modern world there is a place for odd speculations and journeying outside your expertise. It's called a blog!
Williamson DI (2009). Caterpillars evolved from onychophorans by hybridogenesis. Proceedings of the National Academy of Sciences of the United States of America PMID: 19717430

Thursday, September 24, 2009

Unwarranted pessimism on IL28A/B & HCV?

I finally got around to reading the Nature News & Views article by Iadonato and Katze summarizing and opining on the recent quartet of papers linking genetic variation around IL28B and the response to standard therapy for Hepatitis C Virus. The N&V has at least one glaring flaw and also (IMHO) goes down the cliched route of concluding that the result will be clinically useless.

The four GWAS studies found the same cluster of SNPs around IL28B, nicely cross-validating the studies. One curious statement in the N&V is:
Although all of the identified variants in the three studies lie in or near the IL28B gene, none of them has an obvious effect on the function of this gene, which encodes interferon-λ3, a growth factor with similarities to the interferon-α preparations used as treatment.

Two of the papers provide direct evidence as to at least one effect of these SNPs; one showed that the SNPs are linked to the expression of both IL28B and the nearby related gene IL28A; the other looked only at IL28B. Lower expression of these loci was correlated with the genotype with worse prognosis.

The N&V goes on with some boilerplate pessimism about GWAS studies' impact on medicine:
The question remains, however, as to how readily these and other observations from GWAS can be translated into meaningful changes in patient care. The field of human genetics has described many associations between specific mutations and medically important outcomes, but rarely have these observations resulted in new therapies to treat disease or in major shifts in existing treatments. This failure is exemplified by the lack of clinical benefit that followed the cloning in 1989 of the gene responsible for cystic fibrosis — the first example of the use of molecular genetics to discover the cause of an otherwise poorly understood condition. Although some progress has been made in treating patients with cystic fibrosis, in the ensuing 20 years neither of the two newly approved drugs for this condition were developed using knowledge of the gene mutations that cause it. Apart from a few well-characterized beneficial mutations (for example, those resulting in resistance to HIV infection), genetics has been an inefficient tool for drug discovery.

So although these findings raise the tantalizing prospect of a more personalized approach to treating HCV by tailoring treatment to patients who are most likely to benefit, the reality is more sobering. Diagnostic testing to identify likely responders to interferon may be a future possibility, but clinical decision-making will be clouded by the fact that the effect of the advantageous variant is not absolute — not all carriers of the variant clear the virus, nor do all patients lacking the variant fail to benefit from treatment. Furthermore, there is currently no alternative to interferon therapy for the HCV-infected population.
They also pile on with graphs showing the exponential growth of Genbank and dbSNP vs. the flat numbers for INDs (new drugs into trials) and NMEs (new approvals).

Of course, I could respond with the boilerplate response (found in at least one of the papers) that patients with the "poor response" genotype could be steered toward alternative therapies as they emerge. And indeed, new HCV therapies are in the pipeline, perhaps most prominently a compound under development by Vertex. Understanding if these variants affect response to the new compounds now becomes an important research question.

But, it's also stunning that the N&V authors didn't suggest a rather obvious approach suggested by these papers. Not only do patients with the "high expression" genotype respond better to therapy, but this genotype also predicts spontaneous clearance of the virus. Furthermore, these loci encode secreted immune factors. So to me at least, this can be viewed as a classic protein replacement therapy candidate -- a subset of patients produce too little of a natural protein (or two natural proteins) and providing them with recombinant protein might provide therapeutic benefit. I suspect that whatever companies hold patent claims on IL28A & IL28B are contemplating just such a strategy. This is also in stark contrast to cystic fibrosis, where the affected protein is damaged rather than underexpressed and is a membrane protein not a secreted protein. By focusing on the general difficulty of converting genetic information to therapy rather than the specific circumstances of these papers, the N&V authors completely blew it.

The IL28A & IL28B loci produce proteins classified as interferons and it is another interferon (alpha) which is a key part of the standard therapy. A more extreme version (or a bit of the flip side) of the protein shortage theory would posit that the sum of the interferons is important for response -- and perhaps also for side effects. If this were the case, then increasing the dose of alpha interferon in the "low expression" genotype (or better yet, actually typing patients' white cells for expression of these proteins) might be a reasonable clinical approach. Given that interferon alpha is already approved, this is the sort of clinical experimentation that goes on all the time.

Yet another angle suggested by the "IL28A/B deficiency hypothesis" is that a viable therapeutic discovery approach is to find compounds which increase expression of IL28A and/or IL28B in leukocytes. This has been a successful strategy for generating new therapeutic hypotheses in oncology. Better yet, hints may already exist -- some enterprising student should search the Broad's Connectivity Map or other databases of expression data for cell lines treated with compounds to identify compounds which upregulate IL28A/B transcripts. A hit in such a search or a broader screen of already approved compounds could potentially rapidly lead to clinical experiments.
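
For that enterprising student, the core of such a search is just a ranking by fold change. Below is a minimal Python sketch under loud assumptions: the data structure, the compound names and every expression value are invented for illustration, and nothing here uses any real database's API. It assumes you have already exported treated and control expression levels for the two genes.

```python
import math


def rank_upregulators(expression, targets=("IL28A", "IL28B"), min_log2_fc=1.0):
    """Rank compounds by their best log2 fold change across the target genes.

    `expression` maps compound -> gene -> (treated_level, control_level).
    """
    hits = []
    for compound, genes in expression.items():
        log2_fcs = []
        for gene in targets:
            treated, control = genes[gene]
            log2_fcs.append(math.log2(treated / control))
        best = max(log2_fcs)
        # Keep only compounds that at least double one target (log2 FC >= 1).
        if best >= min_log2_fc:
            hits.append((compound, best))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Entirely made-up values: (treated, control) expression per gene.
toy_expression = {
    "compound_1": {"IL28A": (40.0, 10.0), "IL28B": (12.0, 10.0)},
    "compound_2": {"IL28A": (10.0, 10.0), "IL28B": (9.0, 10.0)},
    "compound_3": {"IL28A": (15.0, 10.0), "IL28B": (80.0, 10.0)},
}

hits = rank_upregulators(toy_expression)
for compound, log2_fc in hits:
    print(f"{compound}: best log2 fold change {log2_fc:.2f}")
```

A real search would also need replicates, a significance filter and a check that the cell lines involved express these genes at all; the sketch only shows the shape of the query.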

The one time I had an opportunity to write an N&V (as a grad student) I got writer's block and missed the boat. It will always irk me. But, perhaps it's better to blow a chance silently rather than write such an awful, unimaginative one which stuck to stock genomics negativity rather than creatively exploring the topic at hand.
Iadonato SP, & Katze MG (2009). Genomics: Hepatitis C virus gets personal. Nature, 461 (7262), 357-8 PMID: 19759611

Ge, D., Fellay, J., Thompson, A., Simon, J., Shianna, K., Urban, T., Heinzen, E., Qiu, P., Bertelsen, A., Muir, A., Sulkowski, M., McHutchison, J., & Goldstein, D. (2009). Genetic variation in IL28B predicts hepatitis C treatment-induced viral clearance Nature, 461 (7262), 399-401 DOI: 10.1038/nature08309

Thomas DL, Thio CL, Martin MP, Qi Y, Ge D, O'huigin C, Kidd J, Kidd K, Khakoo SI, Alexander G, Goedert JJ, Kirk GD, Donfield SM, Rosen HR, Tobler LH, Busch MP, McHutchison JG, Goldstein DB, & Carrington M (2009). Genetic variation in IL28B and spontaneous clearance of hepatitis C virus. Nature PMID: 19759533

the Hepatitis C Study, Suppiah V, Moldovan M, Ahlenstiel G, Berg T, Weltman M, Abate ML, Bassendine M, Spengler U, Dore GJ, Powell E, Riordan S, Sheridan D, Smedile A, Fragomeli V, Müller T, Bahlo M, Stewart GJ, Booth DR, & George J (2009). IL28B is associated with response to chronic hepatitis C interferon-alpha and ribavirin therapy. Nature genetics PMID: 19749758

Tanaka Y, Nishida N, Sugiyama M, Kurosaki M, Matsuura K, Sakamoto N, Nakagawa M, Korenaga M, Hino K, Hige S, Ito Y, Mita E, Tanaka E, Mochida S, Murawaki Y, Honda M, Sakai A, Hiasa Y, Nishiguchi S, Koike A, Sakaida I, Imamura M, Ito K, Yano K, Masaki N, Sugauchi F, Izumi N, Tokunaga K, & Mizokami M (2009). Genome-wide association of IL28B with response to pegylated interferon-alpha and ribavirin therapy for chronic hepatitis C. Nature genetics PMID: 19749757

(ugh: had a serious typo in the title on first posting; now fixed & revised)

Wednesday, September 23, 2009

CHI Next-Gen Conference, Day 3 (final)

Final day of conference, with some serious fatigue setting in (my hotel room was too close to, and faced, a highway. Doh!)

Discovered that I was indeed getting a reputation. Two people I met today asked about my recurrent interest in FFPE (Formalin Fixed, Paraffin Embedded) -- which is how most of the nucleic acids I want to work with are stored. FFPE is notoriously difficult for molecular studies, with the informational macromolecules having been chemically and physically abused in the fixation process, but it is also famously stable, preserving histological features for years.

RainDance sponsored the breakfast & announced that their maximum primer library size has gone up to 20K. To back up, RainDance uses microfluidics to create libraries of very tiny (single digit picoliter) droplets in which each droplet contains a primer pair. The precise volume control & normalization of the concentrations means that each primer droplet contains about the same number of oligos, which allows each droplet in a PCR to be run to completion -- meaning that efficient PCRs and inefficient ones in theory end up both having the same number of product molecules. Another set of droplets are created which contain your template DNA, and these are cleverly merged & the whole emulsion cycled. Break up the emulsion & you have lots of PCR amplicons ready to go into a fragmentation protocol. Their movies of droplets marching around, splitting, merging, etc. are dangerously mesmerizing!

Jin Billy Li of the Church group reviewed all the really cool stuff they've done using padlock probes (and confirmed that IP conflicts are retarding indefinitely any commercialization of these). A padlock probe is a long DNA which primes on both sides of a targeted region. Filling in the gap between the two priming ends & ligating yields a circle, which can be purified away from any uncircularized DNA and then amplified with universal primers. This turns the multiplex PCR problem into a very diverse set of uniplex PCRs. Various tweaks have substantially improved uniformity, though there is still room for improvement (but the same is true for the hybridization approaches).

Nicolas Bergman presented data on transcriptomic complexity in B. anthracis. I think most of this is published, but I hadn't seen it. A very striking result is that an awful lot (~88%) of transcripts in a supposedly uniform culture are present at much less than 1 copy per cell. He mentioned that small numbers of spores are seen in log cultures, and this might explain it. Also showed that many unannotated genes -- including some that had been truly UNannotated (originally annotated but then removed from the catalogs) -- are clearly transcribed. Operon structures could be worked out, with 90% matching computational predictions -- and in ~30 tested experimentally by RT-PCR there was 100% concordance.

Epicentre gave an overview of their clever system for fragmenting DNA upstream of either 454 or Illumina. By hijacking a transposase in a clever way, they not only break up the DNA but add on defined sequences. For 454 you then jam on the 454 primers & just get stuck reading 19nt of transposase each time; for Illumina you must use custom sequencing primers.

Eric Wommack & Shawn Polson of University of Delaware (Go Hens!) described work on metagenomics of bacteriophages in seawater. Here's a stunning estimate: if you lined all the world's phages end-to-end, they would stretch 60 light years. Also striking is the high level of bacteriophage-driven turnover of oceanic bacteria -- in about 1/2 to 2 days there is 100% turnover. This is a huge churn of the biochemical space.

Stacey Gabriel gave an update on the Broad's Cancer Genomics effort. Some whole genomes (25 tumor+normal pairs so far) and a lot of exonic sequencing. So far, not a lot of lightning though -- in one study the only thing popping out so far is p53, which is disappointing. Using the Agilent system (developed at the Broad), they can scan 20K genes in 1/2 an Illumina run, with 82% of their targeted sequences covered by at least 14 reads.

Matthew Ferber at the Mayo described trying to replace Sanger assays for inherited disorders with 454 and Illumina based approaches. He underscored that this isn't for research -- these are actual diagnostic tests used to determine treatments, such as prophylactic removal of the colon if inherited colon cancer is likely. Capture of the targets on the Nimblegen chips was done and the recovered DNA split to do 454 & Illumina sequencing in parallel. The two next gen approaches came close -- but neither found enough that it could be relied on. Also, some targets are just not recoverable by array capture and would need to be backstopped by something else. One caveat: older technology was used in both cases, so it may be that with longer read lengths on both platforms the higher coverage & higher mapping confidence needed would be obtained. On the other hand, some of the mutations were picked to be difficult for the platforms (small indel for Illumina, homopolymer run of >20 for 454) and might remain problems even with more coverage. PCR amplification in place of chip capture is another approach that might improve coverage and get some targets missed by the chip (this is certainly a claim RainDance made in their presentation).

The last talk I took notes on was by Michael Zody on signatures of domestication in chickens. If I had organized things, this would have been just before or after the phage talk! Alas, while the Rhode Island Red was amongst the lines sequenced (apropos the location) Blue Hens were missing -- how could that be? Seriously, the basic design was to sequence pools of DNA from either various domestic chicken lines or the Red Jungle Fowl (representing pre-domestication chicken). Some of these lines were commercial egg layer strains and others commercial broiler (meat) strains. He commented that this level of specialization occurred very recently (forgot to write down when, but I think it was around a century ago). Two other strains are interesting as they have been selected for about 50 years for one to be very heavy and the other lean -- apparently the heavy line will eat itself silly and the other nearly starves itself. 1 SOLiD slide on each of the 10 pools was used to call out SNPs and various strategies were used to filter out errors in the new data as well as variation due to errors in the reference sequence (in some cases, even typing the reference DNA to demonstrate the need for correction). Reduced heterozygosity was seen around BCDO2, which gives modern chickens their yellow skin (positive control) and also a bunch of other loci -- but those are still under wraps. They also looked for deletions in exons which appear to have been fixed in various lines, and found 1284 which are fixed in one or more domestic lines relative to the Red Jungle Fowl. One interesting one (which is present in the Red Jungle Fowl at low frequency) has gone homozygous (I think; my notes here show fatigue) in the high growth line but is either absent or heterozygous in the low growth line (terrible notes!). It's a 19kb deletion that clips out exons 2-5 (based on the human homolog; there isn't a good transcript sequenced for chicken) and RT-PCR confirms the gene is expressed in the hypothalamus, which has been previously implicated in controlling the feeding behavior.

I took almost no notes on the last talk, looking at dietary influences on gut microbiome (and also, regrettably, had to leave early to make sure I made school night) but it did feature some more "extreme genomics" -- microbiome studies on Burmese pythons!

One last thought: sequencing techs represented were either here-and-now (the players you can actually buy) or pretty-distant-future; absent were PacBio and Oxford Nanopore and the host of other companies (save NABSys) announced in the last 3-4 years in this space. Have the others just disappeared quietly or are they in stealth mode? It's hard to imagine the conference would have deliberately snubbed them, which would be a third possibility.

Tuesday, September 22, 2009

CHI Next-Gen Conference, Day 2

I'll confess that in the morning I took notes on only one talk, but the afternoon got back into gear.

The morning talk was by John Quackenbush over at Dana Farber Cancer Institute and covered a wide range of topics. Some was focused on various database approaches to tracking clinical samples but a lot of the talk was on microarrays. He described a new database his group has curated from the cancer microarray literature called GeneSigDb. He also described some work on inferring networks from such data & how it is very difficult to do with no prior knowledge, but with a little bit of network information entered in a lot of other interactions fall out which are known to be real. He also noted that if you look at the signatures collected in GeneSigDb, most human genes are in at least one -- suggesting either cancer affects a lot of genes (probable) and/or a lot of the microarray studies are noisy (certainly!). I did a similar curation at MLNM (whose results were donated to Science Commons when the group dissolved, though I think it never quite emerged from there) & saw the same pattern. I'd lean heavy on "bad microarray studies", as far too many studies on similar diseases come up with disjoint results, whereas there are a few patterns which show up in far too many results (suggesting, for example, that they are signatures of handling cells not signatures of disease). He also described some cool work initiated in another group but followed-up by his group of looking at trajectories of gene expression during the forced differentiation of a cell line. Using two agents that cause the same final differentiated state (DMSO & all-trans retinoic acid), the trajectories are quite different even with the same final state. Some talk at the end of attractors & such.

In the afternoon I slipped over to the "other conference" -- in theory there are two conferences with some joint sessions & a common vendor/poster area, but in reality there isn't much reason to hew to one or the other & good-sounding talks are split between them. I did, alas, accidentally stick myself with a lunch ticket for a talk on storage -- bleah! But, the afternoon was filled with talks on "next next" generation approaches, and despite (or perhaps because of, as the schedule had been cramped) two cancellations, it was a great session.

All but one of the talks at least mentioned nanopore approaches, which have been thought about for close to two decades now. Most of these had some flavor of science fiction to them in my mind, though I'll freely admit the possibility that this reflects more the limitations of my experience than wild claims by the speakers.

One point of (again, genteel) contention between the speakers was around readout technology, with one camp arguing that electrical methods are the way to go, because that is the most semiconductor-like (there is a bit of a cult worship of the semiconductor industry evident at the meeting). Another faction (well, one speaker) argues that optics is better because it can be more naturally multiplexed. Another speaker had no multiplexing in his talk, but that will be covered below.

Based on the cluster of questioners (including myself) afterwards, the NABSys talk by John Oliver had some of the strongest buzz. The speaker showed no data from actual reads and was circumspect about a lot of details, but some important ones emerged (at least for me; perhaps I'm the last to know). Their general scheme is to fragment DNA to ~150Kb (well, that's the plan -- so far they go only to 50Kb) and create 384 such pools of single-stranded DNA. Each pool is probed with a set of short (6-10 nt) oligonucleotide probes. Passing a DNA through a machined pore creates a distinct electrical signal for an aligned probe vs. a single stranded region. You can't tell which probe just rode through, but the claim is that by designing the pools carefully and comparing fingerprints you can infer a complete "map" and ultimately a sequence, with some classes of sequence which can't be resolved completely (such as long simple repeats). While no actual data was shown, in conversation the speaker indicated that they could do physical mapping right now, which I doubt is a big market but would be scientifically very valuable (and yes, I will get back to my series on physical maps & finish it up soon).

Oliver did have a neat trick for downplaying the existing players. It is his contention that any system that can't generate 10^20 bases per year isn't going to be a serious player in medical genomics. This huge figure is arrived at by multiplying the number of cancer patients in the developed world by 100 samples each and 20X coverage. The claim is that any existing player would need 10^8 sequencers to do this (Illumina is approaching 10^3 and SOLiD 10^2). I'm not sure I buy this argument -- there may be value in collecting so many samples per patient, but good luck doing it! It's also not clear that the marginal gain from the 11th sample is really very much (just to pick an arbitrary number). Shave a factor of 10 off there & increase the current platforms by a factor of 10 and, well, you're down to 10^6 sequencers. Hmm, that's still a lot. Anyway, only if the cost gets down to 10s of dollars could national health systems afford any such extravagance.
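To see what the 10^20 figure implies, here is a rough back-of-the-envelope sketch. The 3 billion base genome size is my assumption (it wasn't stated in the talk), and the patient count is simply backed out from the other numbers rather than taken from any source.

```python
# Back-of-the-envelope check of the 10^20 bases/year argument.
# GENOME_BASES is an assumption (approximate human genome size);
# the other constants are the figures quoted from the talk.
GENOME_BASES = 3e9
TARGET_BASES_PER_YEAR = 1e20
SAMPLES_PER_PATIENT = 100
COVERAGE = 20
SEQUENCERS_CLAIMED = 1e8

# Bases needed per patient at 100 samples x 20X coverage.
bases_per_patient = GENOME_BASES * SAMPLES_PER_PATIENT * COVERAGE

# Number of patients per year that the target figure implies.
implied_patients = TARGET_BASES_PER_YEAR / bases_per_patient

# Yearly output per instrument implied by the "10^8 sequencers" claim.
implied_output_per_sequencer = TARGET_BASES_PER_YEAR / SEQUENCERS_CLAIMED


def sequencers_needed(samples_factor=1.0, platform_factor=1.0):
    """Scale the claimed instrument count by fewer samples and faster platforms."""
    return SEQUENCERS_CLAIMED * samples_factor / platform_factor


if __name__ == "__main__":
    print(f"bases per patient:        {bases_per_patient:.2e}")
    print(f"implied patients/year:    {implied_patients:.2e}")
    print(f"implied bases/sequencer:  {implied_output_per_sequencer:.2e}")
    print(f"10x fewer samples, 10x faster: {sequencers_needed(0.1, 10):.2e} sequencers")
```

Under those assumptions the target works out to something over ten million patients a year, and the "shave a factor of 10 each way" adjustment still leaves about a million instruments.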

Another speaker, Derek Stein of Brown University (whose campus I stumbled on today whilst trying to go from my distant hotel to the conference on foot) gave an interesting talk on trying to marry nanopores to mass spec. The general concept is to run the DNA through the pore, break off each nucleobase on the other side & slurp that into the mass spec for readout. It's pretty amazing -- on one side of the membrane a liquid and on the other a vacuum! It's just beginning and a next step is to prove that each nucleotide gives a distinct signal. Of course, one possible benefit of this readout is that covalent epigenetic modifications will probably be directly readable -- unless, of course, the modified base has a mass too close to one of the other bases.

Another nanoporist, Amit Meller at Boston University, is back in the optical camp. The general idea here is for the nanopore to strip off probes from a specially modified template. The probes make a rapid fluorescent flash -- they are "molecular beacons" which are inactive when hybridized to template, become unquenched when they come off but then immediately fold onto themselves and quench again. Meller was the only nanopore artist to actually show a read -- 10nt!!! One quirk of the system is that a cyclic TypeIIS digestion & ligation process is used to substitute each base in the original template with 2 bases to give more room for the beacon probes. He seemed to think read lengths of 900 will be very doable and much longer possible.

One other nanopore talk was from Peiming Zhang at Arizona State, who is tackling the readout problem by having some clever molecular probes to interrogate the DNA after it exits the nanopore. He also touched on sequencing-by-hybridization & using atomic microscopy to try to read DNA.

The one non-nanopore talk is one I'm still wrestling with my reaction to. Xiaohua Huang at UCSC described creating a system that marries some of the best features of 454 with some of the features of the other sequencing-by-synthesis systems. His talk helped crystallize in my mind why 454 has such long read lengths but also is a laggard in density space. He attributed the long reads to the fact that 454 uses natural nucleotides rather than the various reversible terminator schemes. But, since pyrosequencing is real-time you get fast reads but the camera must always watch every bead on the plate. In contrast, the other systems can scan the camera across their flowcells, enabling one camera to image many more targets -- but the terminators don't always reverse successfully. His solution is to use 90% natural nucleotides and 10% labeled nucleotides -- but no terminators. After reading one nucleotide, the labels are stripped (he mentioned photobleaching, photolabile tags and chemical removal as all options he is working with) and the next nucleotide flowed in. It will have the same trouble with long mononucleotide repeats as 454 -- but also should have very long read lengths. He puts 1B beads on his plates -- and has some clever magnetic and electric field approaches to jiggle the beads around so that nearly every well gets a bead. In theory I think you could run his system on the Polonator, but he actually built his own instrument.

If I had to rate the approaches by which is most likely to start generating real sequence data, I'd vote for Huang -- but is that simply because it seems more conservative? NABSys talks like they are close to being able to do physical maps -- but will that be a dangerous detour? Or simply too financially uninteresting to attract their attention? The optically probed nanopores actually showed read data -- but what will the errors look like? Will the template expansion system cause new errors?

One minor peeve: pretty much universally, simulations look too much like real data and need more of a scarlet S on them. On the other hand, I probably should have a scarlet B on my forehead, since I've only once warned someone that I blog. One movie today of DNA traversing a nanopore looked very real, but was mentioned later to be simulated. Various other plots were not explained to be simulations until near the end of the presentation of that slide.

Monday, September 21, 2009

CHI Next-Gen Conference, Day 1

Interesting set of talks today. I never did explicitly check on the blogging policy, but given that the session chair kidded a speaker that I would be blogging her live, it wouldn't seem to be a problem. I would honor a ban (particularly since blogging is a bit hard to hide after the fact!), but quite a few folks were photographing slides despite an admonition not to (one person was clearly worried neither about being caught nor being courteous nor being clever; he had his flash on, which is clearly useless for projected images!).

The morning talks ended up as just a trio. The best of the three was Robert Cook-Deegan's talk "So my genome costs less than my bike, what's the big deal?". He obviously has more expensive tastes in bicycles than I do -- or knows a really cheap genome shop! He covered a lot of the ground around what sort of regulatory model will encompass personal genome sequencing. The U.S. weakly and Germany strongly have gone with the model that genome sequencing should be treated like a diagnostic with M.D.s as the absolute gatekeeper (a position which is rather vocally promoted by certain bloggers). Cook-Deegan pointed out something that increasingly worries me, which is that this locks genome sequencing into a very expensive cost model which doesn't improve with scale; you are locking in some very pricey labor that will only increase in price. Cook-Deegan also felt that M.D.s were being picked as the gatekeeper primarily because they are who the regulators are comfortable with historically, not because they are particularly well-trained for the job.

Jonathan Rothberg gave an entertaining talk on his various ventures, which built up to Ion Torrent; but where the audience expected the crescendo, there was instead a request for audience questions. Ion Torrent seems to be a company (Joule is another) which is still trying to be in the public eye without releasing any key information. Understandable, but frustrating.

Henry Erlich gave a nice presentation on using PCR amplification and 454 sequencing to do HLA typing for transplantation. All sorts of advantages to 454 over Sanger here, but cost will probably remain an issue and definitely corral this in very large centers (one 454 run, with multiplexing, can type ~20 samples).

Lunch was given over to IT stuff. CycleComputing presented their bioinformatics-friendly gateway to Amazon's cloud computing stuff (plus some benchmarking). I'll confess to checking email during the presentation on compressing data on servers; far too IT for me.

The afternoon was devoted to a series of presentations by the 6 next gen sequencing platforms with some flavor of being here-and-now: 454, SOLiD, GA2, Helicos, Dover (Polonator) & Complete Genomics. Actually, that was an interesting theme running through some talks, with Illumina saying "we're now gen, not next gen" whereas Complete Genomics calls themselves "third gen". The talks were all genteel but contained pokes at each other.

For example, 454 trumpeted a comparison of two unpublished cucumber genome sequences, one by Illumina+Sanger and one by 454. The 454 16X assembly had a contig N50 of 87Kb vs. 9Kb for a 50X Illumina assembly (no mention made of the amount of paired end data in either, I think -- though now I'm not sure). 454 also declared they've had one perfect read 997 long, though they were open that commercial runs near this are long in the future.
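For anyone not steeped in assembly jargon, N50 is the contig length at which the contigs that size or longer hold at least half of the assembled bases. A minimal sketch of the calculation follows; the function name and the toy lengths in the usage note are mine, not from either cucumber assembly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    The N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    # Only reached when no contigs were supplied.
    raise ValueError("need at least one contig")
```

For a toy assembly with contigs of 100, 80, 60, 40 and 20 bases, calling `n50([100, 80, 60, 40, 20])` walks down the sorted list until the running total passes half of 300. Note that N50 says nothing about whether the contigs are correct, only how contiguous they are.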

The SOLiD speaker emphasized all the different applications of their technology, using a published graphic that later turned out to have been commissioned by Helicos. Illumina's speaker emphasized the simpler sample prep over emulsion PCR systems (i.e. 454, SOLiD & Polonator).

Helicos promised even simpler sample prep and offered tantalizing hints of good stuff to come -- such as my nemesis of sequencing from FFPE slides. Helicos did detail their paired-end protocol, which is very clever (after reading a bunch of sequence, a set of timed extensions with all 4 nucleotides gives jumps of various distributions which are then followed by more reads). Clearly this will only work with single molecule sequencing, at least in that form (must ponder thought of how to either improve this or get it to work on Illumina-style platform). Helicos also tantalized with a bunch of data from different applications, suggesting that some more publications from this platform are imminent.

Danaher's talk was mostly on details of the instrument, which is the only one actually at the conference & is running. Always fascinated by moving machines, I watched it for a while -- and it demos very nicely, with the stage moving & illuminator flashing & filter wheel spinning. Polonator has very short reads compared to the other platforms, but is promising very low cost which could make it a contender.

Finally, Complete showed off their sequencing center approach. One striking fact is that their read lengths are actually extremely short -- but they extract a quartet of paired short reads. Clearly their recent announced delivery of genomes has improved their credibility & they also detailed some very neat medical genetics results which are presumably going to hit the journals very soon -- in which case they will have complete lab cred. It was pointed out in the discussion panel & in several talks that human sequencing is not the whole world, but even their competitors did not violently object (and therefore seemed to grudgingly acquiesce) that Complete may grab the lion's share of the human genome sequencing market, with the other players going after non-human sequencing or human areas like FFPE or transcriptome sequence where Complete isn't positioning themselves.

Sunday, September 20, 2009

Next-Generation Sequencing Conference, Providence RI, Day 0

I'm going to be at Cambridge HealthTech's Next-Gen Sequencing Conference in exotic Providence RI for the next few days.

I need to check on their blogging policy, though I'm guessing it is lenient. I probably won't try to write from within sessions but may try to skim some highlights at the end of the day.

Thursday, September 17, 2009

A genome too far?

I've crossed fruit flies, tomatoes and yeast, but I would clearly draw the line before taking on this project.

A genetic linkage map for the saltwater crocodile (Crocodylus porosus). BMC Genomics. 2009 Jul 29;10:339

Actually, the Wikipedia entry for the species is a bit reassuring. Don't believe those ridiculous stories of 21 footers -- the largest measured individual was only 20 feet. And only one or two fatal attacks occur each year, and they make headlines in the papers. I'll sleep like a rock tonight!

Tuesday, September 15, 2009

Industrial Protein Production: Further Thoughts

A question raised by a commenter on yesterday's piece about codon optimization is how critical is this for the typical molecular biologist? I think for the typical bench biologist who is expressing small numbers of distinct proteins each year, perhaps the answer is "more critical than you think, but not project threatening". That is, if you are expressing few proteins only rarely will you encounter show-stopping expression problems. That said, with enough molecular biologists expressing enough proteins, some of them will have awful problems expressing some protein of critical import.

But, consider another situation: the high-throughput protein production lab. These can be found in many contexts. Perhaps the proteins are in a structural proteomics pipeline or similar large scale structure determination effort. Perhaps the proteins are to feed into high-throughput screens. Perhaps they are themselves the products for customers or are going into a protein array or similar multi-protein product. Or perhaps you are trying to express multiple proteins simultaneously to build some interesting new biological circuit.

Now, in some cases a few proteins expressing poorly isn't a big deal. The numbers for the project have a certain amount of attrition baked in, or for something like structural proteomics you can let some other protein which did express jump ahead in the queue. However, even with this the extra time and expense of troubleshooting the problem proteins, which can (as suggested by the commenter) be as simple as running multiple batches or can be as complex as screening multiple expression systems and strains, is time and effort that must be accounted for. However, sometimes the protein will be on a critical path and that extra time messes up someone's project plan. Perhaps the protein is the actual human target of your drug or the critical homolog for a structure study. Another nightmare scenario is that the statistics don't average out; for some project you're faced with a jackpot of poor expressors.

This in the end is the huge advantage of predictability; the rarer the unusual events, the smoother a high-throughput pipeline runs and the more reliable its output. So, from this point of view the advantage of the new codon optimization work is not necessarily that you can get huge amounts of proteins, but rather that the unpredictability is ironed out.

But suppose you wanted to go further? Given the enormous space of useful & interesting proteins to express, there will probably be some that become the outliers to the new process. How could you go further?

One approach would be to further tune the tRNA system of E.coli (or any other expression host). For example, there are already special E.coli strains which express some of the extremely disfavored E.coli tRNAs, and these seem to help expression when you can't codon optimize. In theory, it should be possible to create an E.coli with completely balanced tRNA expression. One approach to this would be to analyze the promoters of the weak tRNAs and try to rev them up, mutagenizing them en masse with the MAGE technology published by the Church lab.

What else could you do? Expression strains carry all sorts of interesting mutations, often in things such as proteases which can chew up your protein product. There are, of course, all sorts of other standard cloning host mutations enhancing the stability of cloned inserts or providing useful other features. Other important modifications include such things as tightly controlled phage RNA polymerases locked into the host genome.

Another approach is the one commercialized by Scarab Genomics in which large chunks of E.coli have been tossed out. The logic behind this is that many of these deleted regions contain genetic elements which may interfere with stable cloning or genetic expression.

One challenge to the protein engineer or expressionist, however, is getting all the features they want in a single host strain. One strain may have desirable features X and Y but another Z. What is really needed is the technology to make any desirable combination of mutations and additions quickly and easily. The MAGE approach is one step in this direction but only addresses making small edits to a region.

One interesting use of MAGE would be to attempt to further optimize E.coli for high-level protein production. One approach would be to design a strain which already had some of the desired features. A further set of useful edits would be designed for the MAGE system. For a readout, I think GFP fused to something interesting would do -- but a set of such fusions would need to be ready to go. This is so evolved strains can quickly be counter-screened to assess how general an effect on protein production they have. If some of these tester plasmids had "poor" codon optimization schemes, then this would allow the tRNA improvement scheme described above to be implemented. Furthermore, it would be useful to have some of these tester constructs in compatible plasmid systems, so that two different test proteins (perhaps fused to different color variants of GFP) could be maintained simultaneously. This would be an even better way to initially screen for generality, and would provide the opportunity to perform the mirror-image screen for mutations which degrade foreign protein overexpression.

What would be targeted and how? The MAGE paper shows that ribosome binding sites can be a very productive way to tune expression, and so a simple approach would be for each targeted gene to have some strong RBS and weak RBS mutagenic oligos designed. For proteins thought to be very useful, MAGE oligos to tweak their promoters upwards would also be included. For proteins thought to be deleterious, complete nulls could be included via stop-codon introducing oligos. As far as the genes to target, the list could be quite large but would certainly include tRNAs, tRNA synthetases, all of the enzymes involved in the creation or consumption of amino acids, and amino acid transporters. The RpoS gene and its targets, which are involved in the response to starvation, are clear candidates as well. Ideally one would target every gene, but that isn't quite in the scope of feasibility yet.

The screen then is to mutagenize via MAGE and select either dual-high (both reporters enhanced in brightness) or dual-low expressors (both reduced in brightness) by cell sorting. After secondary screens, the evolved strains would be fully sequenced to identify the mutations introduced both by design and by chance. Dual-high screens would pull out mutations that enhance expression whereas dual-low would pull out the opposite. Ideally these would be complementary -- genes knocked down in one would have enhancing mutations in the other.

Some of the mutations, particularly spontaneous ones, might be "trivial" in that they simply affect copy number of the expression plasmid. However, even these might be new insights into E.coli biology. And if multiple strains emerged with distinct mutations, a new round of MAGE could be used to attempt to combine them and determine if there are additive effects (or interferences).

Monday, September 14, 2009

Codon Optimization is Not Bunk?

In a previous post I asked "Is Codon Optimization Bunk?", reflecting on a paper which showed that the typical rules for codon optimization appeared not to be highly predictive of the expression of GFP constructs. A paper released in PLoS One sheds new light on this question.

A quick review. To the first approximation, the genetic code consists of 3 nucleotide units called codons; there are 64 possible codons. Twenty amino acids plus stop are specified by these codons (again, 1st approximation). So, either a lot of codons are never used or at least some codons mean the same thing. In the coding system used by the vast majority of organisms, two amino acids are encoded with a single codon whereas all the others have 2, 3, 4 or 6 codons apiece (and stop gets 3). For amino acids with 2, 3 or 4 codons, it is the third position that makes the difference; for the three that have 6, they have one block of 4 which follows this pattern and one set of two which also differ from each other in the third position. For two amino acids with 6 codons (Leu and Arg), the two groups are next to each other so that you can think of the change between the blocks as a change in the first position; Ser is very strange in that the two blocks of codons are not terribly like each other. For amino acids with two codons, the 3rd position is either a purine (A,G) or pyrimidine (C,T). For a given amino acid, these codons are not used equally by a given organism; the pattern of bias in codon usage is quite distinct for an organism and its close cousins and this has wide effects on the genome (and vice versa). For example, in some Streptomyces I have the codon bias pattern pretty much memorized: use G or C in the third position and you'll almost always pick a frequent codon; use A or T and you've picked a rarity. Some other organisms skew much the reverse; they like A or T in the third position.
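If you'd rather check the bookkeeping than trust my memory, the standard genetic code is small enough to tabulate in a few lines. This is just a sketch using the usual compact encoding of the standard table (bases in T, C, A, G order, one-letter amino acid codes, `*` for stop); the variable names are my own.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional compact form: codons are
# enumerated with bases in T, C, A, G order (first position varying
# slowest) and each maps to a one-letter amino acid code; '*' is stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons specify each amino acid (or stop)?
codons_per_residue = Counter(CODON_TABLE.values())

single_codon = sorted(aa for aa, n in codons_per_residue.items() if n == 1)
six_codon = sorted(aa for aa, n in codons_per_residue.items() if n == 6)


def codons_for(residue):
    """List every codon that encodes the given one-letter residue."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == residue)


if __name__ == "__main__":
    print("one codon:", single_codon)
    print("six codons:", six_codon)
    for aa in six_codon:
        print(aa, codons_for(aa))
```

Printing the codons for the six-codon amino acids makes it easy to compare how the block of four and the block of two relate for each of them.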

Furthermore, within a species the genes can even be divided further. In E.coli, for example, there are roughly three classes of genes each with a distinctive codon usage signature. One class is rich in important proteins which the cell probably needs a lot of, the second class seems to have many proteins which see only a soupcon of expression and the third class is rich in proteins likely to have been recently acquired from other species.

So, it was natural to infer that this mattered for protein expression, particularly if you try to express a protein from one species in another. Some species seemed to care more than others. E.coli has a reputation for being finicky and had one of the best studied systems. Not only did changing the codon usage over to a more E.coli system seem to help some proteins, but truly rare codons (used less than 5% of the time, though that is an arbitrary threshold) could cause all sorts of trouble.

However, the question remained how to optimize. Given all those interchangeable codons, a synthetic gene could have many permutations. Several major camps emerged with many variants, particularly amongst the gene synthesis companies. One school of thought said "maximize, maximize, maximize" -- pick the most frequently used codons in the target species. A second school said "context matters" -- and went to maximize the codon pair usage. A third school said "match the source!", meaning make the codon usage of the new coding sequence in the new species resemble the codon usage of the old coding region in the old species. This hedged for possible requirements for rare codons to ensure proper folding. Yet another school (which I belonged to) urged "balance", and chose to make the new coding region resemble a "typical" target species gene by sampling the codons based on their frequencies, throwing out the truly rare ones. A logic here is that hammering the same codon -- and thereby the same tRNA -- over and over would make that codon as good as rare.
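The "balance" approach is simple enough to sketch. What follows is only an illustration of the idea, not anyone's production algorithm: the usage fractions in `TOY_USAGE` are made-up numbers for three amino acids (not real E.coli data), and the function name and the 5% cutoff default are my own choices.

```python
import random

# HYPOTHETICAL codon usage fractions, for illustration only.
# These are not measured E.coli frequencies.
TOY_USAGE = {
    "L": {"CTG": 0.50, "TTA": 0.13, "TTG": 0.13,
          "CTT": 0.10, "CTC": 0.10, "CTA": 0.04},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "M": {"ATG": 1.0},
}


def sample_codons(protein, usage, rare_cutoff=0.05, rng=None):
    """Back-translate a protein by sampling codons in proportion to usage.

    Codons whose usage fraction falls below rare_cutoff are discarded
    before sampling, so truly rare codons never appear in the output.
    """
    rng = rng or random.Random()
    codons = []
    for aa in protein:
        # Keep only codons at or above the rarity threshold.
        allowed = {c: f for c, f in usage[aa].items() if f >= rare_cutoff}
        choices = list(allowed)
        weights = [allowed[c] for c in choices]
        # random.choices samples proportionally to the given weights.
        codons.append(rng.choices(choices, weights=weights, k=1)[0])
    return "".join(codons)


if __name__ == "__main__":
    print(sample_codons("MKLLKL", TOY_USAGE, rng=random.Random(0)))
```

Contrast this with the "maximize" school, which would simply pick the single most frequent codon for every residue and so would hit the same tRNA every time a given amino acid comes up.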

The new work has some crumbs for many of these camps but not many; it suggests much was wrong with each -- or perhaps, the same thing was wrong with each. The problem is that even with these systems some proteins just didn't express well, leaving everyone scratching their heads. The GFP work seemed to suggest that the effects of codon usage were unpredictable if present, and in any case other factors, such as secondary structure near the ribosome, were what counted.

What the new work did is synthesize a modest number (40) of versions of two very different proteins (a single-chain antibody and an enzyme), each version specifying the same protein sequence but with a different set of codons. Within each type of protein, the expression varied over two logs; clearly something matters. Furthermore, they divided some of the best and worst expressors into thirds and made chimaeras, head of good and tail of bad (and vice versa). Some chimaeras seemed to have expression resembling their parent for the head end but others seemed to inherit from the tail end parent. So the GFP-based "ribosome binding site neighborhood secondary structure matters" hypothesis did not fare well with these tests.

After some computational slicing-and-dicing, what they did come up with is that codon usage matters. The twist is that it isn't matching the best used codons (CAI) that's important, as shown in the figure at the top which I'm fair-using. The codons that matter aren't necessarily the most used codons, but when cross-referenced with some data on which codons are most sensitive to starvation conditions the jackpot lights come on. When you use these as your guide, as shown below, the predictive ability is quite striking. In retrospect, this makes total sense: expressing a single protein at very high levels is probably going to deplete a number of amino acids. Indeed, this was the logic of the sampling approach. But, I don't believe any proponent of that approach ever predicted this.

Furthermore, not only does this work on the training set but new coding regions were prepared to test the model, and these new versions had expression levels consistent with the new model.

What of secondary structure near the ribosome? In some of the single-chain antibody constructs an effect could be seen, but it appears the codon usage effect is dominant. In conversations with the authors (more on this below), they mentioned that GFP is easy to code with secondary structure near the ribosome binding site; this is just an interesting interaction of the genetic code with the amino acid sequence of GFP. Since it is easy in this case to stumble on secondary structure, that effect shows up in that dataset.

This is all very interesting, but it is also practical. On the pure biology side, it does suggest that studying starvation is applicable to studying high level protein expression, which should enable further studies on this important problem. On the protein expression side, it suggests a new approach to optimizing expression of synthetic constructs. A catch however: this work was run by DNA2.0 and they have filed for patents and at least some of these patents have issued (e.g. US 7561972 and US 7561973). I mention this only to note that it is so and to give some starting points for reading further; clearly I have neither the expertise nor responsibility to interpret the legal meaning of patents.

Which brings us to one final note: this paper represents my first embargo! A representative of DNA2.0 contacted me back when my "bunk" post was written to mention that this work was going to emerge, and finally last week the curtain was lifted. Obviously they know how to keep a geek in suspense! They sent me the manuscript and engaged in a teleconference with the only proviso being that I continued to keep silent until the paper issued. I'm not sure I would have caught this paper otherwise, so I'm glad they alerted me; though clearly both the paper and this post are not bad press for DNA2.0. Good luck to them! Now that I'm on the other side of the fence, I'll buy my synthetic genes from anyone with a good price and a good design rationale.

Mark Welch, Sridhar Govindarajan, Jon E. Ness, Alan Villalobos, Austin Gurney, Jeremy Minshull, Claes Gustafsson (2009). Design Parameters to Control Synthetic Gene Expression in Escherichia coli. PLoS One, 4 (9). DOI: 10.1371/journal.pone.0007002

Wednesday, September 09, 2009

A Blight's Genome

Normally this time of year I would be watching the weather forecasts checking for the dreaded early frost which slays tomato plants, often followed by weeks of mild weather that could have permitted further growth. Alas, this year that will clearly not be the case. A wet growing season and commercial stock contaminated with spores has led to an epidemic of late blight, and my tomato plants (as shown, with the night photography accentuating the horror) are being slaughtered. This weekend I'll clear the whole mess out & for the next few years plant somewhere else.

Late blight is particularly horrid as it attacks both foliage and fruits -- many tomato diseases simply kill the foliage. What looked like promising green tomatoes a week ago are now disgusting brown blobs.

Late blight is caused by Phytophthora infestans, a fungus-like organism. An even more devastating historical manifestation of this ogre was the Great Irish Potato Famine, and it remains a scourge of potato farmers. Given my current difficulties with it, I was quite excited to see the publication of the Phytophthora infestans genome sequence (by Sanger) in Nature today.

A sizable chunk of the paper is devoted to the general structure of the genome, which tops out at 240Mb. Two related plant pathogens, P.sojae (soybean root rot) and P.ramorum (sudden oak death) come in only at 95Mb and 65Mb respectively. What accounts for the increase? While the genome does not seem to be duplicated as a whole, a number of gene families implicated in plant pathogenesis have been found to be expanded.

Also in great numbers are transposons. About a third of the genome is Gypsy-type retrotransposons. Several other classes of transposons are present also. In the end, just over a quarter (26%) of the genome is non-repetitive. While these transposons do not themselves appear to contain phytopathological genes, their presence appears to be driving expansion of some key families of such genes. Comparison of genomic scaffolds with the other two sequenced Phytophthora species shows striking overall conservation of conserved genes, but with local rearrangements and expansion of the zones between conserved genes (Figure 1 plus S18 and S19). Continuing evolutionary activity in this space is shown by the fact that some of these genes are apparently inactivated but have only small numbers of mutations, suggesting very recent conversion to pseudogenes. A transposon polymorphism was also found -- an insertion in one haplotype which is absent in another (figure S9).

A curious additional effect is shown off in two-D plots of 5' vs. 3' intergenic length (Figure 2). Overall this distribution is a huge blob, but some of the pathogenesis gene classes are clustered in the quadrant where both intergenic regions are large -- conversely many of the core genes are clustered in the graph in the "both small" quadrant. Supplemental Figure S2 shows rather strikingly how splayed the distribution is for P.infestans -- other genomes show much tighter distributions but P.infestans seems to have quite a few intergenic regions at about every possible scale.

The news item accompanying the paper puts some perspective on all this: P.infestans is armed with lots of anti-plant weapons which enable it to evolve evasions to plant resistance mechanisms. A quoted plant scientist offers a glum perspective:
"After taking 15 years to incorporate this resistance in a cultivar, it would take Phytophthora infestans only a couple of years to defeat it."
Chemical control of P.infestans reportedly works only before the infection is apparent and probably involves stuff I'd rather not play with.

A quick side trip to Wikipedia finds that the genus is a pack of blights. Indeed, Phytophthora is coined from the Greek for "plant destruction". Other horticultural curses from this genus include alder root rot, rhododendron root rot, cinnamon root rot (not of cinnamon, but rather various woody plants) and fruit rots in a wide variety of useful & yummy fruits including strawberries, cucumbers and coconuts. What an ugly family tree!

The Wikipedia entry also sheds light on why these awfuls are referred to as "fungus-like". While their life cycle and some of their morphology resemble those of fungi, their cell walls are mostly cellulose and molecular phylogenetics place them closer to plants than to fungi.

So, the P.infestans genome sequence sheds light on how this pathogen can shift its attacks quickly. Unfortunately, as with human genomic medicine, it will take a long time to figure out how to outsmart these assaults, particularly in a manner practical and safe for commercial growers and home gardeners alike.
BJ Haas et al (2009). Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature DOI: 10.1038/nature08358

Tuesday, September 08, 2009

Next-generation Physical Maps III: HAPPy Maps

A second paper which triggered my current physical map madness is a piece (open access!) arguing for the adaptation of HAPPy mapping to next-gen sequencing. This is intriguing in part because I see (and have) a need for cheap & facile access to the underlying technologies, but also because I think there are some interesting computational problems (not touched on in the paper, as I will elaborate below) and some additional uses for the general approach.

HAPPy mapping is a method developed by Paul Dear for physical mapping. The basic notion is that large DNA fragments (the size range determining several important parameters of the map; the maximum range between two markers is about 10 times the minimum resolution) are randomly gathered in pools which each contain approximately one half a genome equivalent. Practical issues limit the maximum fragment size to about 1Mb; any larger and they can't be handled in vitro. By typing markers across these pools, a map can be generated. If two markers are on different chromosomes or are farther apart than the DNA fragment size, then there will be no correlation between them. On the other hand, two markers which are very close together on a chromosome will tend to show up together in a pool. Traditionally, HAPPy pools have been typed by PCR assays designed to known sequences. One beauty of HAPPy mapping is that it is nearly universal; if you can extract high molecular weight DNA from an organism then a HAPPy map should be possible.
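To make the co-occurrence idea concrete, here is a toy simulation in Python. It is only a sketch under simplifying assumptions of my own: every fragment has the same fixed length, each pool gets one randomly phased set of fragments, and each fragment lands in a pool with probability 0.5. The names `simulate_pools` and `concordance` are hypothetical and do not come from any HAPPy software.

```python
import random

def simulate_pools(markers, frag_len, n_pools, keep_prob=0.5, seed=0):
    """Return, for each pool, the set of marker names present in it.

    markers: dict of marker name -> position (bp) on a single chromosome.
    Toy model: fixed-length fragments with a random breakpoint phase per pool;
    each fragment is kept in the pool with probability keep_prob.
    """
    rng = random.Random(seed)
    pools = []
    for _ in range(n_pools):
        offset = rng.uniform(0, frag_len)   # random phase of the breakpoints
        kept = {}                           # fragment id -> kept in this pool?
        present = set()
        for name, pos in markers.items():
            frag_id = int((pos + offset) // frag_len)
            if frag_id not in kept:
                kept[frag_id] = rng.random() < keep_prob
            if kept[frag_id]:
                present.add(name)
        pools.append(present)
    return pools

def concordance(pools, a, b):
    """Fraction of pools in which markers a and b are both present or both absent."""
    same = sum((a in pool) == (b in pool) for pool in pools)
    return same / len(pools)

# A and B are 50 kb apart; C is 4 Mb away, far beyond the 500 kb fragment size.
markers = {"A": 1_000_000, "B": 1_050_000, "C": 5_000_000}
pools = simulate_pools(markers, frag_len=500_000, n_pools=2000)
print("A-B concordance:", concordance(pools, "A", "B"))
print("A-C concordance:", concordance(pools, "A", "C"))
```

In this model the close pair should agree in nearly every pool, while the distant pair should agree only about half the time, which is what chance alone gives. Real data would have a spread of fragment sizes and typing errors on top of this.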

The next-gen version of this proposed by the authors would make HAPPy pools as before but then type them by sequence sampling the pools. Given that a HAPPy pool contains many orders of magnitude less DNA than current next-gen library protocols require, they propose using whole-genome amplification to boost the DNA. Then each pool would be converted to a bar-coded sequencing library. The final typing would be performed by incorporating these reads into a shotgun assembly and then scoring each contig as present or absent in a pool. Elegant!

When would this mapping occur? One suggestion is to first generate a rough assembly using standard shotgun sequencing, as this improves the estimate of the genome size which in turn enables the HAPPy pools to be optimally constructed so that any given fragment will be in 50% of the pools. Alternatively, if a good estimate of the genome size is known the HAPPy pools could potentially be the source of all of the shotgun data (this is hinted at).

One possible variation to this approach would be to replace bar-coded libraries and WGA with Helicos sequencing, which can theoretically work on very small amounts of DNA. Fragmenting such tiny amounts would be one challenge to be overcome, and of course Helicos generates much shorter, lower-quality reads than the other platforms. But, since these reads are primarily for building a physical map (or, in sequence terms, driving to larger supercontigs), that may not be fatal.

If going with one of the other next-gen platforms (as noted in the previous post in this series, perhaps microarrays make sense as a readout), there is the question of input DNA. For example, mammalian genomes range in size from 1.73pg to 8.40pg. A lot of next-gen library protocols seem to call for more like 1-10ug of DNA, or about 6 logs more. The HAPPy paper's authors suggest whole-genome amplification, which is reasonable but could potentially introduce bias. In particular, it could be problematic to allow reads from amplified DNA to be the primary or even a major source of reads for the assembly. As I've noted before, other approaches such as molecular inversion probes might be useful for low amounts, but have not been demonstrated to my knowledge with picograms of input DNA. However, today I stumbled on two papers, one from the Max Planck Institute and one from Stanford, which use digital PCR to quantitate next-gen libraries and assert that this can assist in successfully preparing libraries from tiny amounts of DNA. It may also be possible to deal with this issue by attempting more than the required number of libraries, determining which built successfully by digital PCR and then pooling a sufficient number of successful libraries.
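The "about 6 logs" gap is easy to check with a few lines of Python. This is just a back-of-envelope sketch: the 0.5 genome equivalents per pool is the approximate HAPPy pool size mentioned earlier, and the function names are mine.

```python
import math

PG_PER_UG = 1_000_000  # 1 microgram = 10^6 picograms

def pool_mass_pg(genome_pg, genome_equivalents):
    """Mass of DNA (pg) in a pool holding the given number of genome equivalents."""
    return genome_pg * genome_equivalents

def log10_shortfall(required_ug, available_pg):
    """Orders of magnitude between what a protocol asks for and what is on hand."""
    return math.log10(required_ug * PG_PER_UG / available_pg)

# Smallest and largest mammalian genome sizes quoted above, at ~0.5 genome per pool.
for genome_pg in (1.73, 8.40):
    mass = pool_mass_pg(genome_pg, 0.5)
    for required_ug in (1, 10):
        gap = log10_shortfall(required_ug, mass)
        print(f"{genome_pg} pg genome, {mass:.2f} pg pool, "
              f"{required_ug} ug protocol: {gap:.1f} logs short")
```

Because a HAPPy pool holds only a fraction of a genome, the shortfall for a real pool is a bit worse than the whole-genome comparison suggests.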

The desirable number of HAPPY libraries and the desired sequencing depth for each library are two topics not covered well in the paper, which is unfortunate. The number of libraries presumably affects both resolution and confidence in the map. Pretty much the entire coverage of this is the tail of one paragraph:
In the Illumina/Solexa system, DNA can be randomly sheared and amplified with primers that contain a 3 bp barcode. Using current instruments, reagents, and protocols, one Solexa "lane" generates ~120 Mb in ~3 million reads of ~40 bp. When each Solexa lane is multiplexed with 12 barcodes, for example, it will provide on average, ~10 Mb of sequence in ~250,000 reads for each sample. At this level of multiplexing, one Solexa instrument "run" (7 lanes plus control) would allow tag sequencing of 84 HAPPY samples. This means, one can finish 192 HAPPY samples in a maximum of three runs. New-generation sequencing combined with the barcode technique will produce innumerous amounts of sequences for assembly.
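The arithmetic in that passage can be reproduced in a few lines of Python. The inputs are the figures from the quote; the function name `multiplex_plan` is just something I made up for the sketch.

```python
import math

def multiplex_plan(lane_mb, lane_reads, barcodes, lanes_per_run, n_samples):
    """Per-sample yield and run count for barcoded pools sharing sequencing lanes."""
    samples_per_run = barcodes * lanes_per_run
    return {
        "mb_per_sample": lane_mb / barcodes,
        "reads_per_sample": lane_reads // barcodes,
        "samples_per_run": samples_per_run,
        "runs_needed": math.ceil(n_samples / samples_per_run),
    }

# ~120 Mb and ~3 million reads per lane, 12 barcodes, 7 usable lanes, 192 pools.
plan = multiplex_plan(lane_mb=120, lane_reads=3_000_000, barcodes=12,
                      lanes_per_run=7, n_samples=192)
for key, value in plan.items():
    print(key, value)
```

For scale, and assuming a 3 Gb genome (my number, not the paper's), a pool of half a genome equivalent holds about 1.5 Gb of DNA, so 10 Mb of sequence per pool is well under 0.01X coverage of that pool.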

Changing the marker system from direct testing of sequence-tagged sites by PCR to sequencing-based sampling has an important implication, as discussed in the last post. If your PCR is working well, then if a pool contains a target it will come up positive. But with the sequencing, there is a very real chance of not detecting a marker present in a pool. This probability will depend on the size of the target -- very large contigs will have very little chance of being missed, but as contigs go smaller their probability of being missed goes up. Furthermore, actual size won't be as important as the effective size: the amount of sequence which can be reliably aligned. In other words, two contigs might be the same length, but if one has a higher repeat content that contig will be less easily detectable. These parameters in turn can be estimated from the actual data.
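A rough way to put numbers on the false-negative risk is a Poisson model: if reads land uniformly at random on the DNA in a pool, the number of reads hitting a contig is roughly Poisson with a mean proportional to its effective size. The Python below is a sketch under that assumption only; it ignores read length, amplification bias and alignment error. The 250,000 reads per pool comes from the quoted passage, while the 1.5 Gb pool (half of an assumed 3 Gb genome) is my own illustrative number.

```python
import math

def p_missed(n_reads, effective_len_bp, pool_bp):
    """Chance that no read at all lands on a contig that is present in the pool."""
    expected_hits = n_reads * effective_len_bp / pool_bp
    return math.exp(-expected_hits)

def min_effective_len(n_reads, pool_bp, max_miss):
    """Smallest effective contig size (bp) whose miss probability stays below max_miss."""
    return -math.log(max_miss) * pool_bp / n_reads

n_reads = 250_000   # reads per pool, from the quoted multiplexing estimate
pool_bp = 1.5e9     # assumed: half of a 3 Gb genome

for size in (1_000, 10_000, 100_000):
    print(f"{size:>7} bp effective: P(missed) = {p_missed(n_reads, size, pool_bp):.3g}")

print("effective size needed for <1% misses:",
      round(min_effective_len(n_reads, pool_bp, 0.01)), "bp")
```

The second function gives one crude cutoff for which contigs could be typed with few false negatives at a given depth.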

The actual size of the pool is a critical parameter as well. So, the sampling depth (for a given haploid genome size) will determine

In any case, the problem of false negatives must be addressed. One approach is to only map contigs which are unlikely to have ever been missed. However, that means losing the ability to map smaller contigs. Presumably there are clever computational approaches to either impute missing data or simply deal with it.

It should also be noted that HAPPy maps, like many physical mapping techniques, are likely to yield long-range haplotype information. Hence, even after sequencing one individual the approach will retain utility. Indeed, this seems to be the tack that Complete Genomics is taking to obtain this information for human genomes, though they call it Long Fragment Reads. It is worth noting that the haplotyping application has one clear difference from straight HAPPy mapping. In HAPPy mapping, the optimal pool size is one in which any given genome fragment is expected to appear in half the pools, which means pools of about 0.7X genome. But for haplotyping (and for trying to count copy numbers and similar structural issues), it is desirable to have the pools much smaller, as this information can only be obtained if a given region of the genome is haploid in that pool. Ideally, this would mean each fragment in its own pool (library), but realistically this will mean as small a pool size as one can make and still cover the whole genome in the targeted number of pools. Genomes of higher ploidies, such as many crops which are tetraploid, hexaploid or even octaploid, would probably require more pools with lower genomic fractions in order to resolve haplotypes.
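The "about 0.7X" figure, and the reason haplotyping wants smaller pools, both fall out of a simple Poisson model. The sketch below assumes the number of fragments covering any one locus in a pool is Poisson-distributed with mean equal to the pool size in haploid genome equivalents. That is my simplification, not something taken from the paper.

```python
import math

def p_present(c):
    """Chance a locus is in a pool of c haploid genome equivalents."""
    return 1 - math.exp(-c)

def equivalents_for_presence(p):
    """Pool size (genome equivalents) at which a locus is present with probability p."""
    return -math.log(1 - p)

def p_single_copy_given_present(c):
    """Chance that a locus found in the pool is covered by exactly one fragment."""
    return c * math.exp(-c) / p_present(c)

print("pool size for 50% presence:", round(equivalents_for_presence(0.5), 3))
for c in (0.7, 0.3, 0.1):
    print(f"c = {c}: present {p_present(c):.2f}, "
          f"single copy when present {p_single_copy_given_present(c):.2f}")
```

Shrinking the pool makes a locus appear in fewer pools but makes it much more likely to be haploid in the pools where it does appear. That is the trade-off between mapping and haplotyping described above.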

In conclusion, HAPPy mapping comes close to my personal ideal of a purely in vitro mapping system which looks like a clever means of preparing next-gen libraries. The minimum and maximum distances resolvable are about 10-fold apart, so more than one set of HAPPy libraries is likely to be desirable for an organism. Typically this is two sizes, since the maximum fragment size is around 1Mb (and may be smaller from organisms with difficult-to-extract DNA). A key problem to resolve is that HAPPy pools contain single-digit picograms of DNA. Amplification is a potential solution but may introduce bias; clever library preparation (or screening) may be another approach. An open problem is the best depth of coverage of the multiplexed HAPPy next-gen libraries. HAPPy can be used both for physical mapping and long-range haplotyping, though the fraction of genome in a pool will differ for these different applications.

Jiang Z, Rokhsar DS, & Harland RM (2009). Old can be new again: HAPPY whole genome sequencing, mapping and assembly. International journal of biological sciences, 5 (4), 298-303 PMID: 19381348

Monday, September 07, 2009

Farewell to Summer

The autumnal equinox is still a few weeks away, but today marks the traditional end of the summer vacation season. How does a genomics geek dress for the beach? I present the KR's Career Collection, with sun protection from Infinity, a lovely Codon Devices beach bag and a towel from Millennium, with gorgeous Crane's Beach in Ipswich lending the background. This collection was more spontaneous than planned, so I didn't dig for one of my remaining Harvard T-shirts or the Centocor frisbee that's stashed somewhere.

I haven't yet worked for an Iguana Pharmaceuticals, so perhaps that is a portent of my future. And while I am quite happy in my job, if someone were to start a bioinformatics shop on St. Thomas, I would have to listen to the pitch...

Thursday, September 03, 2009

Physical Maps II: Reading the signposts

How do we build a physical map? In the abstract, a physical map is built by first dividing the genome up into either individual pieces or pools of pieces. These pieces need to be somewhere between large and gigantic; the size of the pieces determines the resolution of the map and its ability to span confusing or repetitive genomic regions. Ideally the pieces have a tight size distribution, but that isn't always the case. A set of known sequences is then typed against all these pools or pieces and that data fed into the appropriate algorithm to build a map.
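As a toy illustration of that last step, here is a Python sketch which takes typing results (which markers were seen in which pool) and chains the markers together by how often they co-occur. It is emphatically not a real mapping algorithm; production tools use likelihood-based scores and far more careful ordering. The data, the function names and the greedy strategy are all invented for the example.

```python
def association(pools, a, b):
    """Pools containing both markers divided by pools containing either one."""
    both = sum(1 for pool in pools if a in pool and b in pool)
    either = sum(1 for pool in pools if a in pool or b in pool)
    return both / either if either else 0.0

def greedy_order(pools, markers, start):
    """Order markers by repeatedly appending the unplaced marker most strongly
    associated with the current end of the chain. `start` should be a marker
    believed to sit at one end; ties are broken alphabetically."""
    order = [start]
    remaining = set(markers) - {start}
    while remaining:
        last = order[-1]
        best = max(sorted(remaining), key=lambda m: association(pools, last, m))
        order.append(best)
        remaining.remove(best)
    return order

# Invented typing data: each set lists the markers detected in one pool.
pools = [{"A", "B"}, {"A", "B"}, {"B", "C"}, {"C", "D"},
         {"C", "D"}, {"A"}, {"D"}, {"B", "C"}]
print(greedy_order(pools, ["C", "A", "D", "B"], start="A"))
```

With this made-up data the chain recovers the order A, B, C, D, because neighboring markers share pools and distant ones never do.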

Each physical mapping technology has its own parameters of the number of pieces or pools, whether they are pieces or pools (though of course we can always pool pieces, but it can be challenging to piece pools!), what contaminating DNA is definitively introduced by the mapping technology and other aspects. For some pool-based technologies, 100-200 pools (or more likely, 96, 192, 288 or 384) can generate a useful map. Technologies also differ in how universal they are; most are quite nearly so but some may be limited to certain neighborhoods of biological space.


If next-generation sequencing is to worm its way into this process, clearly it is at the point of reading these known sequences. However, it is also important to not just assume that next gen should worm its way in; it must be better than the alternative(s) to do that. Right now, the king of the hill is microarrays.

Several microarray platforms enable designing and building a custom array of quite high density (10's or 100's of thousands of spots easily -- and perhaps even millions) and typing them on DNA pools for perhaps $100-500 per sample. Using a two-color scheme allows two samples to be typed per array, compressing costs. Alternatively, certain other formats would allow getting long-range haplotype information but at one chip per sample. This also points out one possible reason to pick another method: it could get substantially more information, and that information might be worth some additional cost.

Remember, the estimate for constructing a physical map for a mammalian species is on the order of $100K, and this is for a method (radiation hybrid) which appears to need approximately 100 pools for a decent map. So, given the costs estimated above, typing the pools on arrays would cost perhaps as low as $5000 ($100/sample pair) to $50,000 -- quite a range. Any next-gen approach needs to come in at that price or lower for typing the pools -- unless it somehow delivers some really valuable information OR is the major route of acquiring sequence (in which case the physical map cost is folded into the sequencing cost). That's unlikely to be a wise strategy, but I'll cover that in another post dealing with a technology that is tempting in that direction.
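The two ends of that range come from simple multiplication, sketched here in Python. The function name is mine; the inputs are the figures above: 100 pools, $100 per array with two samples sharing it at the cheap end, and $500 per array with one sample each at the expensive end.

```python
import math

def typing_cost(n_pools, cost_per_array, samples_per_array=1):
    """Total array cost to type every pool once."""
    arrays_needed = math.ceil(n_pools / samples_per_array)
    return arrays_needed * cost_per_array

low = typing_cost(n_pools=100, cost_per_array=100, samples_per_array=2)
high = typing_cost(n_pools=100, cost_per_array=500, samples_per_array=1)
print(f"array typing for 100 pools: ${low:,} to ${high:,}")
```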

Arrays have a bunch of other advantages. If you have contaminating DNA -- and know its sequence -- you can use that to design array probes that won't hybridize to the contaminant. In some cases that may blind the design to some important genes, but often it won't be a problem. Probes can also be chosen very deliberately; there's no element of chance here. Probes can also be chosen almost regardless of the size of a sequence island, so even very tiny islands in a draft assembly can be placed on the physical map.

Shotgun Sequencing

In contrast, another approach would be to use shotgun libraries for the mapping. Each library would represent a different pool, and to keep cost from mushrooming a high degree of multiplexing will be required. Having the ability to cheaply prepare and multiplex 10s or 100s of libraries together will be a generally useful technology. But, a straight shotgun approach has all sorts of drawbacks versus arrays.

First, any contaminant will probably be sequenced also, reducing the yield. For some mapping strategies (e.g. BACs), this might be 5-10% "overburden". But for radiation hybrid maps, the number might be more like 60%. Furthermore, repetitive sequences that can't be uniquely mapped will further reduce the amount of useful data from a mapping run.
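To see what that overburden does to a sequencing budget, here is a small Python sketch. The contaminant fractions are the rough figures above; the unmappable-repeat fraction is a parameter I have added, with a purely hypothetical value in the last example.

```python
def useful_fraction(contaminant_fraction, unmappable_fraction=0.0):
    """Share of reads that are both on-target and uniquely mappable."""
    return (1 - contaminant_fraction) * (1 - unmappable_fraction)

def reads_required(useful_reads_needed, contaminant_fraction, unmappable_fraction=0.0):
    """Raw reads to sequence in order to end up with the desired useful reads."""
    return useful_reads_needed / useful_fraction(contaminant_fraction, unmappable_fraction)

for label, contaminant in (("BAC-like, 10% overburden", 0.10),
                           ("radiation hybrid, 60% overburden", 0.60)):
    raw = reads_required(100_000, contaminant)
    print(f"{label}: about {raw:,.0f} raw reads per 100,000 useful reads")

# Hypothetical: 60% contaminant plus 30% of the remaining reads falling in repeats.
print(f"with repeats: about {reads_required(100_000, 0.60, 0.30):,.0f} raw reads")
```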

Second, shotgun sequencing means sampling the available sequence pool, which for mapping means the probability of seeing a contig if it is present in a pool is dependent on the size of the contig. More precisely, it is dependent on the effective size of the contig, the amount of uniquely mappable sequence which can be derived from it. Since most mapping techniques use both the presence and absence of a landmark in a pool to derive information, this will mean that only contigs above a certain effective size can be confidently mapped; very small contigs will be frequently incorrectly scored as negative for a pool.

Targeted Sequencing

The other option is targeted sequencing. Again, having the ability to cheaply generate multiplex tagged targeted subsets will be very valuable. Many targeted strategies have been proposed, but I think there are two which might be useful in this context.

The first is the sort of padlock probe (aka molecular inversion probe) design which has shown up in a number of papers. These work with exquisitely small quantities of DNA -- though (as will come up in a later post) improving their sensitivity by 1000X would be really useful. Cost of fabrication is still an issue; they can be built on microarrays but require downstream amplification. The design issues are apparently being worked out; early papers had very high failure rates (many probes were never seen in the results).

The other technology is microarray capture. However, this can't compete on cost unless a lot of libraries can go against the same array. In other words, there would be a need to pool the multiplexed libraries prior to selection -- another cool capability that would find many uses but has not yet been demonstrated.


Could any of these next-gen approaches compete with direct hybridization of labeled pools to microarrays? Given the wide (10X) swing in costs I estimated above, it's hard to be sure but I am wondering if perhaps for many mapping approaches microarray hybridization may have some longevity as the landmark reading method. However, where either the economics become very different or next-gen can extract additional information, that is where sequencing may have an edge. This is a useful way to focus thoughts as we proceed through the different mapping approaches.

Tuesday, September 01, 2009

Physical Maps: Part I of a series

A recent review/opinion paper I saw has led me to attempt to write something substantial about a topic I've generally given short shrift to: physical mapping. Given that fact, I won't be surprised if some of what I say will lead to corrections, which are always welcome, as with the recent bits on genome assembly which I posted here.

A physical map is an ordering of markers along a stretch of DNA, often with some concept of distance between the markers. The most useful physical maps link these markers to unambiguous islands of sequence. Common examples of physical maps include restriction maps (where the landmarks are sites for one or more restriction enzymes) and cytogenetic maps (where the sequence landmarks are placed in relation to the pattern of bands visible in stained DNA).

When I was in the Church lab, there just wasn't much talk about physical maps. Our goal was to do it all by sequencing. In a sense, a complete genome sequence is the ultimate physical map and we really didn't want to bother with the less useful versions. We did perhaps find curious some of the "map into the ground" genome sequencing strategies presented at conferences. If I remember correctly, one of these was going to first generate a map at a BAC level, then break each BAC into cosmids & map those within the BAC, then break those into lambda clones and map them, then break those up into M13-size clones (or use transposon insertions) and map those. The goal was to ultimately plan the absolute minimum number of sequencing reads to get the job done. George's group would rather just make reads too cheap to worry about such stuff, and of course in the end that viewpoint won.

The catch, as the review I saw by Stephen J. O'Brien and colleagues points out, is that we still aren't at the stage where we can just take some species of interest, extract the DNA, run it through the sequencers and get a finished genome. Good physical maps help assemble all the little islands of sequence into a larger whole & help find errors in the assembly, particularly when dealing with repetitive sequences. The review cites the example of platypus (discussed in a previous post and follow-up). It is a bit curious that it describes this beast as lacking a physical map, as a BAC map was generated as part of the sequencing effort. In any case, the platypus project yielded an assembly which O'Brien and colleagues deem too sketchy at long ranges to reliably make evolutionary inferences.

Now some of your level of concern for such matters depends on what scale of genetic organization you are interested in. My personal interest has often tended to be on the level of individual genes (in eukaryotes; a bunch of my thesis work looked at operon-level stuff in bacteria) but O'Brien has a long-standing interest in the evolution of genomes. If you want to understand things on that scale, you need a good map. Very detailed maps can also be used to either check for or prevent errors in sequence assembly.

There's also the question of really difficult genomes. Many valuable crop species are hexaploid or octaploid, which presumably will make shotgun assembly even more challenging than with diploid genomes. Other interesting genomes resulted from recent duplication events or genome fusions. For example, common tobacco is an evolutionarily recent fusion of two other tobacco species.

A key question is how (or whether) to yank physical mapping into the world of next-generation sequencing. The review estimates the cost of physical mapping a mammalian genome at around $100K. With the cost of getting a shotgun sequence of a similar genome quickly heading to the $1-10K region, it will be an expensive upgrade to have a physical map. It's also not where the interest (or money) seems to be going in terms of technology development. Now, one could try to wait for promised super-long-read technologies to really show up, but that could be a while and even these may have reads in tens of kilobases, whereas some segmental duplications may be much larger. Any proposed next-gen rejiggering of a physical mapping technique would need to come in under $100K. Alas, in many cases I don't know how to estimate some of the costs but I will try to when I can.

Paired-end & mate-paired reads provide one approach to generating a map, and I think there are some protocols in which the pairs . That's a great start, but what folks like O'Brien are looking for are more on the scale of many tens or even hundreds of kilobases or even megabases.

It's worth noting a likely change in physical mapping. In the past, it was considered a prerequisite for sequencing and was much cheaper. Now, a detailed physical map may be viewed as a desirable add-on, an option for improving a less-successful assembly or (as we shall see) an integrated part of the genome assembly process.

In the next installment, I'll take a side look at some general issues with using next-gen sequencing for physical map construction and the alternative approach of microarrays. The current plan is to follow that with a look at a proposed approach called HAPPy mapping which is elegantly attuned to the next-gen world & also offer some personal variants on the approach. The entry after that will look at Radiation Hybrid maps and then I'll tackle clone (BAC & fosmid) maps and then cytogenetic maps. Along the way, I'll throw out some crazy ideas and also try to identify ways in which these strategies might leverage other trends in the field and/or provide more impetus to develop some useful capabilities. Well, that's the plan right now -- but since I'm drafting about one entry ahead of the one posted, that outline is subject to change.