Tuesday, March 24, 2009

Codon's Type IIS Meganuclease

When I joined Codon Devices, I swore I would not use this space to shamelessly tout any results from the company. It turned out my resolve was never tested. It's not that there weren't interesting results being generated in the company, but that in one way or another they never became public. Some results were never meant to be public, but were within collaborations, whereas some intended to be public got held up by one snag or another.

Perhaps the universe does like to play subtle jokes on us. Now that I'm out, so is the first publication from the company, describing the engineering of a Type IIS restriction enzyme with a very large recognition sequence.

Type IIS restriction endonucleases are handy for many purposes, but particularly for gene construction techniques. Whereas most restriction enzymes recognize and cut at the same site, Type IIS enzymes recognize a specific site but then cut a precise distance away (or cut at perhaps two different offsets; note Fig 2 of this reference). This is handy because it allows one to design two pieces to come together (via the sticky overhangs generated by the enzyme) but without the recognition sequence in the final product. Hence, Type IIS enzymes can allow virtually any sequence to be built.

The catch, of course, is that it is challenging to build in this fashion a sequence which itself contains the Type IIS recognition sequence. Ideally, these recognition sequences would be very long and hence unlikely to appear by chance. Unfortunately, the known Type IIS enzymes almost all have recognition sequences 5 or 6 basepairs long, which are not terribly rare once you get into the multiple kilobase range, and are certainly not rare if you want to build chromosome-sized DNA.
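As a back-of-envelope sketch of that rarity argument: if one assumes a random sequence with all four bases equally likely (real DNA is only roughly like this) and counts both strands, a fixed site of length n is expected to turn up by chance about 2(L - n + 1)/4^n times in L basepairs. The function name and the example lengths below are my own illustrative choices, not anything from the paper.

```python
def expected_sites(seq_len: int, site_len: int, both_strands: bool = True) -> float:
    """Expected chance occurrences of one fixed site in a random sequence.

    Assumes independent positions with A, C, G and T each at 25%,
    which real genomes only approximate.
    """
    if seq_len < site_len:
        return 0.0
    windows = seq_len - site_len + 1       # possible start positions
    per_window = 0.25 ** site_len          # chance a given window matches
    strands = 2 if both_strands else 1     # a non-palindromic site can sit on either strand
    return strands * windows * per_window


if __name__ == "__main__":
    # 5 and 6 bp are typical Type IIS sites; 12 bp is the effective
    # length quoted for homing endonucleases.
    for site_len in (5, 6, 12):
        for seq_len in (5_000, 5_000_000):
            n = expected_sites(seq_len, site_len)
            print(f"{site_len:>2} bp site in {seq_len:>9,} bp: {n:.4g} expected")
```

Under these assumptions a 5 bp site is expected nearly ten times in a 5 kb construct, while a 12 bp site is expected far less than once.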

So the goal of a number of efforts has been to build a Type IIS restriction enzyme which has a very long recognition sequence. Enzymes called homing endonucleases have huge recognition sequences, with effective lengths of 12 or more basepairs (the actual lengths are greater, but there are also some positions which are not fully fixed to a particular nucleotide -- hence the term effective length). The advance of Lippow et al is that a new level of precision was obtained in the cutting sites, a level of precision compatible with gene engineering.

In a sense, the problem is analogous to that of a K9 unit. The handler has a potentially vicious dog which she would like to apply precisely. Give the dog too short a leash and you can't deploy its teeth; give it too long a leash and the teeth may sink into places other than where you want them to.

So what Lippow et al did is build different protein linkers to tie the DNA recognition domain (handler) to the cleavage domain (dog) from the Type IIS enzyme FokI. By run-off Sanger sequencing, in which the polymerase is allowed to extend to the end of a DNA strand, they showed that cutting is precise, particularly for one of the specific enzymes generated. The dog, alas, is not under complete control; some random off-site cutting is observed. But it is a step forward.

One last hitch: to be particularly useful, one really needs at least two Type IIS meganucleases, and ideally many. Alas, this paper provides only one -- but it is a roadmap to building more, as there are a number of other homing endonucleases which could potentially be used as recognition modules. Alternatively, a number of papers have generated I-SceI variants with different recognition specificities, so introducing these mutations into the CdnI enzyme reported here should allow a new set of Type IIS meganuclease specificities.

Monday, March 23, 2009

JAK2 haplotype promotes JAK2 mutation

An interesting trio of abstracts (Kilpivaara et al, Jones et al, Olcaydu et al) has appeared on the Nature Genetics Advance Online Publications site (alas, I don't have fulltext access without traipsing in to MIT or Harvard to use the library, but more on that soon).

JAK2 (Janus Kinase 2) is a protein kinase important in hematopoietic cell function, and a particular mutation was shown several years ago to result in several distinct but related myeloproliferative disorders.

In these papers, particular haplotypes (given only the abstracts, it's impossible to determine if there is complete agreement on which ones) lead to a higher risk of the disease-causing V617F mutation. What is quite striking is that the mutation occurs in cis to the haplotype, that is to say the same chromosome with the haplotype tends to be the one bearing the mutation.

The explanation favored by the papers appears to be that the haplotype somehow creates a favorable DNA context for causing the mutation. If the mutations showed up in trans (on the other chromosome) just as often, one might contemplate a mechanism whereby the haplotype somehow increases the selective advantage of V617F -- perhaps, for example, by causing incorrect JAK2 expression.

It will be fascinating to see this story play out -- of what DNA mutational or repair mechanism does the haplotype shift the balance? And, now that this is precedented, you can be sure there will be a lot of searching for other examples. A quick screen would be to look for mutational haplotypes which contain known oncogenic mutations, and then go screening somatic samples for those haplotypes. Of course, with sequencing getting so cheap, the not too distant future will have lots of paired normal and tumor complete genomes to compare.

Friday, March 20, 2009

TGA Codon


Well, after a few successful readthroughs I've been hit with my career's Release Factor again.

Looking on the bright side, this will give me time & focus to write here and to tackle two invited articles.

I'm also entertaining short-term consulting gigs in the Boston area (or, with travel expenses included, in cities with resident Ailuropoda melanoleuca :-) But that's just a stop-gap; what I'd really like is a permanent position to again tackle interesting scientific questions at the interface between biology and computing.

Wednesday, March 18, 2009

One helix to teach them all, and in the taxonomy bind them?

I originally saw this last summer in some free tourist guide, and neglected to write on it, but a little googling verified my memory. There is a game show on one of the channels now called "Are you smarter than a 5th grader", in which adults go up against 5th graders in a quiz show format, with the questions supposedly representative of that level of elementary school. When I saw this particular item, my eyes rolled at first but then I pondered some more -- and realized that while I'd probably stick to my original position, it is a bit more nuanced than my first reaction.

Name 3 of the 5 kingdoms.


Okay, this was enough to generate an autonomic response. Back in high school we probably spent a good chunk of a class going over various Kingdom proposals. I don't have that textbook, but one of a similar stratum would be my freshman bio textbook, Biological Science by Keeton & Gould, 4th Edition. K&G (p.1019) outlines eight different kingdom systems, ranging from 2 to 8 kingdoms.

Now, of course, one must ask what exactly is a kingdom? Ideally a kingdom would consist of a bunch of organisms with a common theme (which wouldn't be simply the lack of all the themes of other kingdoms), all organisms with that theme would be in the kingdom, and no extant organism outside that kingdom would trace its ancestry to a member of that kingdom. At least, off the cuff, that is the definition I would give.

So which one induced a reflex? It is the five kingdom system: Plants, Fungi, Protists, Animals & Monera, which it turns out is the one Keeton & Gould used for organizing their survey of the living world.

Now, it isn't an awful system, particularly back in the late '80s when I had it. Monera are all the single-celled thingies which lack a nucleus. Eukaryotes are what we know best, so they are subdivided into single celled (Protists), multi-cellular with cell walls & photosynthesis (Plants), multi-cellular, with cell walls but never photosynthetic (Fungi) and multi-cellular with no walls (Animals).

In that era, issues with these groupings were certainly recognized and taught. Yeasts clearly were related to Fungi, so they went there despite unicellularity. Some plants lack photosynthesis (e.g. dodder), but clearly this is a late loss and they belong in Plants. Protists is a handy way to lasso all sorts of traditional problems such as Euglena, which both photosynthesizes and moves.

But, what was just emerging when I was taught these things, but is now quite evident, is that the non-nucleated world is really two worlds, Eubacteria and Archaea. While they share many similarities (such as mostly circular chromosomes), they are very, very different in other fundamental cellular processes, such as RNA transcription. Plus, now we have DNA & RNA phylogenetic methods which show them to have diverged very long ago.

There are other issues DNA methods have illuminated. Protists are not an evolutionarily coherent group but are instead a mishmash of various lineages ("polyphyletic"). Eukaryotes as a whole don't fit a simple tree lineage, due to multiple endosymbiont captures resulting in organelles such as mitochondria and chloroplasts (and perhaps more).

Which raises the question: what should we be teaching 5th graders? My reflex reaction is that we shouldn't teach them things they'll need to unlearn later, and the Monera kingdom concept is just not a very good one in the light of molecular phylogenies. But, what my further pondering brought up is that one goal of science education is to teach students the methods of science rather than just rote facts. Given a microscope or some photographs, it is pretty easy to teach a young student how to classify organisms into the 5 kingdom system. Trying to explain why archaea and eubacteria should be in different groups isn't so easy. Okay, a lot of archaea have pretty weird lifestyles (insanely low pH, even more insanely high heavy metal content, boiling water, etc), but not all do. Just being strange to us isn't really a useful way to categorize.

On the other hand, perhaps at least the notion of molecular classification can be introduced early. Granted, it's an N of 1, but I've successfully shown that you can teach the concept to a 3rd grader. It's also something which can be easy to diagram out & count -- with (obviously!) only a subset of informative positions. And in the end, wouldn't that be the best science lesson of all -- that things which look superficially alike may have an underlying, nearly hidden great difference?
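To show how little machinery the "diagram out & count" exercise needs, here is a minimal sketch. The four aligned fragments are made up purely for illustration (they are not real genes), and "variable column" here just means a position where not every sequence agrees, which is a simplification of what a phylogeneticist would call an informative position.

```python
from itertools import combinations

# Hypothetical, hand-made aligned fragments, for illustration only.
ALIGNED = {
    "bacterium_A": "ACGTTGCAAG",
    "bacterium_B": "ACGTTGCTAG",
    "archaeon_C": "ATGATGGAAC",
    "archaeon_D": "ATGATGGAAG",
}


def variable_columns(seqs):
    """Return the alignment positions where the sequences do not all agree."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


def differences(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print("variable columns:", variable_columns(ALIGNED.values()))
    for (name1, seq1), (name2, seq2) in combinations(ALIGNED.items(), 2):
        print(f"{name1} vs {name2}: {differences(seq1, seq2)} differences")
```

In this toy data the two "bacteria" differ from each other at one position, as do the two "archaea", while every cross-group pair differs at three or more; that is the counting a student could do by hand on a worksheet.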

Of course, the hardest part of any change is getting change. It appears that a generation of science teachers have been taught the 5 kingdom system, and so will need to be updated. Numerous textbooks probably also encapsulate this archaic (but not archean! :-) concept. Probably the hardest to change will be those statewide curriculum standards or standardized tests which contain these phylogenetic fossils.

Sunday, March 08, 2009

The next level in genomics term papers

I've been intrigued for a few months now since hearing about a St. Louis company called Cofactor Genomics. Right on their front webpage they advertise they will generate & assemble 680Mb of sequence (from an Illumina machine) for the paltry sum of $4.7K.

Wow! That would fit on my credit card when I was a graduate student (though it would have been a few months' stipend). 680Mb is 100+X coverage of an E. coli-class genome, or about 50X coverage of Saccharomyces. It's even well over 0.5X coverage of an awful lot of interesting eukaryotes.
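For anyone who wants to check the coverage arithmetic, here is a small sketch. The 680 Mb yield is the figure quoted above; the genome sizes are approximate values I am assuming (about 4.6 Mb for E. coli K-12 and about 12.1 Mb for Saccharomyces cerevisiae), and the function names are just illustrative.

```python
RUN_YIELD_BP = 680e6  # 680 Mb of sequence, the advertised figure

# Approximate genome sizes in basepairs (assumed round values).
GENOME_SIZES_BP = {
    "E. coli K-12": 4.6e6,
    "S. cerevisiae": 12.1e6,
}


def fold_coverage(yield_bp: float, genome_bp: float) -> float:
    """Average number of times each base is sequenced."""
    return yield_bp / genome_bp


def max_genome_for_coverage(yield_bp: float, coverage: float) -> float:
    """Largest genome (in bp) that a run covers at the given average depth."""
    return yield_bp / coverage


if __name__ == "__main__":
    for name, size in GENOME_SIZES_BP.items():
        print(f"{name}: {fold_coverage(RUN_YIELD_BP, size):.0f}X")
    limit = max_genome_for_coverage(RUN_YIELD_BP, 0.5)
    print(f"0.5X coverage reaches genomes up to {limit / 1e9:.2f} Gb")
```

Note this is average coverage only; it says nothing about how evenly the reads land or how well they assemble.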

As an aside, I feel obligated to stress that I don't have any personal stake in, or direct relationship with, Cofactor Genomics. I also have no experience with them or any of their competitors. It's just that the ease of accessing their pricing matrix makes them easy to talk about.

At those prices, the idea of doing my own personal genome project can't be easily shooed away. Not a Personal Genome Project -- I worry I'd develop genomania -- but some small genome sequenced on my whim. There's probably still not a shortage of interesting genomes in species I could easily & safely grow up with some forbearance of my shop's management or at a friendly academic lab. There must be some left; there are even some industrially-interesting E. coli strains that seem to lack public sequences. However, even if it wouldn't violate my town's zoning laws to do it in my basement, neither growing biological samples nor the $5K budget would fly with my spouse.

So I'll float a different idea. My only wish is that anyone who tries it post back here, and if you're already doing the same thing I invite your response as well. If I can't do it, why not some class?

Now $5K isn't chicken feed. I'm sure that is far beyond the typical budget for lab experiments in a college class, let alone a high school. Maybe a donor could step in, but these days donors are particularly tough to find. But suppose the cost were spread over a lot of students?

One scenario would be for a very large university to make this the project for an entire class. A really huge state school I would guess could have 500+ students a year taking first-year biology. Now we're talking less than $10/student -- perhaps still a significant hit (what is a typical per student budget for such a course?). Each student would get about 1/500th of the genome as their very own research project.
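The per-student numbers are easy to play with. This sketch uses the $4.7K price and the 500-student guess from the text; the 4.6 Mb genome size is my assumed stand-in for an E. coli-class genome, and the function name is illustrative.

```python
def per_student(total_cost: float, genome_bp: float, n_students: int):
    """Split a sequencing bill and a genome evenly across a class.

    Returns (cost per student in dollars, basepairs of genome per student).
    """
    if n_students <= 0:
        raise ValueError("need at least one student")
    return total_cost / n_students, genome_bp / n_students


if __name__ == "__main__":
    # $4.7K run, assumed 4.6 Mb genome, 500 first-year students.
    cost, share = per_student(4700, 4.6e6, 500)
    print(f"${cost:.2f} per student, about {share / 1000:.1f} kb of genome each")
```

With those inputs each student pays $9.40 and gets roughly 9 kb of sequence, which at typical bacterial gene densities would be a handful of genes apiece.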

At a smaller school, could a genome project become a departmental initiative? A bioinformatics class could set up the analysis pipeline & develop reporting tools. A biochemistry class could map the ORFs to the known biochemical pathways and identify both missing pathways and predicted novel (to the species) enzyme activities. Genetics classes could focus on operon structure or identifying possible regions recently transferred horizontally from another species. Evolution classes could tackle that, or building a bazillion gene trees. It's a bit of a stretch to work this into a human physiology curriculum, though a comparative look at how another biological system manages homeostasis isn't completely absurd.

Of course, when it comes time to publish it will be a very long author list!

I think I've heard of a genome project being run as an undergraduate effort, but I'm guessing a lot of that involved doing the actual sequencing. While there's merit to that, these days even with free labor, large-scale Sanger sequencing isn't cost competitive. Perhaps some departments have one of the next-gen machines & are willing to let some undergraduates play with them -- but I'm guessing that's pretty rare (like a NotI site in an AT-rich genome).

Will sequencing costs ever crash low enough that someone will sequence a genome for a grade school science fair project? I'm not holding my breath, but I certainly wouldn't rule it out.

Tuesday, March 03, 2009

MGH To Mutation-Type All Cancer Patients

Today's Boston Globe carried a front page item that Massachusetts General Hospital is planning to screen all cancer patients for a battery of about 110 common cancer mutations in 13 genes. MGH is apparently the first hospital to go this in depth on every patient.


This is an exciting push forward into personalized medicine, and it makes sense for a teaching hospital such as MGH to leap into the void. This sort of typing makes intuitive sense, but (as the article states) its clinical value remains to be proven. A few patients, such as one profiled in the piece, will have radical changes in treatment which benefit the patient -- in the example a woman had a relatively rare kinase fusion (EML4-ALK) for which an investigational drug was available -- and she responded spectacularly. But for many patients, the mutations found won't change care because there isn't a known way to target their mutation spectrum.

But, the huge value will be longer term as MGH builds a database of mutations and responses to treatment -- such a database will almost certainly provide new ideas for treatment, ideas which a research-focused hospital will be willing & able to try out. As more mutations are linked to cancer outcomes and screening costs come down, surely the panel will be expanded. MGH is also presumably planning to screen patients on both initial diagnosis and after relapses, so an increasingly rich database of mutations appearing during cancer progression will emerge.

It will be interesting to see how many other hospitals here -- and elsewhere -- follow. Boston has a small herd of top-notch hospitals and most (if not all) have significant cancer centers (with one, Dana-Farber, completely focused on the subject). Ideally the results of many such screens could be pooled into one or more common databases, with of course the need to protect patient confidentiality.

One barrier may be cost. The Globe article pegs it at $2000, and states it is unclear if insurers will pay -- in the past they have demanded proof of clinical value. While that isn't an indefensible position, it would be in their self-interest to chip in -- perhaps a prorated amount. First, it's lousy PR to not pay for diagnostics that are likely to work (and the drumbeat for single-payer is pretty much constant in the same paper). Second, the tests are likely to provide useful information some fraction of the time -- and in those cases may provide cost savings. MGH is apparently considering eating the cost or asking the patients to kick some in.

MGH may also be setting the price point for such services. $2K isn't far from the $4K for which Complete Genomics claims it will be able to run a complete genome in the not-too-distant future. $2K probably is in the ballpark already for sequencing off capture arrays.

Of course, budgets for diagnostics aren't infinite. Will such initiatives be knocking elbows with other genomics-driven diagnostics, such as the existing array-based assays (e.g. OncotypeDX, CupPrint)? Will greater value come from methylation profiling or other assays which evaluate markers not available to current sequencing technologies? Time will tell.