Wednesday, January 28, 2009
When I was a junior in high school, on a day much like today, I wanted to stay home and watch TV a bit, so I was hoping the wintry weather would generate a snow day. I didn't often wish for this, as my childhood love of snow had subsided substantially (though I would sometimes ski through my yard), but on this day I wanted to be home. Winter and the superintendent, however, did not cooperate: we had only a delayed opening, and since hooky was out of the question in my family, off I went.
And so I was sitting in Mr. Schmidt's chemistry class that morning. He was a nice man, but that class did very little to prepare me for a life on the periphery of chemistry, except that he did an excellent job of outlining the early 20th century revolution in chemistry & physics. I do not remember what he was talking about that morning when Mrs. Kurtz, the Biology II teacher, came in and commented on a news event. We all nodded, since we were expecting the news -- but then she repeated herself, because we clearly had not heard her, and Mr. Schmidt got out the TV in his closet and I found myself watching TV that morning -- exactly what I had hoped to watch on a snow day, but also nothing I had ever imagined or could have remotely hoped to watch. For that restatement was: "No, the space shuttle blew up!".
When my boy was three, we set out one weekend to take him to the Boston Children's Museum, a wonderful place for a child of that age to explore and run around and have fun. As a bonus, we would ride the subway there, and oh how he loved to ride trains. It was again a winter day, and I drove the usual route to Boston; there is a spot on I-93 where you come out of the relatively untouched beauty of the Middlesex Fells and the skyline of Boston suddenly appears. It was in that spot that I heard the radio report whose meaning became instantly clear, and I semi-silently cried "No!" -- an extended loss of radio contact with a space shuttle could not ever end happily.
We are in the midst of that grim week of anniversaries for NASA; yesterday marked the 42nd anniversary of Apollo 1, today the 23rd anniversary of the loss of Challenger, and Sunday the 6th anniversary of the loss of Columbia. Only one of those events has any obvious connection to this time of year.
For as long as I can remember the space program has had an outsized influence on my imagination. My career path did not take me in a good direction to go to space, but I still think about it almost daily. In some ways these three disasters are completely removed from what I do, but in other ways they are not. I do subscribe to Edward Tufte's argument that poor data visualization helped enable the Challenger disaster, and while my plots do not carry such weighty implications I still must be ready in case they ever do. All three of these were hardware failures, and I do software, but software failures have caused unmanned probes to be lost and manned missions to go awry.
But above all else, it is important to remember those who pushed the limits and did not return. We must remember who they were and why they died, as they died doing important things and they died because humans make mistakes. Grissom, White & Chaffee were doomed by a design from which escape was impossible and fire likely. Smith, Scobee, McNair, Onizuka, McAuliffe, Jarvis & Resnik died when a machine was run far outside its normal operating regime. Brown, Husband, Clark, Chawla, Anderson, McCool and Ramon died from a design which was not well matched to the materials used to construct it.
We recently learned some more details of the Columbia accident: how the astronauts never realized the disaster approaching them, but how pilot McCool worked calmly to deal with system after system failing just before the failures killed him. I wish I could have such coolness under stress.
Monday, January 26, 2009
Next, exploding DNA packs at the banks
I use gmail for my personal mail & actually tend to enjoy the sidebar ads. Yes, most are silly or uninteresting, but once in a while there are some odd or amusing ones. There are also some patterns -- email from my one brother often brings up inane creationist sites (which I click through to -- I figure I'd rather Google have their money than them), as we are often talking about chimps -- and that is clearly one of their buzzwords.
So here's a use for DNA that would have never occurred to me: tagging burglars with it. Or more importantly, threatening to tag them with it. All sorts of claims are made that the appearance of surveillance is nearly as useful as actual surveillance for deterring property crime, so I guess this is in that bucket.
Multiple SelectaDNA Spray heads can be fitted at the entry points of premises and on activation emit a burst of SelectaDNA solution onto the offenders. The solution contains a UV tracer and a unique DNA code, linking them irrefutably to the crime scene. The DNA Spray can be armed by a panic button and/or linked to an existing intruder alarm system. As the DNA fear-factor amongst criminals is high, it is likely that sprayed intruders will flee the crime scene before stealing any goods.
Will it work? Will some enterprising criminal start marketing DNase spray? When will it show up on CSI?
Sunday, January 25, 2009
Are the old lessons being forgotten?
Okay, first I feel like I have to have a bit of preamble. This, and another post I'm doing the homework on, are pretty critical. Downright negative. I'm not turning into a curmudgeon or planning to turn this space into a rant-a-thon. It's just that both are topics I think are important & have pushed the right buttons.
Also, this isn't meant to be high-and-mighty-and-spotless-expert calling calumny on the great unwashed masses. If I look down at my metaphorical foot I find many tightly spaced patterns of scars, sometimes nearly concentric. We all make mistakes, and often we repeat those of the past. We think we've covered bases that have always been covered or deceive ourselves that safety mechanisms which were needed in the past are no longer necessary.
A bit ago at work I was doing some exploring of a standard a backbone and became curious just how taxonomically widespread pieces of the backbone might be found naturally. So naturally, I pumped the sequence into the NCBI BLASTN server & pointed it at the RefSeq genomes. As expected, a bunch of bacterial plasmids popped up. What was unsettling, though, was a bunch of provisional genomic RefSeqs for eukaryotic chromosomes. Indeed, one project had apparently deposited every chromosome with a pUC-type vector sequence at one end. YIKES!
The other day I got curious again & tried searching the non-redundant DNA and protein databases but with the species filter set to eukaryote. Again, a bunch of hits -- and the shocking part was many were very recently deposited sequences -- even human ones. In some cases, the entire deposited sequence was vector-derived (e.g. the non-human "putative reverse transcriptases" ABK60177.1, CAD59768.1, CAD59767.1 & CAL37000.1).
For example, AK302803.1 is a 1352 nucleotide sequence deposited in 2008; from 888 on is clearly vector -- and the coding region is annotated as 1 to 1275! CAH85743 is a "Plasmodium" protein which is entirely vector derived; again deposited in 2008. PIR (is anybody still curating this?) has a number of vector-derived proteins (e.g. the 231 amino acid "NZ-3 antigen" JC7702; S.pombe beta-lactamase (!) T51301); I was surprised to even find a SwissProt entry that looks like it has pUC-derived sequence
Even the RefSeq mRNA section has some very provisional mammalian predicted cDNAs (from chimp) which appear to be polylinker-type sequences from vector (selected restriction sites are marked)
Contamination of various sorts has plagued genome projects from the get-go. Perhaps the most notorious was a large deposition of human ESTs which were donated to the public with great fanfare (as a counterpoint to private EST efforts), only to be found later to be rich in yeast sequences. The solution is to run filters -- search everything you do against vectors, E.coli and other common contaminants. In addition, especially in this day-and-age, if your "human" mRNA sequence doesn't match the genome, you've got some 'splaining to do.
What's the harm? Well, when it comes to databases I don't like mess. You always need to check your data, but it's always a nuisance when you actually have to clean it a bunch. Miss something, and some experiment is dirty or worse ruined. Plus, and this is a bit of the theme to my proto-post, some folks haven't yet figured this out & the results are truly ugly. Even worse, these are the obvious problems since bacterial vectors in a eukaryotic sequence truly stick out. Now I'm wondering about all the pUC-like sequences I found in bacterial sources -- can I trust them either?
So, let's all make a it's-still-a-pretty-new-year resolution to recheck our sequencing pipelines. Deliberately throw pUC19 and the E.coli genome through it & see what comes out.
Also, this isn't meant to be high-and-mighty-and-spotless-expert calling calumny on the great unwashed masses. If I look down at my metaphorical foot I find many tightly spaced patterns of scars, sometimes nearly concentric. We all make mistakes, and often we repeat those of the past. We think we've covered bases that have always been covered or deceive ourselves that safety mechanisms which were needed in the past are no longer necessary.
A bit ago at work I was doing some exploring of a standard vector backbone and became curious just how taxonomically widespread pieces of the backbone might be found naturally. So I pumped the sequence into the NCBI BLASTN server & pointed it at the RefSeq genomes. As expected, a bunch of bacterial plasmids popped up. What was unsettling, though, was a bunch of provisional genomic RefSeqs for eukaryotic chromosomes. Indeed, one project had apparently deposited every chromosome with a pUC-type vector sequence at one end. YIKES!
The other day I got curious again & tried searching the non-redundant DNA and protein databases, but with the species filter set to eukaryotes. Again, a bunch of hits -- and the shocking part was that many were very recently deposited sequences -- even human ones. In some cases, the entire deposited sequence was vector-derived (e.g. the non-human "putative reverse transcriptases" ABK60177.1, CAD59768.1, CAD59767.1 & CAL37000.1).
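If you want to poke at this yourself, the check is easy to script. Below is a minimal sketch using Biopython's web BLAST interface; the database name, organism filter string and E-value cutoff are illustrative placeholders I chose for the sketch, not the exact searches described above.

# Minimal sketch: BLAST a vector backbone over the web and report strong hits
# restricted to eukaryotes. Placeholders: the database ("nt"), the Entrez
# organism filter, and the E-value cutoff.
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

def screen_for_unexpected_hits(fasta_path, database="nt",
                               organism_filter='"Eukaryota"[Organism]',
                               evalue_cutoff=1e-20):
    """BLAST each sequence in fasta_path and print its strong hits."""
    for record in SeqIO.parse(fasta_path, "fasta"):
        handle = NCBIWWW.qblast("blastn", database, str(record.seq),
                                entrez_query=organism_filter,
                                expect=evalue_cutoff)
        result = NCBIXML.read(handle)
        for alignment in result.alignments:
            best = alignment.hsps[0]
            print(record.id, alignment.title[:60],
                  "E=%.2g" % best.expect, "len=%d" % best.align_length)

# e.g. screen_for_unexpected_hits("pUC19_backbone.fa")  -- hypothetical file name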
For example, AK302803.1 is a 1352 nucleotide sequence deposited in 2008; from position 888 onward it is clearly vector -- and the coding region is annotated as 1 to 1275! CAH85743 is a "Plasmodium" protein which is entirely vector-derived; again deposited in 2008. PIR (is anybody still curating this?) has a number of vector-derived proteins (e.g. the 231 amino acid "NZ-3 antigen" JC7702; an S.pombe beta-lactamase (!), T51301); I was surprised to even find a SwissProt entry that looks like it has pUC-derived sequence:
>sp|Q63661.2|MUC4_RAT RecName: Full=Mucin-4; Short=MUC-4; AltName: Full=Pancreatic
adenocarcinoma mucin; AltName: Full=Testis mucin; AltName: Full=Ascites
sialoglycoprotein; Short=ASGP; AltName: Full=Sialomucin
complex; AltName: Full=Pre-sialomucin complex; Short=pSMC;
Contains: RecName: Full=Mucin-4 alpha chain; AltName:
Full=Ascites sialoglycoprotein 1; Short=ASGP-1; Contains: RecName:
Full=Mucin-4 beta chain; AltName: Full=Ascites sialoglycoprotein
2; Short=ASGP-2; Flags: Precursor
Length=2344
GENE ID: 303887 Muc4 | mucin 4, cell surface associated [Rattus norvegicus]
(Over 10 PubMed links)
Score = 46.6 bits (109), Expect = 0.006
Identities = 22/35 (62%), Positives = 25/35 (71%), Gaps = 3/35 (8%)
Frame = -3
pUC19 1427 CCLQTKKPPLPAVVCLPDQELPTLFPKVTGFSRAQ 1323
           CCLQTKKPPLPAVVCLPD   P+  P +   S+ Q
Sbjct 1051 CCLQTKKPPLPAVVCLPD---PSSVPSLMHSSKPQ 1082
Even the RefSeq mRNA section has some very provisional mammalian predicted cDNAs (from chimp) which appear to be polylinker-type sequences from vector (selected restriction sites are noted below):
Restriction sites present, left to right: BamHI (GGATCC), XbaI (TCTAGA), SalI (GTCGAC), PstI (CTGCAG) and PaeI (GCATGC, only partially present at the right end of these alignments).
pUC19 415 GGGGATCCTCTAGAGTCGACCTGCAGGCATG 444
XM_001160101.1 56 GGGGATCCTCTAGAGTCGACCTGCAGGCAT 85
XM_001146903.1 439 GGATCCTCTAGAGTCGACCTGCAGGCATG 467
XM_001141474.1 1503 GGGATCCTCTAGAGTCGACCTGCAGGCA 1530
XM_001141395.1 922 GGGATCCTCTAGAGTCGACCTGCAGGCA 949
Contamination of various sorts has plagued genome projects from the get-go. Perhaps the most notorious was a large deposition of human ESTs which were donated to the public with great fanfare (as a counterpoint to private EST efforts), only to be found later to be rich in yeast sequences. The solution is to run filters -- search everything you do against vectors, E.coli and other common contaminants. In addition, especially in this day-and-age, if your "human" mRNA sequence doesn't match the genome, you've got some 'splaining to do.
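Even a crude filter helps. Here is a sketch of a poor man's VecScreen in Python: flag any submission that shares an exact k-mer with a known vector. For real work you would align against NCBI's UniVec plus your favorite contaminant genomes, but even this would catch a whole polylinker masquerading as an mRNA.

# Poor man's vector screen: flag any sequence sharing an exact k-mer with a
# known vector. Crude by design -- real pipelines should align against UniVec
# and common contaminant genomes -- but it catches the embarrassing cases.
def build_kmer_index(vector_seqs, k=25):
    """Collect every k-mer (and its reverse complement) from the vectors."""
    comp = str.maketrans("ACGT", "TGCA")
    kmers = set()
    for seq in vector_seqs:
        seq = seq.upper()
        for s in (seq, seq.translate(comp)[::-1]):   # forward and revcomp
            for i in range(len(s) - k + 1):
                kmers.add(s[i:i + k])
    return kmers

def vector_hits(query, vector_kmers, k=25):
    """Return start positions in query that share a k-mer with a vector."""
    query = query.upper()
    return [i for i in range(len(query) - k + 1)
            if query[i:i + k] in vector_kmers]

# Usage sketch (file and variable names are hypothetical):
#   puc19 = "".join(open("pUC19.fa").read().splitlines()[1:])
#   index = build_kmer_index([puc19])
#   if vector_hits(submission_seq, index):
#       print("possible vector contamination -- check before depositing!")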
What's the harm? Well, when it comes to databases I don't like mess. You always need to check your data, but it's always a nuisance when you actually have to clean up a lot of it. Miss something, and some experiment is dirty or, worse, ruined. Plus, and this is a bit of the theme of my proto-post, some folks haven't yet figured this out & the results are truly ugly. Even worse, these are the obvious problems, since bacterial vectors in a eukaryotic sequence truly stick out. Now I'm wondering about all the pUC-like sequences I found in bacterial sources -- can I trust them either?
So, let's all make an it's-still-a-pretty-new-year resolution to recheck our sequencing pipelines. Deliberately throw pUC19 and the E.coli genome through it & see what comes out.
Saturday, January 24, 2009
Earning the right to put "DNA" in your address
GenomeWeb had an item about real estate developers putting "DNA" in their property names; I had spotted the DNA Lofts in Dorchester but hadn't gotten around to blogging about them (annoying to be scooped, but that's procrastination for you).
However, as far as I can tell the DNA Lofts are just a catchy name, with no actual tie-in -- which is très disappointing, since it would be a convenient Red Line ride from the nearby Savin Hill station to the biotech areas of Cambridge. Surely they could do better by picking something off this list to truly earn a DNA tie-in:
- Rehabbing space relevant to the history of biotech ("These walls are still contaminated with phage from seminal experiments...")
- Subtle decorative motifs, such as floors tiled with the genetic code table
- Major architectural elements. Double-helical staircases are an obvious one, but how about pyrimidine & purine-shaped windows?
- Under-the-counter thermocyclers in the kitchens (and -80 compartments in the freezers), washing machines built by Sorvall, etc.
- Themed common areas: The Topoisomerase Lounge (where you can unwind). The Proteasome recycling center.
Of course, the best of all -- but quite ambitious -- would be to use a synthetic biology approach to construct the building!
Friday, January 23, 2009
Forgetting Occam's Razor
As I've confessed before, one of my recreational vices is the TV show House. It's entertaining enough & Hugh Laurie is really good in the title role and it just relaxes me a bit. I always thought it was harmless, but now I'm wondering.
There is a saying in medicine which has become quite well known thanks to medical shows: If you hear hoof beats, think horses not zebras. In other words, consider the most common cause for a symptom before marching off to explore some rare disease which could cause it. The thing about House is that it doesn't just feature zebras, but giant carnivorous purple-and-orange Martian zebras. Plots either revolve around very unusual diseases or more commonly not so unusual diseases with totally bizarre presentation.
Some nasty GI bug, or perhaps a gang of them, latched onto me last week and while I was much better this week I couldn't quite seem to kick it. So I was off to my internist yesterday in hopes of getting an antibiotic scrip. TNG was along for the ride, also in the process of shaking off a bug. He at least brought some reading material (the apropos, in a macabre fashion, The Hostile Hospital), but I had not. So I was scanning through the waiting room magazines & lo and behold: a copy of New England Journal of Medicine (and recent too!).
I don't regularly read NEJM for the simple reason that most of the articles aren't really in my field: they rarely publish molecular medicine studies, though when they do show up they tend to be huge splashes. So I started skimming the ToC for something interesting & spotted an intriguing headline.
Hypogonadism Due to Pituicytoma in an Identical Twin
But as I read and re-read the short article I became increasingly puzzled: how exactly was the pituicytoma in one twin causing the hypogonadism in the other twin?
Then it hit me: only a House fan would have parsed that title that way. There was nothing that bizarre going on. One twin: healthy. The other twin: not-healthy. Duh!
Wednesday, January 21, 2009
Where did those gene count estimates come from anyway?
When mentally reviewing what I wrote yesterday about the great human genome gold rush, I realized I hadn't really touched on one of the most curious bits of it. Indeed, it was GenomeWeb's Daily Scan headline on an entry summarizing my piece and Derek Lowe's that reminded me of it: all those varying estimates for the human gene count.
When the human genome was only partially sequenced, one of my colleagues at Millennium tried to dig through the literature and figure out the best estimate for the number of human genes. Many textbooks & reviews seemed to put the number in the 50,000-75,000 range -- my 2nd edition of Alberts et al., Molecular Biology of the Cell from junior year states:
no mammal (or any other organism) is likely to be constructed from more than perhaps 60,000 essential proteins (ignoring for the moment the important consequences of alternative RNA splicing). Thus, from a genetic point of view, humans are unlikely to be more than about 10 times more complex than the fruit fly Drosophila, which is estimated to have about 5000 essential genes.
The argument laid out in this textbook is one based on population genetics & mutation rates, and is basically an upper bound given observed DNA mutation rates and the size of the genome.
The other pre-sequencing methodology that was often cited was DNA reassociation kinetics, an experimental approach which can estimate what fraction of DNA in a genome is unique and what fraction is repeated. If we assume that genes lie only in the unique regions, then knowing the size of the genome and the unique fraction, one could estimate the amount of space left over for genes.
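As a back-of-the-envelope illustration of that sort of estimate (every number below is an invented placeholder, not one of the historical inputs):

genome_size     = 3.2e9   # haploid human genome, bp (illustrative)
unique_fraction = 0.5     # fraction judged single-copy by reassociation (illustrative)
space_per_gene  = 3.0e4   # assumed bp of "gene territory" per gene (illustrative)

max_genes = genome_size * unique_fraction / space_per_gene
print("rough upper bound: %.0f genes" % max_genes)   # ~53,000 with these inputs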
What my colleague was unable to find, strangely, was any paper which actually declared a gene count as an original result. As far as he could tell, the human genome estimate had popped into being like a quantum particle in a vacuum, and then was repeated. I think it would be a great challenge for someone (or a whole class!) at a university with a good (and still accessible!) collection of the older journals to try to find that first paper, if it does exist.
Now the whole reason for this is that it was useful to have a ballpark figure. For example, if we thought we could find 20K human genes and somebody had a database of 200K human genes, then maybe we were missing out on 75% of the valuable genes -- and should consider buying into a database. Or, if we thought we could find them on our own, it made a difference what we might try to negotiate. If we thought 1% of the genes would fall into classical drug target categories, a 4X difference in gene count could really alter how we would structure deals.
MLNM wasn't a great trafficker in human gene numbers, but many other companies were -- and generally seemed to one-up each other. If Incyte claimed their data showed 150K genes, then HGS might claim 175K and Hyseq 200K (I don't remember precisely who claimed which, though these three were big traffickers in numbers).
So my colleague tried a new approach, which I think was to say: we have a few percent of the human genome sequenced (albeit mostly around genes of interest and not randomly sampled). How many genes have been found? And what would that extrapolate out to for the whole genome?
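The extrapolation itself is one line of arithmetic; with invented numbers purely for illustration:

fraction_sequenced = 0.03   # "a few percent" of the genome (illustrative)
genes_found        = 800    # distinct genes tallied in that fraction (illustrative)

estimated_total = genes_found / fraction_sequenced
print("naive extrapolation: ~%.0f genes" % estimated_total)   # ~26,667

# Caveat from the text: the sequenced regions were mostly chosen *because*
# they were around genes of interest, so if anything this biases the
# estimate upward, not downward.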
His conclusion was so shocking I admit I refused to believe it at first, and never quite bought into it. I think it was about 25-30K. How could the textbooks be off by 2X-3X? I could believe the other genomics companies might be optimistic in interpreting their data, but could they really be deluding themselves that much?
But, the logic was hard to assault. In order for his estimate to be low by a lot, you would have to posit that the genomic regions sequenced to date were unusually gene poor -- and that the rest of the genome was packed.
Lo and behold, when the genome came in his estimate was shown to be prescient. The textbook numbers were based on very crude techniques, and couldn't really be traced down to an original source to verify the methods or check the various inputs. But, what about all those other companies?
I've never heard any of the high estimators explain themselves, other than the brief bit of "yeah, the genome's out but y'all missed a lot of stuff" which followed the genome announcements. I have some general guesses, however, based on what I saw in our own work. In general, though, it gets down to all the ways you can be fooled looking solely (or primarily) at EST data.
First, there is the contamination/mistracking problem: some of the DNA in your database isn't what it is supposed to be. The easiest to understand is contamination: bits of environmental stuff get into your sequencing libraries. The simplest case is E.coli, and early on there was a scandalous amount of yeast in some public EST libraries, but all sorts of other stuff will show up. One public library had traces of Lactobacillus in it -- which I joked was due to the technician eating yogurt with one hand while preparing the library with the other. At least once I saw a library contaminated with tobacco sequences. Now, many of these were probably mistracking of samples at a facility which processed many different sorts of DNA -- indeed, there was a strong correlation between the type of junk found in an EST library and which facility had made it -- and the junk usually corresponded to another project.
But even stranger laboratory-generated weirdness could result. We had one case at MLNM where nearly every gene in a whole library seemed to be fused to a particular human gene. The most likely explanation we came up with was that the common gene had been sequenced, as a short PCR product, and somehow samples had been mixed or contamination left behind in a well. The strong signal from the PCR product swamped out the EST traces -- until the end of the PCR product was reached & the other signal could now be seen.
Still other weird artifacts were certainly created during the building of the library -- genomic contamination, ligation of bits of DNA to create chimaeras, etc.
Deeper still, bits of the genome sometimes get transcribed or the transcripts spliced in odd ways. We would find ESTs or EST read pairs (one read from each end of the molecule) which would suggest some strange transcript -- but never be able to detect the transcript by RT-PCR. Now, that doesn't prove it never exists, but it does leave open the possibility that the EST was a one-time wonder.
All of these are rare events, but look through enough data and you will see them. So, my best guess for those overestimates is that everything in these companies' databases was fed into a clustering algorithm & every unique cluster was called a gene. Given the perceived value of claiming a bigger database, none of them pushed on their Informatics groups to get error bounds or provide a conservative estimate.
Of course, once the genome showed up the evidence was there to rule out a lot of stuff. Even when the genome was quite unfinished, one of my pet projects was to try to clean the junk out of our database. So, once we started trying to align all our human ESTs (which included public ESTs and Incyte's database) to the genome, I started asking: what is the remaining stuff? Some could never be figured out, but more than a little mapped to some other genome -- mouse, rat, fly, worm, E.coli, etc. Some stuff mapped to the human genome -- but onto two different chromosomes, or too far apart to make sense. Yes, there could be some interesting stuff there (indeed, someone else did realize this was a way to find interesting stuff), but for our immediate needs we just wanted to toss it.
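The triage logic was roughly along these lines -- a simplified sketch with invented alignment records and thresholds, not the code we actually ran:

from collections import namedtuple

Hit = namedtuple("Hit", "genome chrom start end")   # one EST-to-genome alignment
MAX_LOCUS_SPAN = 1_000_000   # arbitrary "too far apart" cutoff, in bp

def triage_est(hits):
    """Classify an EST from its best genome alignments."""
    if not hits:
        return "no hit -- unexplained"
    if any(h.genome != "human" for h in hits):
        return "maps to another genome -- toss as contamination/mistracking"
    if len({h.chrom for h in hits}) > 1:
        return "split across chromosomes -- suspect chimaera"
    span = max(h.end for h in hits) - min(h.start for h in hits)
    if span > MAX_LOCUS_SPAN:
        return "pieces too far apart -- suspect artifact"
    return "consistent human locus -- keep"

# e.g. triage_est([Hit("human", "chr7", 100_000, 101_200),
#                  Hit("mouse", "chr5", 880_000, 880_600)])
# -> "maps to another genome -- toss as contamination/mistracking"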
If anyone from one of the other genomics companies would like to dispute what I've written here, I invite them to do so -- I think it is a fascinating part of history which should be captured before it is all forgotten.
Tuesday, January 20, 2009
Ah, them gold rush days!
Derek Lowe had a nice piece yesterday looking back on the genomics bubble. I might quibble with his benchmarking of the end of the insanity -- the stock market bubble would not peak until just before the 2000 elections, but it's a fine piece & pretty accurate.
I should know -- I was there. I was more than just there, I was a significant part of it. No, I didn't think it up & I won't try to exaggerate my importance, but I worked for what is perhaps the poster child of genomics excess (and if not that, certainly a member of the Pantheon of genomanic deities).
When I got to Millennium they were still largely focused on the positional cloning of disease genes. But they had started throwing sequencing capacity at ESTs, small bits of genetic message which serve as toeholds to larger ones. The catch was that the sequencing analysis software had been designed for positional cloning work & not ESTs, and it's a very different ballgame. When sequencing genomic DNA, seeing anything which looked like a gene was interesting. But when sequencing stuff that is almost nothing but genes, the challenge was to sort the wheat from the chaff. Lots of scientists spent mind-numbing hours scanning BLAST reports for things of interest, and often found things. But this is a lousy technique -- not only might eyes glaze over (or neurons croak) from monotony, but a really interesting match might not be obvious -- what if the top hit was "Uncharacterized protein X" but the 3rd match down was "TotalPharmaceuticalGold"? Or worse, what if BLAST couldn't even find a useable match? Plus, was that a match or an identity -- did you find something new or just rediscover a lousy fragment of the old? More mind-numbing staring.
Enter a cocky recent Ph.D. After building up some expertise and some more refined tools (which in their embryonic form nailed me the one gene patent of mine perhaps worth something), I had built a system which churned through all the ESTs and crudely organized them by what made things interesting (and tried to ignore all the boring stuff). Ion channels -- look on this web page. GPCRs -- that's over here. Possible secreted proteins, look at this analysis. Furthermore, it also attempted to amalgamate all the different ESTs into a view which was higher quality, longer and more compact -- and tell you which things were already described as proteins and which might be novel. Plus, more sensitive algorithms than BLAST were used to pull things into families.
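The sorting-into-buckets idea is easy to caricature in a few lines. To be clear, this is an illustration and not the actual Millennium system: the keyword lists are placeholders, and the real thing leaned on more sensitive family-level searches rather than string-matching hit descriptions.

# Caricature of "sort ESTs into buckets of interest" from BLAST hit titles.
# Keyword lists are placeholders; the real system used family-level methods.
INTEREST_BUCKETS = {
    "ion_channel":       ["ion channel", "potassium channel", "sodium channel"],
    "gpcr":              ["g protein-coupled receptor", "7 transmembrane"],
    "possibly_secreted": ["secreted", "signal peptide"],
}

def bucket_for_hit(hit_title):
    """Assign one BLAST hit description to a bucket of interest, if any."""
    title = hit_title.lower()
    for bucket, keywords in INTEREST_BUCKETS.items():
        if any(kw in title for kw in keywords):
            return bucket
    return "boring"

def build_report_pages(est_to_hit_titles):
    """Group ESTs by the most interesting bucket any of their hits lands in."""
    pages = {}
    for est_id, titles in est_to_hit_titles.items():
        buckets = {bucket_for_hit(t) for t in titles}
        best = next((b for b in INTEREST_BUCKETS if b in buckets), "boring")
        pages.setdefault(best, []).append(est_id)
    return pages   # one "web page" worth of ESTs per bucket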
Now in all honesty, it wasn't nearly perfect. Some of the mind-numbing review had shifted to me -- the early versions in particular had every homology approved (and named!) by me. The semi-automatically generated names were ugly. Various EST artifacts could join webs of unrelated genes into a horrible tangle. But, now there could be reviews of consolidated, pre-analyzed data (though also in fairness nobody ever totally trusted it, so the manual sequence-by-sequence reviews often continued).
Of course, if you have a mountain of loot you probably want to protect it. Enter the lawyers. Millennium had always filed on their discoveries; now they had lots of discoveries to protect. But protect from what? Well, the paranoia was a loss of "Freedom to Operate", usually known as FTO. Nobody knew what would stand up as a patent -- but there were instructive examples from the early biotech era of business plans sunk by a loss of FTO -- and expensive lawsuits that clearly marked that loss. So the patenting engine took off -- an expensive insurance policy against an unpredictable future.
Of course, what the lawyers wanted for the filing was as much info as possible -- and the automated analyses provided lots for them. But they had been designed to be viewed in a web browser individually, not printed out en masse. Worse yet, by this time Informatics & Legal were in separate buildings -- one of my least pleasant Millennium memories was trying to script the printing of a raft of analyses on a printer located in the other building. Plus, if there were inventions then somebody had to have invented them -- such as the person who wrote the code to find them & then reviewed the initial output. And so I started having dates with the paralegals, an hour of hand-cramping signing of document after document. At one point, there were somewhere between 120 and 140 patent applications where I was sole or co-inventor.
This was the late 90's and the hype was getting thick -- we were guilty but so were others. Millennium wasn't a big pusher of high gene counts -- at least in the terms of the day (but that's another whole story), but certainly we started selling all those genes we had & the ones we extrapolated were still out there. A key part of the business model was to sell the genes many times -- if we could sell the same gene to Lilly for cardiovascular & Roche for metabolic and AstraZeneca for inflammation, all the better. Not that anything underhanded went on; we'd present the case to each company & most of the deals had exclusivity only within a therapeutic area.
How much did we believe our own Kool-Aid? It varied. There was one day where I got in a blue mood because I convinced myself that once MLNM found all the genes we'd put ourselves out of work! But that was an extreme (and what I hope is the height of my own personal stupidity); most of the time we thought we might be right or we might be overestimating a bunch -- but that our partners were intelligent adults who could make the same calculations. Never did I see an attitude that we were fleecing the suckers.
In particular, I remember one of my colleagues making a comment when the Bayer deal was about to be signed. A premise of that deal was that Millennium would identify proteins which could be easily screened, associate them by multiple means with a plausible role in disease, configure an HTS assay for them -- and then Bayer would quickly get hits from their libraries. Those hits in turn would be used to finish determining whether the protein of interest really played a role in disease. MLNM's (over)confidence in genomics was matched by Bayer's (over)confidence in chemistry. My colleague said it was one thing to think up such an idea -- and another to 'go over the cliff' -- and he was nervously surprised that someone else was joining us. He was one of the most sober-minded fellows around & wasn't making allusions to Bayer being foolhardy -- just that we were both taking the leap together. Alas, I didn't think to laugh & reply "The fall will kill you".
The genomics rush, alas, did not end with a huge rush of new drug candidates. We thought we'd get a huge leap in biology -- and we did, but not as big as we thought. Traditional drug development & biology had cleaned out the easy stuff; there weren't tons of hidden gems. The chemical biology concept pretty much disappeared from the Bayer collaboration -- turned out it was long-and-painful to configure all those assays (though we did get them done).
BUT, I will admit to being only a partially reformed genomics fan. We got oversold, and it hurt. Much effort was wasted, and just think of the savings if the patent office had declared that you had to have actual causal function to patent a gene! But much of what we proposed doing is still worth doing -- or has been done. In some sense the genomics companies were just too early for their own good (though late entrants such as DeCode haven't fared much better). There are no genomics companies anymore -- yet genomics is everywhere. Basic biology fueled by the genome, or the technologies pushed by genomics, permeates the drug industry (based on the 2 large pharmas I interviewed at in the year MLNM laid me off & what I can read; constructive dissent on this point is welcomed). Probably no novel small molecule drug's development history will be directly pinned back to a 1990s genomics effort -- but also virtually no drug going forward will have its development unaffected by knowledge of the genome. Everything is tangled up & confused & merged.
The genomics gold rush was insane & wasteful -- but they were fun times!
Tuesday, January 06, 2009
Watson's solo discovery of DNA
Well, my memory must be truly failing. No offense to Honest Jim, but I always thought he had a partner in finding the structure of DNA. And didn't some third guy share in the Nobel also? Plus, isn't there some experimentalist that people grouse should have gotten some credit?
But, I stand corrected:
In the last 50 years since Watson first discovered the structure of DNA, many advances have been made to enable researchers to study and dissect this macromolecule.
Now, some might warn that the Internet doesn't always have reliable information, but this is from a .edu site (and not some student's personal page either), so it must be right, right?