Wednesday, January 21, 2009

Where did those gene count estimates come from anyway?

When mentally reviewing what I wrote yesterday about the great human genome gold rush, I realized I hadn't really touched on one of the most curious bits of that story. Indeed, it was GenomeWeb's Daily Scan headline on an entry summarizing mine & Derek Lowe's pieces that reminded me of it: all those varying estimates for the human gene count.

When the human genome was only partially sequenced, one of my colleagues at Millennium tried to dig through the literature and figure out the best estimate for the number of human genes. Many textbooks & reviews seemed to put the number in the 50,000-75,000 range -- my 2nd edition of Alberts et al, Molecular Biology of the Cell from junior year states:

no mammal (or any other organism) is likely to be constructed from more than perhaps 60,000 essential proteins (ignoring for the moment the important consequences of alternative RNA splicing). Thus, from a genetic point of view, humans are unlikely to be more than about 10 times more complex than the fruit fly Drosophila, which is estimated to have about 5000 essential genes.

The argument laid out in this textbook is one based on population genetics & mutation rates, and is basically an upper bound given observed DNA mutation rates and the size of the genome.
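
Just to make the flavor of that argument concrete, here's a back-of-the-envelope version in Python. Every number in it is a round assumption of my own, not an input from the textbook; the point is just the shape of the calculation.

    # Rough sketch of a mutational-load style bound (all numbers are assumed
    # round figures, not taken from Alberts et al.).
    per_base_mutation_rate = 1e-8     # assumed mutations per base pair per generation
    sensitive_bases_per_gene = 1000   # assumed bases per gene where a mutation would hurt
    tolerable_load = 0.6              # assumed new harmful mutations a lineage can absorb per generation

    per_gene_rate = per_base_mutation_rate * sensitive_bases_per_gene  # ~1e-5 per gene
    max_essential_genes = tolerable_load / per_gene_rate
    print(f"upper bound: ~{max_essential_genes:,.0f} essential genes")  # ~60,000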

The other pre-sequencing methodology that was often cited was DNA reassociation kinetics, an experimental approach which can estimate what fraction of the DNA in a genome is unique and what fraction is repeated. If we assume that genes lie only in the unique regions, then knowing the size of the genome and the unique fraction gives an estimate of the amount of space left over for genes.
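
A sketch of how that back-of-the-envelope goes, again with placeholder numbers I've picked purely to make the arithmetic visible:

    # Reassociation-kinetics style estimate: genome size, times the single-copy
    # fraction, divided by an assumed per-gene footprint.
    genome_size = 3.0e9        # base pairs, haploid human genome
    unique_fraction = 0.5      # assumed single-copy fraction from reassociation data
    bases_per_gene = 30_000    # assumed average footprint per gene
    gene_space_estimate = genome_size * unique_fraction / bases_per_gene
    print(f"room for roughly {gene_space_estimate:,.0f} genes")  # ~50,000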

What my colleague was unable to find, strangely, was any paper which actually declared a gene count as an original result. As far as he could tell, the human gene count estimate had popped into being like a quantum particle in a vacuum, and had simply been repeated ever since. I think it would be a great challenge for someone (or a whole class!) at a university with a good (and still accessible!) collection of the older journals to try to find that first paper, if it does exist.

Now, the whole reason for this exercise was that it was useful to have a ballpark figure. For example, if we thought we could find 20K human genes and somebody had a database of 200K human genes, then maybe we were missing out on most of the valuable genes -- and should consider buying into a database. Or, if we thought we could find them on our own, it made a difference in what we might try to negotiate. If we thought 1% of the genes would fall into classical drug target categories, a 4X difference in gene count could really alter how we would structure deals.
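
To make the deal arithmetic concrete, this is the sort of toy calculation that got run -- illustrative numbers only; the real negotiations used whatever figures were on the table at the time:

    # Why a 4X swing in the gene count mattered for deal structure.
    textbook_count = 50_000     # the sort of number the textbooks implied
    claimed_count = 200_000     # the sort of number the biggest claims implied (4X higher)
    druggable_fraction = 0.01   # assumed share of genes in classical target classes
    print(int(textbook_count * druggable_fraction))  # ~500 candidate targets
    print(int(claimed_count * druggable_fraction))   # ~2000 candidate targets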

MLNM wasn't a great trafficker in human gene numbers, but many other companies were -- and generally seemed to one-up each other. If Incyte claimed their data showed 150K genes, then HGS might claim 175K and Hyseq 200K (I don't remember precisely who claimed which, though these three were big traffickers in numbers).

So my colleague tried a new approach, which I think went like this: we have a few percent of the human genome sequenced (albeit mostly around genes of interest and not randomly sampled). How many genes have been found in it? And what would that extrapolate out to for the whole genome?
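
The arithmetic itself is almost embarrassingly simple; a sketch with made-up inputs (I no longer remember the actual figures he used) looks like this:

    # Extrapolate from the sequenced slice of the genome to the whole thing.
    fraction_sequenced = 0.03      # assumed ~3% of the genome in hand
    distinct_genes_found = 800     # assumed count of distinct genes in that sequence
    naive_whole_genome = distinct_genes_found / fraction_sequenced
    print(f"extrapolated total: ~{naive_whole_genome:,.0f} genes")  # ~27,000
    # And since the sequenced regions were picked *because* they held genes of
    # interest, this should, if anything, be an overestimate.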

His conclusion was so shocking I admit I refused to believe it at first, and never quite bought into it. I think it was about 25-30K. How could the textbooks be off by 2X-3X? I could believe the other genomics companies might be optimistic in interpreting their data, but could they really be deluding themselves that much??

But the logic was hard to assail. In order for his estimate to be low by a lot, you would have to posit that the genomic regions sequenced to date were unusually gene-poor and that the rest of the genome was packed with genes -- the opposite of what you'd expect, given that those regions had mostly been sequenced precisely because they contained genes of interest.

Lo and behold, when the genome came in, his estimate was shown to be prescient. The textbook numbers were based on very crude techniques, and couldn't really be traced back to an original source to verify the methods or check the various inputs. But what about all those other companies?

I've never heard any of the high estimators explain themselves, other than the brief bit of "yeah, the genome's out but y'all missed a lot of stuff" which followed the genome announcements. I have some general guesses, however, based on what I saw in our own work. In general, though, it comes down to all the ways you can be fooled looking solely (or primarily) at EST data.

First, there is the contamination/mistracking problem: some of the DNA in your database isn't what it is supposed to be. The easiest to understand is contamination: bits of environmental stuff get into your sequencing libraries. The most common culprit is E. coli, and early on there was a scandalous amount of yeast in some public EST libraries, but all sorts of other stuff will show up. One public library had traces of Lactobacillus in it -- which I joked was due to the technician eating yogurt with one hand while preparing the library with the other. At least once I saw a library contaminated with tobacco sequences. Now, many of these were probably mistracking of samples at a facility which processed many different sorts of DNA -- indeed, there was a strong correlation between the type of junk found in an EST library and which facility had made it -- and the junk usually corresponded to another project at that facility.

But even stranger laboratory-generated weirdness could result. We had one case at MLNM where nearly every gene in a whole library seemed to be fused to a particular human gene. The most likely explanation we came up with was that the common gene had been sequenced as a short PCR product, and somehow samples had been mixed or contamination had been left behind in a well. The strong signal from the PCR product swamped out the EST traces -- until the end of the PCR product was reached & the other signal could finally be seen.

Still other weird artifacts were certainly created during the building of the library -- genomic contamination, ligation of unrelated bits of DNA to create chimaeras, etc.

Deeper still, bits of the genome sometimes get transcribed, or the transcripts spliced, in odd ways. We would find ESTs or EST read pairs (one read from each end of the molecule) which would suggest some strange transcript -- but we could never detect the transcript by RT-PCR. Now, that doesn't prove the transcript doesn't exist, but it does leave open the possibility that the EST was a one-time wonder.

All of these are rare events, but look through enough data and you will see them. So, my best guess for those overestimates was that everything in these companies' databases was fed into a clustering algorithm & every unique cluster was called a gene. Given the perceived value of claiming a bigger database, none of them pushed their informatics groups to get error bounds or provide a conservative estimate.
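
As an illustration -- this is a toy version of the failure mode, not any company's actual pipeline -- naive single-linkage clustering will happily count every orphan artifact as its own "gene":

    # Toy single-linkage clustering of ESTs via union-find.
    def count_clusters(ests, overlaps):
        parent = {e: e for e in ests}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x
        for a, b in overlaps:
            parent[find(a)] = find(b)
        return len({find(e) for e in ests})

    # Ten ESTs from two real genes, plus three junk sequences matching nothing:
    ests = [f"est{i}" for i in range(13)]
    overlaps = [("est0", "est1"), ("est1", "est2"), ("est2", "est3"), ("est3", "est4"),
                ("est5", "est6"), ("est6", "est7"), ("est7", "est8"), ("est8", "est9")]
    print(count_clusters(ests, overlaps))  # 5 "genes": two real, three artifacts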

Of course, once the genome showed up, the evidence was there to rule out a lot of stuff. Even when the genome was quite unfinished, one of my pet projects was to try to clean the junk out of our database. So, once we started trying to align all our human ESTs (which included public ESTs and Incyte's database) to the genome, I started asking: what is the remaining stuff? Some of it could never be figured out, but more than a little mapped to some other genome -- mouse, rat, fly, worm, E. coli, etc. Some stuff mapped to the human genome -- but onto two different chromosomes, or too far apart to make sense. Yes, there could be some interesting stuff there (indeed, someone else did realize this was a way to find interesting stuff), but for our immediate needs we just wanted to toss it.
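
A sketch of the kind of triage I mean, with invented field names and thresholds (the real pipeline was messier and built on actual alignment tools):

    from dataclasses import dataclass

    @dataclass
    class Hit:
        genome: str       # e.g. "human", "mouse", "E. coli"
        chromosome: str
        start: int
        end: int

    def triage_est(hits, max_span=1_000_000):
        """Bucket an EST by where its best alignments land."""
        if not hits:
            return "unexplained -- set aside"
        if all(h.genome != "human" for h in hits):
            return "maps to another genome -- contamination or mistracking"
        human = [h for h in hits if h.genome == "human"]
        if len({h.chromosome for h in human}) > 1:
            return "maps to two chromosomes -- artifact (or something interesting)"
        span = max(h.end for h in human) - min(h.start for h in human)
        if span > max_span:
            return "maps too far apart to make sense -- toss for now"
        return "clean human transcript"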

If anyone from one of the other genomics companies would like to dispute what I've written here, I invite them to do so -- I think it is a fascinating part of history which should be captured before it is all forgotten.

1 comment:

Steven Salzberg said...

I think this estimate was from Wally Gilbert -- in the 1980s he threw out the number 100,000, based on the average size of the few known human genes (about 30,000 base pairs) and the estimated size of the human genome (3 billion base pairs). Bruce Alberts credited Gilbert in his textbook, Essential Cell Biology, and he also says it was such a nice round number that the media quoted it widely.