When the human genome was still terra incognita (or at least, our knowledge of the sequence was something like my view of the world sans my glasses, which are oft mistaken for bulletproof glass), a key question was how many genes were present. Textbooks widely cited a number somewhere in the 50K-70K range, or perhaps even 100K, and some of the gene database companies such as Incyte, HGS and Hyseq were gleefully proclaiming the number to be much higher (just think what you are missing without our product!). The number wasn't unimportant. If you had some other estimate of what fraction of genes might be good targets for drug development, then the total number of drug targets depended on your estimate of the number of genes -- and drug targets were saleable -- and patentable.
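To make that dependence concrete, here is a back-of-the-envelope sketch. The 5% "good target" fraction is purely a made-up placeholder of mine, not a figure anyone quoted; the gene counts are the round numbers that come up in this post.

```python
# Hypothetical fraction of genes that make good drug targets.
# This value is an assumption for illustration only.
DRUGGABLE_FRACTION = 0.05

# Round gene-count estimates mentioned in this post.
GENE_ESTIMATES = {
    "textbook high end": 100_000,
    "textbook low end": 50_000,
    "Millennium estimate": 30_000,
    "Broad paper": 20_000,
}


def estimated_targets(gene_count, fraction=DRUGGABLE_FRACTION):
    """Return the number of drug targets implied by a gene count."""
    return round(gene_count * fraction)


for label, count in GENE_ESTIMATES.items():
    targets = estimated_targets(count)
    print(f"{label:>20}: {count:>7,} genes -> {targets:>5,} targets")
```

Whatever fraction you plug in, the target count scales linearly with the gene count, so going from 100K genes to 20K shrinks the pool of presumed targets fivefold.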
At some point, a clever chap at Millennium decided to try to pin down these estimates. First he went for the textbook numbers, which everyone thought were well reasoned from old DNA melting curve experiments estimating the amount of non-repetitive DNA. Surprisingly, he was unable to find any solid calculation converting one to the other -- for all his searching, it appeared that the human gene estimate had appeared spontaneously like a quantum particle.
Using some other lines of thinking (I actually have a copy of his neat document somewhere, though technically it is a Millennium secret -- nothing just ages out of confidentiality. Silly, isn't it?) he argued for a new estimate from the gene content of yeast and from what had been found in C. elegans. Now, I couldn't find the flaw in his logic, but I couldn't quite get myself to accept the estimate. It was preposterous! Only 30K genes for human?
Well, of course the estimate came in even further south of there. And a new paper from the Broad has nearly nipped that down to 20K even. Alas, the spectacularly endowed Broad wasn't munificent enough to publish with the Open Access option for PNAS, so until I make another pilgrimage to the MIT Library I'm stuck skimming the abstract, supporting materials & GenomeWeb writeup.
In some sense, the analysis is inevitable. It's hard to look at one genome and get an accurate gene estimate, but with so many mammalian genomes it gets easier -- and this paper apparently focused on primate genomes, which we have an amazing number of already. It sounds like they focused on ORFs found in human mRNA data, which at least removes the exon prediction problem.
The paper has the usual caveats. The genome is finished -- but not so finished. Bits and pieces are still getting polished up, and while they are generally dull and monotonous, a gene or two might still hide there (the GenomeWeb bit mentions that 197 genes found since the 'completion' of the genome were omitted). The definition of gene is always tricky, generally going along the lines of Humpty Dumpty in Through the Looking-Glass: 'When I use a word...it means just what I choose it to mean -- neither more nor less.' Gene here means protein-coding gene, to the exclusion of the RNA-only genes of seemingly endless flavor that pepper the genome.
The other class of caveat is very short ORFs -- and some very short ORFs do interesting things. For example, many neurotransmitters are synthesized from short ORFs -- and tend to evolve quickly, making it challenging to find them (I know, I tried in my past life).
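Here is a toy sketch of why short ORFs slip through the net. It is not the paper's method (which I haven't read in full), just a naive forward-strand ORF scan with a minimum-length cutoff; the function name, the cutoff values and the toy sequence are all mine.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons):
    """Return (start, end, n_codons) for each ATG-to-stop ORF on the forward
    strand with at least min_codons codons (stop codon not counted)."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None
    return orfs


# Toy transcript: a 6-codon ORF followed by a 41-codon ORF.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G" + "ATG" + "GCT" * 40 + "TGA"

print(find_orfs(toy, min_codons=30))  # strict cutoff
print(find_orfs(toy, min_codons=1))   # permissive cutoff
```

With the strict cutoff only the longer ORF is reported; loosen it and the 6-codon one shows up too, along with (on real sequence) a flood of chance ORFs. That tradeoff is the whole problem.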
Will this gene accounting ever end? The number will probably keep twiddling back and forth, but not by huge leaps barring some entirely new class of translational mechanism.
Speaking of genes & accounting, one of the little gags in Mr. Magorium's Wonder Emporium, a bit of movie fluff that is neither harmful nor wonderful, is a word derivation. The title character hires an accountant to assay his monetary worth, and promptly dissects the job title: clearly an accountant is a counting mutant. I find mutants more interesting than accountants, but both have their place -- and I never before realized that one was a subset of the other!