Tuesday, June 09, 2026

Craig Venter's Genomics Revolution: Setting the Stage

I am overdue in getting back to my very personal review of how the recently passed J. Craig Venter made a huge impact on genomics science.  But especially after being at London Calling, I worry that many in genomics today just won't have the context to grasp how unusual his approaches were.  I met some amazing graduate students there - and most likely they were born after the joint announcement by Venter and Francis Collins of the initial completion of the human genome.  So at the risk of writing a "back in my day we walked uphill to school both ways" piece, I wish to sketch a debate that the genomics world was having in the early days of the Human Genome Project.

The most memorable student question I've ever heard was in Spring 1993 (or was it 1994?), when I was a teaching fellow for the introductory course in genetics taught by Matt Meselson and the late Bill Gelbart.  One of these two - wish I could remember which - had just covered one of my favorite experiments of early molecular biology: Seymour Benzer's elegant genetic mapping of the r locus in phage.  What was so amazing was that Benzer could calibrate his recombinational mapping scheme and show that it should be very capable of detecting recombination events that were well beyond what he was seeing - he was detecting a hard limit in the spacing of recombination.  In other words, he was performing genetic mapping at nucleotide resolution.  After walking through this, a hand shot up and was acknowledged.  "Why didn't he just sequence it?"

And that is the challenge of understanding historical events with modern understanding.  In a similar vein, we of the digital age are likely to have difficulty grasping how anyone doubted "only" 4 bases of DNA could encode the basis of life.  Of course you could encode almost anything given enough nucleotides - but that sort of thinking in binary or similar codes was just forming around the time of Avery's experiments.

When I first attended the GSAC meeting at Hilton Head in fall of 1992, there was a ferocious debate about how to organize the sequencing of the genome.  It was clear to essentially everyone that you needed a physical map of clones spanning the genome, and then those clones would be sequenced, but how?

Sequencing - the tide was strongly towards Sanger dideoxy chemistry though a few labs were still pursuing Maxam-Gilbert chemical degradation methods - required running many gels.  And generating the data for those gels in most cases required material cloned in pUC or M13 vectors and then grown up.  And that was all very expensive.  Or you could walk along a longer clone using primers designed from the last sequencing read - but that meant making lots of primers as well as tracking which primers should be used on which templates.

Something that might strike a current sequencer as absurd is that due to these cost considerations, but also desiring accuracy, the preferred coverage was three, or more precisely two plus one - every base should be covered by two reads on one strand and one on the other.  The different strands rule accounted for certain artifacts, such as electrophoretic compressions, that often had a strand bias.  As an aside, the next time somebody claims Sanger is the gold standard for sequence, ask them about compressions.  If anybody wants to know the true issue with compressions, I can point to a structural biology paper that has several sentences trying to explain away a database sequence that doesn't fit their hypothesis - because that sequence is wrong due almost certainly to compressions.

So how to achieve this?  At the one extreme, one could shotgun sequence a larger clone to around 8X mean coverage, and then use custom primers to cover everywhere that didn't meet the 2+1 rule.  But another approach would be to make a physical map of sequencing vector clones and pick a minimal spanning set.  I privately derided this as the "map into the ground" approach, and sometimes the practitioners proposed many layers of it - YACs to BACs and BACs to cosmids and cosmids to lambda and lambda to M13 - or something like that.  Perhaps in my derisive recollection the folly was multiplied.  As a note, I think this is the first place where in vitro transposition entered sequencing space, as a proposed method to generate primer pads within larger clones - though I think the proponents were of the maximal mapping crowd.
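
For readers who think better in code, here is a minimal sketch of the shotgun end of that spectrum: scatter random reads over a clone, then list the stretches that fail the 2+1 rule and would have needed custom primers.  Everything numeric is an assumption for illustration and not taken from any real project - a 40 kb clone, uniform 500 base reads, random strand choice - and the function names are made up for this post.

```python
import random


def simulate_shotgun(clone_len, read_len, mean_coverage, seed=0):
    """Place reads uniformly at random; returns (start, end, strand) tuples.

    Uniform placement and fixed read length are simplifying assumptions.
    """
    rng = random.Random(seed)
    n_reads = (clone_len * mean_coverage) // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, clone_len - read_len + 1)
        strand = rng.choice("+-")
        reads.append((start, start + read_len, strand))
    return reads


def strand_depths(reads, clone_len):
    """Per-base read depth, tracked separately for each strand."""
    fwd = [0] * clone_len
    rev = [0] * clone_len
    for start, end, strand in reads:
        depth = fwd if strand == "+" else rev
        for pos in range(start, end):
            depth[pos] += 1
    return fwd, rev


def meets_two_plus_one(f, r):
    """2+1 rule: two or more reads on one strand, one or more on the other."""
    return (f >= 2 and r >= 1) or (r >= 2 and f >= 1)


def gaps_needing_primers(reads, clone_len):
    """Half-open (start, end) intervals where the 2+1 rule is not met."""
    fwd, rev = strand_depths(reads, clone_len)
    gaps = []
    gap_start = None
    for pos in range(clone_len):
        ok = meets_two_plus_one(fwd[pos], rev[pos])
        if not ok and gap_start is None:
            gap_start = pos
        elif ok and gap_start is not None:
            gaps.append((gap_start, pos))
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, clone_len))
    return gaps


if __name__ == "__main__":
    # Hypothetical cosmid-sized clone at 8X mean coverage.
    reads = simulate_shotgun(clone_len=40_000, read_len=500, mean_coverage=8)
    gaps = gaps_needing_primers(reads, 40_000)
    print(len(reads), "reads;", len(gaps), "regions still need directed reads")
```

Changing the seed or the coverage gives a feel for how much directed finishing work random coverage leaves behind, which was exactly the cost being argued over.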

Mapping had its own pitfalls.  Notably, early on the plan was to make a "sequence ready" map from Yeast Artificial Chromosomes (YACs), and a physical map of the genome was built.  YACs had the advantage of often being many hundreds of kilobases long - the longer the clone, the fewer required for a minimal spanning set.  But early sequencing of YACs revealed they often rearranged during propagation.  So while the YAC map could be useful for anchoring another clone map, it was completely unsuitable for sequencing.  So new maps of BACs and PACs (Bacterial and P1 (phage) Artificial Chromosomes, respectively) were built that would be suitable sequencing targets.  And great care was taken to ensure these replicated as single copy plasmids, in order to minimize the risk of rearranging.
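
The "minimal spanning set" idea can be sketched as a classic greedy interval cover.  This is only an illustration under strong assumptions: it takes clone positions on the map as exactly known (start, end) coordinates, which is precisely what a real physical map had to work hard to establish, and it ignores the minimum overlap between neighbors that a real tiling path would demand.  The function name and the coordinates are invented for the example.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy pick of the fewest clones covering [region_start, region_end).

    clones: iterable of (start, end) map coordinates (assumed exact).
    Raises ValueError if the clones leave a gap in the region.
    """
    clones = sorted(clones)
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Among clones starting at or before the covered edge,
        # keep the one that reaches farthest to the right.
        while i < len(clones) and clones[i][0] <= covered:
            if best is None or clones[i][1] > best[1]:
                best = clones[i]
            i += 1
        if best is None or best[1] <= covered:
            raise ValueError(f"no clone covers position {covered}")
        path.append(best)
        covered = best[1]
    return path


if __name__ == "__main__":
    # Hypothetical clone coordinates in kilobases.
    clones = [(0, 100), (50, 180), (90, 200), (150, 300), (190, 320)]
    print(minimal_tiling_path(clones, 0, 300))
```

Feeding it longer clones shrinks the path, which is the whole appeal of YAC-sized inserts - provided the clones are faithful copies of the genome.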

So that was the world that Craig Venter would plunge into - one that for each project was trying to figure out how many levels of physical clone hierarchy there should be.  How many different types of minimal spanning maps should the community build per organism?  We shall see what Venter's solution to the problem was.


Postscript

I should mention that I covered some of this ground about a decade ago, in the format of a series of counterfactuals in which Venter features regularly.

















