Omics! Omics!: Craig Venter Reflections: Small Genomes

In the early 1990s, Craig Venter left the NIH over disputes around the patenting of ESTs. Investors backed him, but then William Haseltine basically pushed him out of direct operation of Human Genome Sciences (HGS) and Venter went to start The Institute for Genomic Research, far better known as TIGR. TIGR would attract an amazing array of talent, both wet lab and dry lab - a pattern that Venter had established. I believe it was around then that Venter attracted both Clyde Hutchison and Hamilton Smith to TIGR - two giants of microbiology (Smith had a Nobel Prize!), and they would collaborate with him for the rest of their lives. HGS would have human biology to themselves, but TIGR was free to explore other domains of life. And it would be in the sequencing of microbes that Venter and TIGR would radically shake up a world that, as I reviewed in the previous piece, was debating how many levels of physical maps to build.

Nearly every microbial sequencing project at that time relied on physical maps, in particular cosmid maps. Cosmids are plasmids that can be packaged by lambda phage packaging extract, resulting in a very characteristic insert size of around 45 kb due to the capacity limit of the lambda phage head. So the recipe was to build a minimal spanning set of these and then sequence each one by whichever strategy you thought was best.
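For readers who think in code, picking a minimal spanning set from a cosmid map is essentially the classic interval-cover problem. Below is a minimal sketch, assuming each clone has already been placed on the map as a (name, start, end) interval; the clone names and coordinates are made up for illustration, and real projects also had to demand a sensible overlap between neighbors, which this toy version ignores.

```python
from typing import List, Tuple

# (name, start, end) in map coordinates (kb here); all values are hypothetical
Clone = Tuple[str, int, int]

def minimal_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    """Greedy interval cover: at each step, among clones starting at or before
    the position covered so far, take the one reaching furthest to the right."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[Clone] = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Scan every not-yet-examined clone that starts within the covered region
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone map at position {covered}")
        path.append(best)
        covered = best[2]
    return path

# Five made-up ~45 kb cosmids over a 120 kb region
example_clones = [
    ("c1", 0, 45), ("c2", 10, 55), ("c3", 40, 85), ("c4", 50, 95), ("c5", 80, 125),
]
tiling = minimal_tiling_path(example_clones, 0, 120)
print([name for name, _, _ in tiling])
```

On the toy map the greedy pass keeps three of the five cosmids; a hole in the map shows up as a ValueError rather than a silently incomplete path.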

One component of the official US Human Genome effort was to sequence the biotechnological workhorse bacterium Escherichia coli. This had been entrusted to E.coli veteran Fred Blattner at University of Wisconsin, who built a quality team and was working his way in both directions from the origin at 0 minutes to the terminus at 50 minutes. For those not familiar with E.coli lore, a now mostly forgotten method to map genes was via transfer of the entire chromosome using the F episome - the factor which creates a mating pilus in E.coli. F pili will transfer chromosomes to any cell if you try hard enough - there are literature reports of transfers to yeast or to HeLa. But importantly, it takes 100 minutes to transfer an entire E.coli K12 chromosome - if you interrupt the physical pilus structure early (say, by vortexing), then only a portion will have transferred. If you always start at the same place, then the timing of transfer is essentially a linear measure of distance on the chromosome.
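The "minutes as distance" idea can be put into a few lines. This is a rough sketch, assuming a constant transfer rate around the whole circle (real transfer is not perfectly uniform) and using the 4,639,221 bp length from the later published K-12 sequence, a number that obviously wasn't available when the map was drawn. The function names are my own.

```python
K12_GENOME_BP = 4_639_221   # length of the published K-12 MG1655 sequence
MAP_MINUTES = 100.0         # full transfer time that defines the genetic map

def minutes_to_bp(entry_minute: float, genome_bp: int = K12_GENOME_BP) -> int:
    """Approximate chromosome coordinate for a time-of-entry, assuming
    a constant transfer rate (an idealization)."""
    return round((entry_minute % MAP_MINUTES) / MAP_MINUTES * genome_bp)

def map_distance_minutes(a: float, b: float) -> float:
    """Shortest distance between two map positions on the circular chromosome."""
    d = abs(a - b) % MAP_MINUTES
    return min(d, MAP_MINUTES - d)

print(minutes_to_bp(1))              # bp covered by one map minute
print(map_distance_minutes(95, 5))   # wraps around the 100/0 boundary
```

Under these assumptions one map minute works out to roughly 46 kb, which happens to be about one cosmid insert.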

Blattner had decided to be very conservative on technology - manual radioactive gels - and a bit innovative in workforce - the reactions and gels would be an undergraduate teaching exercise. As they chunked away, Blattner's group would deposit the sequences so the E.coli community could immediately take advantage. At the same time, Kenn Rudd at the NCBI had applied a number of tools to map nearly all existing E.coli sequences to existing restriction maps, and so could even spot cases where two Genbank deposits actually overlapped, as well as likely short gaps between deposits. My own contribution to the E.coli genome was to recruit a tech in George Church's lab to sequence across these using primers I designed - we closed a few small gaps, some of which turned out to be just single digit overlaps.

Blattner's group cleaned up the annotations along the way and checked everything previously deposited against their new data. In at least one case, the old data made no sense whatsoever - until somebody thought to check if it was a strict reverse (not the reverse complement) of the correct sequence. Yup - a sign that the original data was generated by Maxam-Gilbert sequencing, which can yield reads in the 3'-->5' direction, and that is easy to get confused about.
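The sanity check described above is easy to automate today. Here is a small sketch that tries a query in all four orientations against a reference. Exact substring matching is a simplifying assumption (a real comparison would use an aligner that tolerates sequencing errors), and the sequences are invented.

```python
# Translation table for complementing DNA, preserving case
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def orientations(seq: str) -> dict:
    """The four ways a deposited sequence might relate to the true strand."""
    comp = seq.translate(_COMPLEMENT)
    return {
        "forward": seq,
        "reverse": seq[::-1],              # plain reversal, the odd one out
        "complement": comp,
        "reverse_complement": comp[::-1],  # the usual 'other strand'
    }

def find_orientation(query: str, reference: str) -> list:
    """Names of the orientations of `query` that occur exactly in `reference`."""
    return [name for name, s in orientations(query).items() if s in reference]

reference = "ATGGCGTACCTGAAGTTC"   # made-up reference
query = reference[3:12][::-1]      # simulate a deposit written 3'->5'
print(find_orientation(query, reference))
```

Most tools only ever try forward and reverse complement, which is exactly why a plain reversal can look like nonsense until somebody thinks to test for it.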

Not everyone loved Blattner's methodical approach. James Watson, while still head of the US genome project, had an automated fluorescent sequencer shipped to Wisconsin without anyone there having asked for it.

There were other physical map based methods being used. I was part of a collaboration with a biotech company, first called Collaborative Research then Genome Therapeutics Corporation, as well as Institut Pasteur, to sequence the Mycobacteria which cause leprosy and tuberculosis. Institut Pasteur had generated the cosmid maps, CR/GT generated the data, a Harvard Medical School colleague did the assembly, and my pipelines attempted to make sense of it.

A notable exception to the clone based methods was Walter Gilbert's effort at Harvard, which was using the "genome sequencing" method he had developed with George Church. This was pursuing a ~800kb mycoplasma genome. In this method, one divided the source DNA into multiple samples, digested each sample with a different restriction enzyme, and then subjected each digest to Maxam-Gilbert reactions - but critically, with no labeling involved. These would be electrophoresed and blotted to membranes. If you designed a probe near one of the restriction sites used, then hybridizing that probe to the blot of that digest would light up the corresponding sequencing ladder.

The one catch with this approach is that you need some sequence to start with - and ideally lots of it, as many start points scattered throughout the genome, since each cycle of oligo synthesis, probing and sequence reading would take many days. Plus, once you got going, sequence islands would run into each other, reducing the number of sequencing fronts. So one question with this approach was the right number of random starting points, which were generated by shotgun sequencing on a fluorescent Sanger sequencer. I do remember lead bioinformatician Steve Smith once musing that maybe it would be best to just try to get complete coverage and use the genome sequencing approach for closing gaps.

Oh, and there was what awaited me when I first rotated in Church's lab in the summer of 1992. He had declared in his original multiplex sequencing paper the intention of sequencing both E.coli and Salmonella, and there was a pile of shotgun reads. I wish I knew the coverage - I think it was only a few fold, due to Church deciding that his informatics toolkit wasn't up to the challenge and needed work. So the idea of shotgun sequencing was around, but few were willing to stake their reputations on it.
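To put "only a few fold" in perspective, here is a back-of-the-envelope sketch using the standard Poisson (Lander-Waterman style) idealization of shotgun sequencing: reads land uniformly at random, and any overlap at all is enough to join two reads. Neither assumption holds in practice, and the read count, read length and genome size below are hypothetical round numbers rather than figures from any project mentioned here.

```python
import math

def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Fold coverage c = N * L / G."""
    return n_reads * read_len / genome_len

def expected_uncovered_fraction(c: float) -> float:
    """Poisson approximation: chance a given base is hit by no read."""
    return math.exp(-c)

def expected_gaps(n_reads: int, c: float) -> float:
    """Idealized expected number of gaps, N * exp(-c), ignoring the
    minimum overlap needed to actually join two reads."""
    return n_reads * math.exp(-c)

# Hypothetical numbers: 1.8 Mb genome, 450 bp reads
genome_len, read_len = 1_800_000, 450
for n_reads in (12_000, 24_000, 40_000):
    c = coverage(n_reads, read_len, genome_len)
    print(f"{c:4.1f}x  uncovered={expected_uncovered_fraction(c):.4%}  "
          f"gaps~{expected_gaps(n_reads, c):.0f}")
```

In this idealization about 5% of the genome is still untouched at 3x, and the expected gap count keeps falling steeply as coverage climbs, so a few-fold pile of reads sits well short of a finished genome.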

Venter saw things differently, and importantly (as I recall) just did it and then dropped the bombshell in 1995 - he had simply shotgun sequenced the bacterium Haemophilus influenzae. Then he put a bold underscore by sequencing Mycoplasma genitalium in only a month of data generation. Boom boom!

The quick publication of both in Science was followed by others trying to horn in on his success - some wet-behind-the-ears graduate student wrote up a suggestion that their annotation had missed some genes or perhaps they had some sequence errors. But critically, we were in a new era of how to sequence microbes.

But, no such change in science ever occurs instantly. Once projects are running one way, it can be hard to change. Genome Therapeutics kept plugging through Mycobacterium cosmids as that was the project mandate, but switched over to shotgun sequencing for the first privately sequenced genome, Helicobacter pylori. Both shotgun and physical map methods would persist.

I hadn't thought of this until writing the EST piece on Venter, but now I wonder if it ever crossed the minds of the Genethon EST effort - the one that turned out to be badly contaminated with yeast DNA due to using it as a carrier - to switch to being a yeast whole genome shotgun effort. That definitely never happened, but did they turn it over in their minds?

I had my good friend Claude pull information on how different genomes in the next 10 years were sequenced. Even five years later in 2000, two clone-based assemblies were deposited, with two more in 2001, three in 2002, and four in 2003. And the issues with these weren't always trivial. In 2021 I was comparing an internally generated Aspergillus nidulans shotgun sequence with the public reference for what was believed to be an ancestor of our strain, and there was a sizable deletion in our strain. Except it wasn't - it was an E.coli transposon in the reference sequence!

TIGR definitely punched above their weight from 1995-2005, creating 33 of the 188 assemblies deposited - not quite a fifth.

We now live in a world where shotgun assembly is the norm, powered by long reads. That era didn't really arrive until 2014, with PacBio getting reads in the 8 kilobase range. I know - at Warp Drive Bio in 2012 there was one polyketide cluster I couldn't solve until we had it as two overlapping cosmids, which happened to each be solvable by Illumina sequencing alone. With improved PacBio (I'd love to go back and see what HiFi can do) and then the entrance of Oxford Nanopore long reads, there are only a handful of microbial troublespots left. Sequence Saccharomyces with a half-decent ONT or PacBio library and you'll close every chromosome - except the ribosomal RNA array within chromosome 12.

Of course, as will be covered in the next piece, microbes weren't the limit of Venter's sequencing ambition. As much as 1995 had a bombshell with Venter's name on it, the biggest burst was yet to come.

Wednesday, June 10, 2026
